Questions about Open Reference OTU Picking

Sara_Jeanne08 · January 30, 2018, 4:13am

Hi Nicholas,

Thank you very much for providing me with the correction - I appreciate it. I was able to import the taxonomy data.

I am working through the rest of what I need to do in order to re-analyze my data with QIIME 2. In QIIME 1, I ran open-reference OTU picking. I was able to run this successfully in QIIME 2 with VSearch, but in QIIME 1 the single pick_open_reference_otus.py script wrapped many of the steps and tools: clustering, alignment, classification, and generating a phylogenetic tree. I am trying to repeat what I did in QIIME 1, but I need a little help making sure I do everything in the correct order and with the proper output files from each preceding step.

Open Reference OTU picking with VSearch:

`(qiime2-2017.12) bash-3.2$ qiime vsearch cluster-features-open-reference --i-sequences ./20180127_pilot_rep-seqs-denovo_and_ref_nonchimeric.qza --i-table pilot_chimera_filtered_table.qza --i-reference-sequences ../QIIME2/SILVA_16S/99_otus.qza --p-perc-identity 0.99 --p-threads 6 --o-clustered-table ./20180129_pilot_VSearch_OpenRef_99per_Clustered_Table.qza --o-clustered-sequences 20180129_pilot_VSearch_OpenRef_99per_Clustered_Seqs.qza --o-new-reference-sequences 20180129_pilot_VSearch_OpenRef_99per_New_Ref_Seqs.qza --verbose`

Output:
Saved FeatureTable[Frequency] to: ./20180129_pilot_VSearch_OpenRef_99per_Clustered_Table.qza
Saved FeatureData[Sequence] to: 20180129_pilot_VSearch_OpenRef_99per_Clustered_Seqs.qza
Saved FeatureData[Sequence] to: 20180129_pilot_VSearch_OpenRef_99per_New_Ref_Seqs.qza

I am not quite certain what the difference is between the two output sequence artifact files. Could you possibly clarify? Which one is to be used for the next step in the analysis process - the clustered seqs or the new reference seqs?

In QIIME 1 open-reference OTU picking, classification of the clustered OTUs was the next step, but in the "Moving Pictures Tutorial" it is listed much later, after the diversity analyses. Wouldn't I want to repeat the next steps of analysis for open-reference OTU picking in the following order:

1) Classify Clustered Seqs / OTUs - Includes training the classifier with the Silva reference sequences and taxonomy.
2) Alignment of sequences - MAFFT
3) Generate phylogenetic tree with aligned sequences

3.5) (I normally filtered my biom file output from the pick_open_reference_otus.py script based on OTUs that were taxonomically unassigned, or I would remove other OTUs or samples based on my review, but I am still figuring out when to do these types of filtering with QIIME 2.)

4) Complete Alpha and Beta Diversity Analyses

Thank you again for your time and help with this. I could not find a full Open-Reference OTU Picking tutorial / pipeline suggestion. I am trying to replicate most of what I did in QIIME 1.

Thank you very much,

Sara

Nicholas_Bokulich · January 30, 2018, 2:52pm

Hi @Sara_Jeanne08,

You will want to use the clustered seqs and clustered table. A description of the "new ref seqs" is given in the help documentation for that command (and in the plugin description).

The order of taxonomy classification does not really matter because in both the qiime1 and qiime2 tutorials these are independent downstream analyses and the taxonomy/diversity data are always used separately (this is not always the case, e.g., if you want to perform diversity analyses on a feature table collapsed by taxonomy, but that's a niche case and probably not what you are trying to do.)

So the order you list there is fine — but the order listed in the moving pictures tutorial is fine, too, because the taxonomic information is not used in the diversity analyses (and vice versa).
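As a rough sketch, steps 1 to 3 from your list could be chained like this. The function name `classify_align_tree` and all file names are placeholders I made up for illustration, and the options shown are the ones I would expect in releases from around this time, so please check each command's `--help` in your own install.

```shell
# Sketch of steps 1-3 (classify, align, build tree); not an official pipeline.
# Arguments: $1 clustered seqs, $2 reference reads, $3 reference taxonomy,
#            $4 output directory (created if missing). All names are placeholders.
classify_align_tree() {
  seqs=$1; ref_reads=$2; ref_tax=$3; outdir=$4
  mkdir -p "$outdir" || return 1

  # 1) Train a naive Bayes classifier on the reference, then classify the OTUs.
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$ref_reads" \
    --i-reference-taxonomy "$ref_tax" \
    --o-classifier "$outdir/classifier.qza" || return 1
  qiime feature-classifier classify-sklearn \
    --i-classifier "$outdir/classifier.qza" \
    --i-reads "$seqs" \
    --o-classification "$outdir/taxonomy.qza" || return 1

  # 2) Align with MAFFT and mask highly variable alignment positions.
  qiime alignment mafft \
    --i-sequences "$seqs" \
    --o-alignment "$outdir/aligned.qza" || return 1
  qiime alignment mask \
    --i-alignment "$outdir/aligned.qza" \
    --o-masked-alignment "$outdir/masked.qza" || return 1

  # 3) Build a tree with FastTree and root it at the midpoint.
  qiime phylogeny fasttree \
    --i-alignment "$outdir/masked.qza" \
    --o-tree "$outdir/unrooted-tree.qza" || return 1
  qiime phylogeny midpoint-root \
    --i-tree "$outdir/unrooted-tree.qza" \
    --o-rooted-tree "$outdir/rooted-tree.qza"
}
```

It would be called as, e.g., `classify_align_tree clustered-seqs.qza 99_otus.qza silva-taxonomy.qza downstream`, where `silva-taxonomy.qza` stands in for whatever you named your imported taxonomy. Because classification and the tree are independent of each other, the two halves can be run in either order.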

You do bring up a good point, though, with your comment about filtering (your step 3.5):

The fewer sequences you have for classification and alignment, the faster these steps will be. So you can filter at each step of the way. You could add the following steps (numbered to fit in between steps in your list):

0.0) summarize your feature table and generate alpha rarefaction curves to check out how many reads you have per sample, and make sure you have reasonably good coverage in these samples/determine a good threshold for filtering out low-abundance samples.

0.25) remove samples with fewer reads than the cutoff.

0.5) filter OTUs based on abundance (low abundance OTUs are often erroneous, so whenever using OTU picking methods I would advise a small abundance filter, e.g., minimum 10 reads to be retained). Remember to use filter-seqs to remove sequences from your representative sequences that are no longer present in the feature table.

0.75) Remove chimeric sequences if you haven't already (another step I'd always recommend with OTU picking). Don't forget to use filter-features and filter-seqs again (those steps are in the linked tutorial too but just saying for completeness here)

1.5) Now that you have assigned taxonomy to your sequences, you could remove sequences that were unassigned if you like. Or other sequences that you don't want in there, e.g., if you have sequences that hit chloroplast or mitochondrial sequences you probably want to remove those before additional steps. Don't forget to use filter-features to also remove these from the feature table (using the reference sequences as a metadata input to only retain sequences present in that file).
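To make the filtering steps above a bit more concrete, here is a minimal sketch of steps 0.0, 0.25, 0.5, and 1.5 as small shell wrappers. The function names, thresholds, and file names are all hypothetical. I am also assuming a release in which `filter-seqs` accepts `--i-table`; if yours does not, check `qiime feature-table filter-seqs --help` for the metadata-based alternative.

```shell
# Sketch of the filtering steps; wrapper names and arguments are placeholders.

# 0.0) Alpha rarefaction curves up to depth $2, to help choose a sample cutoff.
rarefaction_curves() {
  qiime diversity alpha-rarefaction \
    --i-table "$1" --p-max-depth "$2" --o-visualization "$3"
}

# 0.25) Drop samples with fewer than $2 reads.
filter_shallow_samples() {
  qiime feature-table filter-samples \
    --i-table "$1" --p-min-frequency "$2" --o-filtered-table "$3"
}

# 0.5) Drop OTUs with fewer than $2 total reads (e.g. 10).
filter_rare_otus() {
  qiime feature-table filter-features \
    --i-table "$1" --p-min-frequency "$2" --o-filtered-table "$3"
}

# Keep only representative sequences whose IDs are still in the table.
# Assumption: this release of filter-seqs supports --i-table.
sync_rep_seqs() {
  qiime feature-table filter-seqs \
    --i-data "$1" --i-table "$2" --o-filtered-data "$3"
}

# 1.5) Drop features whose taxonomy matches any comma-separated term in $3.
drop_taxa_from_table() {
  qiime taxa filter-table \
    --i-table "$1" --i-taxonomy "$2" --p-exclude "$3" --o-filtered-table "$4"
}
```

For example, `filter_rare_otus table.qza 10 table-min10.qza` followed by `sync_rep_seqs rep-seqs.qza table-min10.qza rep-seqs-min10.qza` covers step 0.5, and `drop_taxa_from_table table-min10.qza taxonomy.qza mitochondria,chloroplast table-clean.qza` covers the organelle filtering in step 1.5.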

Really confusing, right? The issue here is that there are so many different ways to slice and dice one's data, and the order often does or does not matter, depending on user preference and the in(ter)dependence of different arms of analysis (e.g., taxonomy vs. diversity analyses). There is often not a "right" way, either.

Above all, though, your questions are really helpful in guiding us as we work to improve the documentation. 2017 was focused on building up essential core features in QIIME2, and 2018 will bring improvements to the documentation to clarify many of these steps and many of the cool features hidden away in QIIME2. It's a work in progress.

I hope that helps!

Sara_Jeanne08 · January 30, 2018, 4:11pm

Hi Nicholas,

Thank you for your time and help with my questions.

I read the information regarding the new reference seqs output by the open-reference OTU picking, but I guess I am not fully understanding how they are used. Are they meant to be used in place of the SILVA reference database sequences if you want to re-analyze the data or run a second round of clustering?

Thank you for helping me sort through my proposed data analysis pipeline - I appreciate that you are working on the documentation. I have been having to look in many places to find which commands and tools I need to use for my analysis. Is there a place that goes over general qiime 2 commands for viewing your summarized data instead of going through the web portal? Like the command - biom summarize-table in QIIME 1.

Also regarding taxonomy filtering-

I have often performed a lot of work on my data set based on taxonomy levels, often viewing sample differences at the family, genus, or species level. Is there a way to generate plots and summary biom tables per level of taxonomy in QIIME 2 (i.e., like the summarize_taxa_through_plots.py script in QIIME 1)?

Thank you again for all your time and help with this. It is greatly appreciated.

Sara

Nicholas_Bokulich · January 30, 2018, 4:30pm

Hi @Sara_Jeanne08,

The idea is that if you want to compare a future sequencing run (let's call it dataset 2) to this dataset (dataset 1) without going back to re-analyze everything, you would use the "new ref seqs" as the reference sequences for that new dataset. Then the same exact OTU clusters (with the same IDs) would be generated for that dataset, so that it could be merged with and directly compared to dataset 1.
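A hedged sketch of what that could look like for dataset 2, reusing the same options as your original command. The function name `cluster_against_previous_run` and the output naming scheme are placeholders of mine, and I am assuming you keep the same percent identity as for dataset 1.

```shell
# Sketch: cluster a later run against the "new ref seqs" from dataset 1.
# Arguments: $1 dataset-2 sequences, $2 dataset-2 table,
#            $3 new reference sequences from dataset 1, $4 output file prefix.
cluster_against_previous_run() {
  qiime vsearch cluster-features-open-reference \
    --i-sequences "$1" \
    --i-table "$2" \
    --i-reference-sequences "$3" \
    --p-perc-identity 0.99 \
    --o-clustered-table "$4_clustered_table.qza" \
    --o-clustered-sequences "$4_clustered_seqs.qza" \
    --o-new-reference-sequences "$4_new_ref_seqs.qza"
}
```

The only change from the dataset 1 command is that `--i-reference-sequences` points at the new reference sequences rather than the SILVA file. The two clustered tables could then be combined with `qiime feature-table merge`; its input option names have differed between releases, so check `--help` before using it.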

QIIME2 has a number of commands for summarizing data, similar to the biom summarize-table command. E.g.,

Feature Tables: feature-table summarize
Sequences: feature-table tabulate-seqs
Metadata: metadata tabulate

Is that what you are looking for?
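For the first two, a minimal sketch might look like this (the wrapper names and file names are placeholders, not part of QIIME 2):

```shell
# Sketch: summarize a feature table and tabulate representative sequences.
# $1 is the input artifact (.qza), $2 the visualization to write (.qzv).
summarize_table() {
  qiime feature-table summarize --i-table "$1" --o-visualization "$2"
}

tabulate_seqs() {
  qiime feature-table tabulate-seqs --i-data "$1" --o-visualization "$2"
}
```

For example, `summarize_table clustered-table.qza clustered-table.qzv` writes a visualization with per-sample and per-feature counts for the clustered table.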

As for summarize_taxa_through_plots.py: taxa barplot effectively does the same thing but with fewer moving parts! The moving pictures tutorial includes this to give you an idea of how to go from feature table to barplot. The QZV is interactive, so you can select how samples are labeled by metadata and sorted, the level of taxonomy you wish to view, how bars are colored, etc.
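A minimal sketch of the barplot call, with a made-up wrapper name and placeholder file names:

```shell
# Sketch: interactive taxonomy barplot.
# Arguments: $1 feature table, $2 taxonomy, $3 sample metadata file, $4 output .qzv
taxa_barplot() {
  qiime taxa barplot \
    --i-table "$1" \
    --i-taxonomy "$2" \
    --m-metadata-file "$3" \
    --o-visualization "$4"
}
```

For example, `taxa_barplot table.qza taxonomy.qza map.txt taxa-barplot.qzv`, where `map.txt` stands in for your mapping file.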

If you want to generate feature tables at different taxonomic levels for further analysis (e.g., ANCOM to test for significant differences in taxon abundance), you will need to use taxa collapse separately for each level.
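Running it once per level can be wrapped in a small loop. This is only a sketch: `collapse_levels` and the output naming are placeholders, and which number corresponds to which rank depends on your taxonomy (I am assuming a seven-level taxonomy in which 5, 6, and 7 are family, genus, and species).

```shell
# Sketch: collapse a feature table at several taxonomic levels.
# Arguments: $1 feature table, $2 taxonomy, $3 output prefix, then one or more levels.
collapse_levels() {
  table=$1; taxonomy=$2; prefix=$3
  shift 3
  for level in "$@"; do
    qiime taxa collapse \
      --i-table "$table" \
      --i-taxonomy "$taxonomy" \
      --p-level "$level" \
      --o-collapsed-table "${prefix}-L${level}.qza" || return 1
  done
}
```

For example, `collapse_levels table.qza taxonomy.qza collapsed 5 6 7` would write `collapsed-L5.qza`, `collapsed-L6.qza`, and `collapsed-L7.qza`.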

I hope that helps!

Sara_Jeanne08 · January 30, 2018, 4:36pm

Hi Nicholas,

Thank you again for the help, clarifications and the quick reply!

Regarding viewing the summary data:

I have heard there is a command called view. I guess what I am looking for is a quick way to view the summary stats for my samples after each step via the command line / terminal, instead of going through the QIIME 2 web portal with the .qzv file.

Thanks so much!

Sara

Nicholas_Bokulich · January 30, 2018, 4:54pm

Ah, understood. Yes, you can use:
qiime tools view insert-name-of-file.qzv

to open up a visualizer in a browser window.

I hope that helps!

Sara_Jeanne08 · February 1, 2018, 5:00pm

Hi Nicholas,

I have one other question about summarizing QIIME 2 data:

What is the input file / file type that is used for the metadata tabulate command -

Usage: qiime metadata tabulate [OPTIONS]

  Generate a tabular view of Metadata. The output visualization supports
  interactive filtering, sorting, and exporting to common file formats.

Options:
  --m-input-file MULTIPLE PATH  Metadata file or artifact viewable as
                                metadata. This option may be supplied
                                multiple times to merge metadata. The
                                metadata to tabulate. [required]

Should I be using my map file from QIIME 1 in my QIIME 2 analysis? I only used it to demultiplex my data in QIIME 1 and then imported only the seq.fna file. I admit that when I used summarize table in the earlier steps in QIIME 2 to look at the non-chimeras, the characters in the table did not make any sense to me.

Thank you very much,

Sara

Nicholas_Bokulich · February 1, 2018, 5:25pm

Hi Sara,

The input should be any metadata file ("mapping file"). Yes, a qiime1 metadata file should work.
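As a minimal sketch (the wrapper name and `map.txt` are placeholders for illustration):

```shell
# Sketch: tabulate a mapping/metadata file as an interactive visualization.
# Arguments: $1 metadata (mapping) file, $2 output .qzv
tabulate_metadata() {
  qiime metadata tabulate --m-input-file "$1" --o-visualization "$2"
}
```

For example, `tabulate_metadata map.txt map.qzv`, after which `qiime tools view map.qzv` opens the table in a browser.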

This is discussed in some of the tutorials with examples, e.g., here. The summary contains the same (and more) sample information that biom summarize-table would provide on a biom table (e.g., from qiime1): number of samples, number of features, frequency of each feature, sequences per sample.

Your feature table is just a biom table contained inside a QIIME2 artifact, so you can just export your feature table to biom and use biom summarize-table on the output. Comparing that output to the output of feature-table summarize could be a good way to orient yourself as you transition to QIIME2.
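A rough sketch of that export-and-summarize round trip is below. The wrapper name is made up, and the `qiime tools export` syntax is an assumption about your version: this uses the `--input-path`/`--output-path` form, whereas some older releases take the artifact as a positional argument with `--output-dir`, so check `qiime tools export --help` first.

```shell
# Sketch: export a FeatureTable[Frequency] artifact and summarize it with biom.
# Arguments: $1 feature table artifact (.qza), $2 directory to export into.
# Assumption: this release of `qiime tools export` uses --input-path/--output-path.
export_and_summarize() {
  qiime tools export --input-path "$1" --output-path "$2" || return 1
  # The exported table is written as feature-table.biom inside the directory.
  biom summarize-table -i "$2/feature-table.biom"
}
```

For example, `export_and_summarize clustered-table.qza exported-table` prints the biom summary to the terminal, which you can compare side by side with the feature-table summarize visualization.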

I hope that helps!