Questions about Open Reference OTU Picking

Nicholas_Bokulich · January 30, 2018, 2:52pm

You will want to use the clustered seqs and clustered table. A description of the "new ref seqs" is given in the help documentation for that command (and in the plugin description).

The order of taxonomy classification does not really matter, because in both the QIIME 1 and QIIME 2 tutorials taxonomy classification and diversity analysis are independent downstream analyses, and the taxonomy and diversity data are always used separately. (This is not always the case, e.g., if you want to perform diversity analyses on a feature table collapsed by taxonomy, but that's a niche case and probably not what you are trying to do.)

So the order you list there is fine — but the order listed in the moving pictures tutorial is fine, too, because the taxonomic information is not used in the diversity analyses (and vice versa).

You do bring up a good point, though, with this comment:

The fewer sequences you have for classification and alignment, the faster these steps will be. So you can filter at each step of the way. You could add the following steps (numbered to fit in between steps in your list):

0.0) Summarize your feature table and generate alpha rarefaction curves to see how many reads you have per sample, make sure you have reasonably good coverage in these samples, and determine a good threshold for filtering out low-depth samples.

0.25) Remove samples with fewer reads than the cutoff.

0.5) Filter OTUs based on abundance. Low-abundance OTUs are often erroneous, so whenever using OTU picking methods I would advise a small abundance filter, e.g., requiring a minimum of 10 reads for an OTU to be retained. Remember to use filter-seqs to remove sequences from your representative sequences that are no longer present in the feature table.

0.75) Remove chimeric sequences if you haven't already (another step I'd always recommend with OTU picking). Don't forget to use filter-features and filter-seqs again afterwards (those steps are in the linked tutorial too; I mention them here just for completeness).

1.5) Now that you have assigned taxonomy to your sequences, you could remove sequences that were unassigned if you like, or other sequences that you don't want in there. For example, if you have sequences that hit chloroplast or mitochondrial sequences, you probably want to remove those before additional steps. Don't forget to use filter-features to also remove these from the feature table (using the reference sequences as a metadata input to only retain sequences present in that file).
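
To make the bookkeeping in those extra steps concrete, here is a minimal sketch in plain Python. It is not QIIME 2 code and does not call any QIIME 2 API: the feature table is just a nested dict, and every function name, ID, sequence, and threshold below is made up for illustration. The only point is to show that each filter on the table should be followed by syncing the representative sequences.

```python
from typing import Dict, Iterable, Set

# Toy stand-in for a feature table: sample ID -> {feature (OTU) ID -> read count}.
Table = Dict[str, Dict[str, int]]


def filter_samples(table: Table, min_depth: int) -> Table:
    """Step 0.25: drop samples whose total read count is below min_depth."""
    return {s: counts for s, counts in table.items()
            if sum(counts.values()) >= min_depth}


def filter_features(table: Table, min_reads: int) -> Table:
    """Step 0.5: drop OTUs with fewer than min_reads reads across all samples."""
    totals: Dict[str, int] = {}
    for counts in table.values():
        for feature, n in counts.items():
            totals[feature] = totals.get(feature, 0) + n
    keep = {f for f, n in totals.items() if n >= min_reads}
    return {s: {f: n for f, n in counts.items() if f in keep}
            for s, counts in table.items()}


def drop_features(table: Table, unwanted: Iterable[str]) -> Table:
    """Steps 0.75 and 1.5: remove a known set of feature IDs from the table."""
    unwanted_ids = set(unwanted)
    return {s: {f: n for f, n in counts.items() if f not in unwanted_ids}
            for s, counts in table.items()}


def filter_seqs(rep_seqs: Dict[str, str], table: Table) -> Dict[str, str]:
    """Keep only representative sequences still present in the table."""
    present = {f for counts in table.values() for f in counts}
    return {f: seq for f, seq in rep_seqs.items() if f in present}


def unwanted_by_taxonomy(taxonomy: Dict[str, str],
                         terms: Iterable[str]) -> Set[str]:
    """Step 1.5: feature IDs whose taxonomy label contains any unwanted term."""
    lowered = [t.lower() for t in terms]
    return {f for f, label in taxonomy.items()
            if any(t in label.lower() for t in lowered)}


# Hypothetical toy data.
table: Table = {
    "s1": {"otu1": 500, "otu2": 3, "otu3": 40, "otu4": 60, "otu5": 20},
    "s2": {"otu1": 300, "otu2": 2, "otu3": 10, "otu4": 90, "otu5": 15},
    "s3": {"otu1": 5, "otu2": 1},
}
rep_seqs = {"otu1": "ACGT", "otu2": "ACGA", "otu3": "TTGA",
            "otu4": "GGCA", "otu5": "ACCA"}
chimeras = {"otu5"}  # assumed output of some chimera checker
taxonomy = {
    "otu1": "k__Bacteria; p__Firmicutes",
    "otu3": "k__Bacteria; p__Cyanobacteria; c__Chloroplast",
    "otu4": "k__Bacteria; p__Bacteroidetes",
}

filtered_table = filter_samples(table, min_depth=100)           # 0.25
filtered_table = filter_features(filtered_table, min_reads=10)  # 0.5
filtered_table = drop_features(filtered_table, chimeras)        # 0.75
filtered_seqs = filter_seqs(rep_seqs, filtered_table)           # sync sequences

# 1.5: after taxonomy assignment, drop unwanted hits and sync again.
bad = unwanted_by_taxonomy(taxonomy, ["chloroplast", "mitochondria", "unassigned"])
filtered_table = drop_features(filtered_table, bad)
filtered_seqs = filter_seqs(filtered_seqs, filtered_table)

print(filtered_table)
print(sorted(filtered_seqs))
```

In this toy run, the shallow sample s3, the low-abundance otu2, the "chimeric" otu5, and the chloroplast hit otu3 are all removed, and the sequence set shrinks alongside the table at each sync. In QIIME 2 itself the same pattern applies with the filter-features and filter-seqs actions mentioned above.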

Really confusing, right? The issue here is that there are so many different ways to slice and dice one's data, and the order may or may not matter, depending on user preference and the in(ter)dependence of different arms of analysis (e.g., taxonomy vs. diversity analyses). There is often not a "right" way, either.

Above all, though, your questions are really helpful in guiding us as we work to improve the documentation. 2017 was focused on building up essential core features in QIIME2, and 2018 will bring improvements to the documentation to clarify many of these steps and many of the cool features hidden away in QIIME2. It's a work in progress.

I hope that helps!