OTU picking questions

I am running our dataset through QIIME 2. After a meeting, one of our statisticians raised some questions that I do not know how to answer correctly. Please help me with the following:

  1. The so-called 100% OTUs are still analyzed as OTUs, so this part of dada2 can be considered an OTU-picking method. This matters because it is not clear that 100% OTUs are the best choice. For example, if you want to use taxonomic assignments, the dada2 OTUs that differ by a single base pair will all be assigned the same taxonomy. Is having twice as many OTUs that differ by only 1-2 bases really preferable?
  2. Doing some OTU post-processing after dada2 might be a compromise; this would basically use dada2 as a quality-control step (so that reads that contain rare bases are discarded as errors while common bases are kept). For example, send the sequences that characterize the OTUs from dada2 into a (closed-reference) OTU-picking program and pool accordingly.

Thanks, Bing

Hi Bing,

On 1: The “100% OTUs” output by dada2 (I call them “sequence variants” or SVs) are analyzed much like OTUs. However, it is likely that you will actually find fewer SVs than you previously found OTUs. Although some 3% OTUs lump multiple SVs together, the dada2 method has a lower false-positive rate than the most common OTU methods, such as uclust or average-linkage clustering, and in general I have seen a significant reduction in the total number of features.

On 2: If the biological phenomenon of interest is at higher taxonomic scales, it is of course useful to analyze the data at those higher scales (e.g. perhaps you are interested primarily in the ratio of Bacteroides to Firmicutes). In that case SVs can be grouped together on the basis of taxonomy (how I usually do it) or used as input to an OTU method with some pre-determined threshold. I’m not sure if that type of OTU picking is implemented yet in QIIME2, although one of the Q2 experts can clarify there. That said, I would recommend starting at the most-resolved level (SVs) unless you have prior knowledge that the higher taxonomic levels are where the action is, as there can be significant functional differences between bacteria with similar 16S sequences.
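To make the second option concrete, here is a minimal sketch of pooling SVs into OTUs at a fixed identity threshold. It is an illustration only, not uclust, dada2, or any QIIME 2 method: the function names `identity` and `cluster_svs` are hypothetical, the sequences and counts are made up, and identity is computed position by position on equal-length sequences rather than from a real alignment.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_svs(
    sv_counts: Dict[str, int], threshold: float = 0.97
) -> List[Tuple[str, int, List[str]]]:
    """Greedy centroid clustering: the most abundant SVs seed clusters first.

    Returns (centroid sequence, pooled count, member sequences) per cluster.
    """
    clusters: List[list] = []  # each entry: [centroid, pooled_count, members]
    # Visit SVs from most to least abundant (ties broken by sequence).
    for seq, count in sorted(sv_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster[1] += count       # pool the reads into this OTU
                cluster[2].append(seq)
                break
        else:
            clusters.append([seq, count, [seq]])  # start a new OTU
    return [(c, n, members) for c, n, members in clusters]


# Toy data (assumed): a 40 bp "amplicon" and two variants of it.
base = "ACGT" * 10
one_off = "T" + base[1:]        # 1 mismatch  -> 97.5% identity to base
three_off = "TTTT" + base[4:]   # 3 mismatches -> 92.5% identity to base
svs = {base: 100, one_off: 40, three_off: 10}

otus = cluster_svs(svs, threshold=0.97)
for centroid, pooled, members in otus:
    print(centroid[:8] + "...", pooled, len(members))
```

With these toy inputs the single-base variant is pooled with the abundant SV, while the three-base variant stays separate. Whether that pooling is desirable depends on the question being asked.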

On 1 again: There are some qualitative advantages of SVs over OTUs that we think are important to be aware of in the areas of reproducibility, reusability and comprehensiveness. We have posted a preprint outlining these arguments that may be worth reading and/or sharing with your collaborator: http://www.biorxiv.org/content/early/2017/03/07/113597

> In that case SVs can be grouped together on the basis of taxonomy (how I usually do it)

You can do this in QIIME 2 with qiime taxa collapse to create a FeatureTable[Frequency] at a specific taxonomic level.
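Conceptually, collapsing by taxonomy sums the counts of all features that share the same annotation down to the chosen level. The sketch below illustrates that idea in plain Python; it is not the q2-taxa implementation. The function name `collapse_by_taxonomy`, the nested-dict table layout, the `"__"` padding for missing ranks, and the example data are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict


def collapse_by_taxonomy(
    table: Dict[str, Dict[str, int]],
    taxonomy: Dict[str, str],
    level: int,
) -> Dict[str, Dict[str, int]]:
    """Sum per-sample counts of features that share a taxonomy prefix.

    table:    {feature_id: {sample_id: count}}
    taxonomy: {feature_id: "k__Bacteria; p__Firmicutes; ..."}
    level:    number of ranks to keep (1 = kingdom, 2 = phylum, ...)
    """
    collapsed: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for feature_id, sample_counts in table.items():
        ranks = [r.strip() for r in taxonomy[feature_id].split(";")]
        # Pad shallow annotations so every feature has `level` ranks.
        ranks += ["__"] * (level - len(ranks))
        key = ";".join(ranks[:level])
        for sample_id, count in sample_counts.items():
            collapsed[key][sample_id] += count
    return {k: dict(v) for k, v in collapsed.items()}


# Toy feature table (assumed): three SVs observed in two samples.
table = {
    "sv1": {"S1": 10, "S2": 0},
    "sv2": {"S1": 5, "S2": 7},
    "sv3": {"S1": 1, "S2": 2},
}
taxonomy = {
    "sv1": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "sv2": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "sv3": "k__Bacteria; p__Bacteroidetes",
}

phylum_table = collapse_by_taxonomy(table, taxonomy, level=2)
print(phylum_table)
```

Here the two Firmicutes SVs are merged into one phylum-level feature, while collapsing the same table at level 3 would keep them apart as different classes.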

> or used as input to an OTU method with some pre-determined threshold. I'm not sure if that type of OTU picking is implemented yet in QIIME2, although one of the Q2 experts can clarify there.

I believe the only "OTU" picking methods available in QIIME 2 at this time are DADA2 and Deblur (available in q2-dada2 and q2-deblur, respectively). These both create "100% OTUs".

@Bing, if you or someone you know is interested in developing a QIIME 2 plugin for other OTU picking methods, there are instructions for developing a QIIME 2 plugin, and we're happy to assist with any questions!

There may be good reason to consider even strain-level variation (which could amount to 1-2 bp differences between sequence variants), depending on your hypothesis. If you look at this paper from 2011 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026732), you will see that the strain-level differences the authors identified revealed clinically and epidemiologically important features that would have been missed had they “just” stuck with classification at the genus or even the species level.
