Dada2 in a mock community

Gluque · October 20, 2017, 1:27pm

Hi! I have just started with Qiime 2. I have a mock community with 21 members (Mockrobiota #14). I used Dada2 to get a feature table, which shows 77 variants. Shouldn't this value be near 21? The relative frequencies of the detected features also do not seem to match what is expected.

I am listing the commands I have used so far. I would really appreciate your help.

Best,

Guillermo Luque

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path fastq \
--source-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path paired-end-seq.qza

qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-seq.qza \
--p-trunc-len-f 240 \
--p-trunc-len-r 240 \
--o-representative-sequences rep-seqs.qza \
--o-table table.qza

Nicholas_Bokulich · October 20, 2017, 1:44pm

Dear @Gluque,
Thank you for posting!

If you check out the original article for this mock community, you will see that the mock community is actually composed of a mixture of 3 different amplicons; hence around 21 * 3 = 63 variants would be expected. Add the possibilities of multiple copy number variation, contamination, and sequencing error, plus the fact that no method is ever perfect, and 77 variants means that dada2 is actually performing really well in this case. @benjjneb do you have any other thoughts on this?
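To make the arithmetic concrete, here is a quick sketch using only the numbers from this thread. It assumes, as a simplification, exactly one variant per community member per amplicon, which real data will not strictly follow.

```shell
# Numbers taken from this thread. One variant per member per amplicon
# is a simplifying assumption, not a property of the real data.
members=21
amplicons=3
observed=77

expected=$((members * amplicons))   # 21 * 3 = 63
excess=$((observed - expected))     # 77 - 63 = 14

echo "expected about $expected variants, observed $observed ($excess more than expected)"
```

The 14 or so extra variants are the part left to explain by copy variation, contamination, and sequencing error.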

As for the relative frequencies not matching what is expected: that's a common issue with mock communities. Even in very carefully composed mock communities, amplification/sequencing bias, human error, and a host of other issues mean that nothing's perfect. Unfortunately, that's just biology! If you are attempting to, e.g., benchmark a particular method, my advice is to compare the relative performance of different methods on each mock community that you test; you can see an example in this preprint. Because some mock communities are noisier than others (for the reasons above), the absolute performance of a method cannot really be compared across different mock community datasets (e.g., if you wanted to compare your results to those reported in the literature for a different method on a different mock community). It also means that a method that appears to perform poorly (i.e., does not reconstruct the expected composition perfectly) may actually be performing much better than you think, which is why you need to assess relative performance.
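As a rough illustration of what "relative performance" can look like in practice, the sketch below scores two methods against the same expected composition. Bray-Curtis dissimilarity is my choice of metric here, not necessarily what the preprint uses, and the `bray_curtis` helper, the method names, and all the frequencies are invented for illustration.

```shell
# bray_curtis EXPECTED OBSERVED
# Hypothetical helper: each argument is a space-separated list of relative
# frequencies, given in the same taxon order. Prints sum|e-o| / sum(e+o).
bray_curtis() {
  awk -v e="$1" -v o="$2" 'BEGIN {
    n = split(e, ev, " "); split(o, ov, " ")
    for (i = 1; i <= n; i++) {
      d = ev[i] - ov[i]; if (d < 0) d = -d
      num += d; den += ev[i] + ov[i]
    }
    printf "%.3f\n", num / den
  }'
}

# Toy numbers, made up for this example (not taken from Mockrobiota).
expected="0.25 0.25 0.25 0.25"
method_a="0.30 0.20 0.25 0.25"
method_b="0.50 0.10 0.20 0.20"

score_a=$(bray_curtis "$expected" "$method_a")
score_b=$(bray_curtis "$expected" "$method_b")
echo "method A: $score_a, method B: $score_b (lower is closer to expected)"
```

The point is the ranking of the two scores on the same mock community, not either number on its own; the same scores from a noisier mock community would not be comparable.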

In future releases of QIIME 2 we will be adding methods for assessing accuracy of mock community data, so stay tuned for more details!

I hope that answers your questions!

Gluque · October 20, 2017, 2:54pm

Thank you so much for your quick answer. In my lab, we plan to upgrade our 16S pipeline to Qiime 2 and your feedback is (and will be) really valuable for us.

Best!

Nicholas_Bokulich · December 1, 2017, 7:05pm

Just to follow up, the mock community assessment methods mentioned in this thread are now available as new actions evaluate-composition and evaluate-seqs. These are designed with mock communities in mind, but could also be useful for testing simulated communities or other sample types with an “expected” composition/sequences.
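A minimal sketch of calling evaluate-composition follows. It assumes the action is provided by the quality-control plugin and takes expected and observed relative-frequency feature tables; the file names are hypothetical, and it is worth confirming the options with `qiime quality-control evaluate-composition --help` for your release. The guard simply skips the call when qiime or the input files are not present.

```shell
# Hypothetical artifact names. Both are assumed to be relative-frequency
# feature tables describing the same mock community at the same level.
EXPECTED=expected-composition.qza
OBSERVED=observed-composition.qza

if command -v qiime >/dev/null 2>&1 && [ -f "$EXPECTED" ] && [ -f "$OBSERVED" ]; then
  qiime quality-control evaluate-composition \
    --i-expected-features "$EXPECTED" \
    --i-observed-features "$OBSERVED" \
    --o-visualization mock-composition.qzv
  STATUS=ran
else
  # qiime or the input artifacts are missing, so nothing is executed.
  STATUS=skipped
fi
echo "evaluate-composition: $STATUS"
```

evaluate-seqs is called in a similar way, with query and reference sequences as inputs instead of feature tables.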

I hope these help!