QC/benchmark Qiime2 on Mock community

Qiime2 is a great tool. Before we put it into our production, we would like do some QC and benchmark on mock community. So we purchased zymo mix, (ZymoBIOMICS Microbial Community Standards | ZYMO RESEARCH), prepared DNA (used two different library prep kits) and whole cells (used two different DNA extraction methods and one library kit). Based on the user guide of Zymo, although all should have 12% of each bacterial component, the theoretical composition should be expected at Pseudomonas aeruginosa (4.2%), Escherichia coli (10.1%), Salmonella enterica (10.4%), Lactobacillus fermentum (18.4%), Enterococcus faecalis (9.9%), Staphylococcus aureus (15.5%), Listeria monocytogenes (14.1%), and Bacillus subtilis (17.4%). However, my qiime2 results on mock community are quite different from what I expected. Basically, I followed the “Moving Pictures” tutorial, here are key steps I did

  1. After I de-multiplex our paired-end (PE) reads, I have noticed the quality of data is not that great (see paired-end-demux.qza), therefore I treated our PE reads as single end, set --p-trunc-len 120 and did dada2 using:
    qiime dada2 denoise-single --i-demultiplexed-seqs demux.qza --p-trim-left 0 --p-trunc-len 120 --o-representative-sequences single_rep-seqs-dada2.qza --o-table single_table-dada2.qza --verbose

There are more than 75% read left after dada2, but also apparently there are lots of amplicon sequences variants among my mock community (see attached single_table_dada2.qzv).

  1. For taxonomic analysis, I followed “moving pictures” using a pre-trained Naive Bayes classifier and the q2-feature-classifier plugin. This classifier was trained on the Greengenes 13_8 99% OTUs, where the sequences have been trimmed to only include 250 bases from V4 region of the 16S, which is the same region of our amplicon. As expected, it shows a much more complex community than my input mock community (see taxa-bar-plots.qzv).

Any one has insight on this? I am sure someone have already done this kind of exercise. Would you mind share with me parameters you used for dada2 denoise-single step? Thanks,Gary
taxa-bar-plots.qzv (1.1 MB)
single_table-dada2.qzv (774.5 KB)
paired-end-demux.qzv (282.8 KB)

1 Like


I spend a lot of time working with mock communities, so I think I can shed some insight.

Short answer: your reads are very short (120 nt) so the results you are getting actually look quite good, all things considered.

Long answer:

The underlying problem: mock communities never yield 100% matches between expected and observed compositions. Much of this is outside of the scope of bioinformatics, and hence no bioinformatic optimization can improve that performance. E.g., DNA extraction bias, primer bias, amplification/sequencing bias, copy number variation will all cause deviations that are absolutely outside of your control once the community is sequenced. So whatever zymo mix says on the box, expect very different results.

Instead, the way to use mock communities with bioinformatic methods/pipelines is to assess relative performance. E.g., if you want to see how well QIIME2 performs, don’t look at the absolute compositions, rather compare these results to those from another pipeline, or tweak the methods/parameters that you are using in QIIME2 and compare. For example, you can see how we used mock communities to evaluate taxonomy classifier performance.

How should you compare methods/workflows? q2-quality-control has a few methods, e.g., evaluate-composition that is intended precisely for this application. That method scores the accuracy of taxonomic composition accuracy for each sample (e.g., the same sample processed in different ways and then merged into one table for comparison).

Your mock community results actually look pretty good in a sense. The problem is that many of your ASVs are only classifying to family level, and even a lot that only classify to order level. Hence, most of your expected species (e.g., Staphylococcus, Listeria, and Bacillus, a total of around 47%) wind up classified to Bacilli order or Bacillaceae family at pretty close to the expected proportion. Your Enterococcus is classified to Enterococcaceae, your Lactobacillus is in there (albeit both are at lower proportions than expected), and you have Pseudomonadaceae and Enterobacteriaceae, also at slightly skewed proportions. These skewed proportions are maybe 2-fold differences from expected, but not 10-fold.

So all in all, I’d say you’ve done a pretty good job at reconstructing the expected composition, you just aren’t getting very deep (e.g., genus or species-level) assignments. This is most likely because your reads are very short, 120 bp, so you are losing a lot of sequence information that a longer read would contain — I usually see reasonably reliable genus-level assignments for 250 bp V4 sequences. This may also be caused by the specific organisms (they cannot be differentiated very well below order or family level), and possibly (but probably not) your method/parameter choices. Since you have a mock community you could play around with different taxonomy classification methods in q2-feature-classifier and/or their parameters to see how this impacts classification accuracy, but I expect you should see similar results to what we found in the paper I linked to above. You could also try a different reference database (e.g., SILVA) to see how that impacts your results (in my analyses I find there is not a major difference, but you may find otherwise). The issue is the length of the reads, so you have pretty limited information for squeezing out deeper taxonomic assignments and honestly additional optimization is probably more effort than it’s worth.

By the way, for testing, comparing, and optimizing bioinformatic pipelines you can re-use pre-existing mock communities. We have a collection of mock communities available for these purposes if you are interested.

I hope this helps shed light on this issue!


Thanks for sharing your insight. I owe you a beer.
Actually, our raw PE reads is 300bp. When I check the quality plot, I noticed the quality is not that great. Therefore for dada2 denoise-single, I chose the parameters: --p-trim-left 0 --p-trunc-len 120; for dada2 denoise-paired, I chose the parameters: --p-trim-left-f 0 --p-trim-left-r 6 --p-trunc-len-f 200 --p-trunc-len-r 120. It ends up not many sequences left from dada2 denoise-paired. Kind of hard to find a sweet spot, therefore I was using the length at Q30 to determine --p-trunc-len. Looks like overkill, Do you have any recommended parameter for dada2 PE and SE in my case? thanks again. Gary

PE_table-dada2.qzv (377.6 KB)
paired-end-demux.qzv (282.8 KB)

In your original post it looked like the results you shared were from the single-end workflow, in which case the reads are only 120 nt long after truncation.

I have been in this position many times… it happens. You can be a little more flexible, e.g., go down to Q20, without compromising read yield/quality too much (I haven’t extensively benchmarked, but this is just based on my prior observations). So:

Both are not looking great but I would give these a try:
PE: forward 200, reverse 160
SE: 220

and see how the yield compares to SE 120 (and if yields are ok, go on to check out mock community results).

Of course at the end of the day you will still not get perfect results, as described in my initial response. But you should probably get most of these down to genus level or at least family level if the seq quality is okay.

I hope that helps!

Hi, Nicholas, Thanks for your help. I followed your suggestion and ran the evaluate-composition plug-in, but got these error message. Any idea? Thanks, Gary

Qiime1_L4_tax_hdf5_biom.qza (36.8 KB)
Expect_L4_tax_hdf5_biom.qza (4.8 KB)

(qiime2-2018.2) $ qiime quality-control evaluate-composition --i-expected-features EDGE_Q1_L4_tax_hdf5_biom.qza --i-observed-features EDGE_Q1_L4_tax_hdf5_biom.qza --o-visualization qc-mock-3-comparison.qzv

Plugin error from quality-control:

Requested level of 5 is larger than the maximum level available in taxonomy data (4).

Debug info has been saved to /tmp/qiime2-q2cli-err-rzhrfnyo.log
(qiime2-2018.2) [email protected]:~/scratch/Shannon/Benchmark$ \

(qiime2-2018.2) [email protected]:~/scratch/Shannon/Benchmark$
(qiime2-2018.2) [email protected]:~/scratch/Shannon/Benchmark$ qiime quality-control evaluate-composition --i-expected-features Expect_L4_tax_hdf5_biom.qza --i-observed-features Qiime2_PE_L4_tax_hdf5_biom.qza --o-visualization qc-mock-3-comparison.qzv
Plugin error from quality-control:

Requested level of 5 is larger than the maximum level available in taxonomy data (4).

Debug info has been saved to /tmp/qiime2-q2cli-err-qhn0rgy8.log
(qiime2-2018.2) $ more /tmp/qiime2-q2cli-err-qhn0rgy8.log
Traceback (most recent call last):
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2cli/commands.py", line 246, in call
results = action(**arguments)
File "", line 2, in evaluate_composition
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/sdk/action.py", line 228, in bound_callable
output_types, provenance)
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/qiime2/sdk/action.py", line 424, in callable_executor
ret_val = self._callable(output_dir=temp_dir, **view_args)
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_quality_control/quality_control.py", line 69, in evaluate_composition
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 100, in _evaluate_composition
results, vectors = _compute_per_level_accuracy(exp, obs, metadata, depth)
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 157, in _compute_per_level_accuracy
exp_collapsed = _collapse_table(exp, level)
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 295, in _collapse_table
table.columns, index=table.columns, name='Taxon'), level)
File "/opt/apps/Anaconda3/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_taxa/_method.py", line 27, in collapse
(level, max_observed_level))
ValueError: Requested level of 5 is larger than the maximum level available in taxonomy data (4).

(qiime2-2018.2) [email protected]:~/scratch/Shannon/Benchmark$

I just created a new entry " Plugin error: quality-control evaluate-composition" in the user forum
thanks Gary

qc-mock-Q2_PE-comparison.qzv (535.3 KB)
qc-mock-Q2_SE-comparison.qzv (552.1 KB)
Hi, Nicholas,
Your evaluate-composition plugin works great! I quickly generated results. The purpose of this study is to compare the community composition (using different wet lab protocols) between mock (expected) and Qiime2 results (observed tax composition). I have used both PairEnd: forward 200, reverse 160 and SingleEnd: 220 based on your suggestion of my previous qiime2 run (only using SE 120bp). attached are results in qzv files.
I have two questions here;

  1. The graphic in qzv results looks like summarized from average of all samples under different tax level. If i want to compare different experiments, I guess i just need plug to Excel for downstream analysis
  2. Does my result look good? If not, do you have any recommendation on how to improve it?
    I also attached my dada2 --verbose output here. DADA2_verbose.txt When i convert biom file (taxonomic relative composition) to qza file, i used qiime tools import --type 'FeatureTable[RelativeFrequency]' --source-format BIOMV210Format, is that correct?

thanks a lot,

1 Like


Awesome, a perfect application!

Yeah, in practice I wind up doing the same thing right now (download the csv and open in R). I had considered adding options to compare by group, etc, but the downstream needs/possibilities are really just too complicated to squeeze into a single method. Comparisons across a single “group” variable will probably be added in the future — or making these results an artifact that can be analyzed with different methods in this plugin. This plugin is still in active development so this may change (we are always open to suggestions).

Looks like you are getting really high false-positive detection (misclassifications). You might want to try a higher precision classifier (e.g., increase the confidence parameter for classify-sklearn)… though don’t be surprised if you start getting more false negatives/underclassifications. But since this averages across all samples it is likely that some of the methods you are comparing are performing much better than others — I would take a look at that before changing methodology.

That looks correct — if it worked, you should be fine.

Actually i just found out----not all my processed samples are from mock samples. After i removed these none-mock samples, my results look very good. qc-mock-Q2-SE-L6-comparison.qzv (316.4 KB)
qc-mock-Q2-PE-L6-comparison.qzv (315.8 KB)
thanks you, Nicholas,

1 Like

Awesome! Looks great!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.

Hi @hotblast,
Note that a bug was found in quality-control evaluate-composition that caused TAR and TDR scores to be reversed. See this announcement for more details if you are affected.