Help! Too many taxa in mock community - how to interpret or how to deal with it?

Dear QIIME2 community,

Situation: We added mock communities to our 360 freshwater lake samples.

Observation: With both DADA2 and Deblur, and with each of the primer sets we used, the 10 expected genera of the mock community are almost always the 10 most abundant observed genera. That is good. However, regardless of the method or primer set, we get a long tail of medium- to low-abundance, misclassified false positives. Compared to other mock community examples we found here on the forum, our tail of additional taxa is very large.
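
One way to put a number on the size of this tail is the fraction of reads assigned to genera outside the expected ten. The following is a minimal Python sketch and not part of the original analysis; the function name tail_fraction, the genus names, and the read counts are all hypothetical placeholders.

```python
def tail_fraction(observed_counts, expected_genera):
    """Return the fraction of reads in unexpected genera and those genera,
    sorted from most to least abundant."""
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    unexpected = {g: n for g, n in observed_counts.items()
                  if g not in expected_genera}
    ranked = sorted(unexpected, key=unexpected.get, reverse=True)
    return sum(unexpected.values()) / total, ranked


# Placeholder data: 10 expected genera plus a small tail (made-up counts).
expected = {f"Expected_{i}" for i in range(1, 11)}
counts = {genus: 900 for genus in expected}
counts.update({"Unexpected_A": 600, "Unexpected_B": 300, "Unexpected_C": 100})

frac, tail = tail_fraction(counts, expected)
print(f"{frac:.1%} of reads fall in {len(tail)} unexpected genera: {tail}")
```

Computed per mock replicate, a number like this makes it easier to compare the tail across denoisers and primer sets than reading it off a bar plot.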


  • How should we interpret the many extra taxa in our mock community results?
  • What could be the cause? Contamination? Low amount of DNA?
  • Should we filter the actual samples based on the mock results?

taxa-bar-plots-deblur-all-mock.qzv (429.3 KB)

Results of qiime quality-control evaluate-composition for each primer set, (1) 515 and (2) 799:
comparison-mock515.qzv (355.2 KB)

comparison-mock799.qzv (348.8 KB)

We are thankful for any feedback we get!

Background information:
We added 4 replicates of mock communities to our samples. Some samples, including the mock communities, were sequenced with two different primer sets: (1) 515F-Y and 926R, and (2) 799F and 1193R.
As mock communities, we used the ATCC® MSA-1000™ mix, which contains an even abundance of 10 different bacterial genera.
Samples for both primer sets were demultiplexed and denoised individually with DADA2 and Deblur. For each primer set, a classifier was trained on SILVA 132 99% with qiime feature-classifier fit-classifier-naive-bayes.

Since the results for DADA2 and Deblur are very similar, here are the steps for just one primer set, using Deblur for denoising:

qiime demux emp-paired \
  --m-barcodes-file sample-metadata-mock799.tsv \
  --m-barcodes-column BarcodeSequence \
  --p-no-golay-error-correction \
  --i-seqs emp-paired-end-sequencesRun5A.qza \
  --o-per-sample-sequences demux-mock799.qza \
  --o-error-correction-details demux-details-mock799.qza

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-mock799.qza \
  --o-joined-sequences Taxonomy_deblur_mock/demux-joined-mock799.qza

qiime quality-filter q-score-joined \
  --i-demux Taxonomy_deblur_mock/demux-joined-mock799.qza \
  --o-filtered-sequences Taxonomy_deblur_mock/demux-joined-filtered-mock799.qza \
  --o-filter-stats Taxonomy_deblur_mock/demux-joined-filter-stats-mock799.qza

qiime demux summarize \
  --i-data Taxonomy_deblur_mock/demux-joined-filtered-mock799.qza \
  --o-visualization Taxonomy_deblur_mock/demux-joined-filtered-mock799.qzv

demux-joined-filtered-mock799.qzv (302.9 KB)

qiime deblur denoise-16S \
  --i-demultiplexed-seqs Taxonomy_deblur_mock/demux-joined-filtered-mock799.qza \
  --p-trim-length 311 \
  --p-sample-stats \
  --p-no-hashed-feature-ids \
  --p-jobs-to-start 12 \
  --o-representative-sequences Taxonomy_deblur_mock/rep-seqs-deblur-mock799.qza \
  --o-table Taxonomy_deblur_mock/table-deblur-mock799.qza \
  --o-stats Taxonomy_deblur_mock/deblur-stats-deblur-mock799.qza

## classifier-specific.qza was trained specifically for 799F and 1193R
qiime feature-classifier classify-sklearn \
  --i-classifier classifier-specific.qza \
  --i-reads rep-seqs-deblur-mock799.qza \
  --p-n-jobs 15 \
  --o-classification taxonomy-deblur-mock799.qza

## the same was done for the samples sequenced with the 515F-Y and 926R primers

qiime feature-table merge-taxa \
  --i-data taxonomy-deblur-mock515.qza \
  --i-data taxonomy-deblur-mock799.qza \
  --o-merged-data taxonomy-deblur-all-mock.qza

qiime feature-table merge \
  --i-tables table-deblur-mock515.qza \
  --i-tables table-deblur-mock799.qza \
  --o-merged-table table-deblur-all-mock.qza

qiime taxa barplot \
  --i-table table-deblur-all-mock.qza \
  --i-taxonomy taxonomy-deblur-all-mock.qza \
  --m-metadata-file sample-metadata-all-mock.tsv \
  --o-visualization taxa-bar-plots-deblur-all-mock.qzv

Hi @timpiel, welcome to the QIIME 2 community!

I will start with a few basic observations about mock communities:

  1. We almost never see perfect replication of expected results!
  2. Exogenous contamination (e.g., in reagents or library preps) and cross-contamination (e.g., from other samples in your sequencing run) are common issues leading to observation 1.
  3. Index hopping and other technical errors can also lead to spurious false positives in your samples.
  4. Primer bias and other issues can seriously skew the observed relative abundances away from the expected ones (which fortunately does not seem to be an issue here).

Your data actually look pretty good: you have TDR = 1.0 (taxon detection rate, i.e., all expected organisms were recovered) and pretty good R2 values at level 6 (indicating that, at the genus level, your 10 mock community members are observed at more or less the expected abundances).
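
To make these metrics concrete, here is a minimal Python sketch of how the taxon accuracy rate (TAR), taxon detection rate (TDR), and R2 can be computed for one mock sample. It is an illustration under stated assumptions, not the q2-quality-control implementation: the helper names and all abundances are hypothetical, and unexpected taxa are assumed to enter the R2 comparison with an expected abundance of 0.

```python
def detection_rates(observed, expected):
    """TAR = TP / (TP + FP); TDR = TP / (TP + FN), on sets of taxon names."""
    tp = len(observed & expected)
    fp = len(observed - expected)
    fn = len(expected - observed)
    tar = tp / (tp + fp) if tp + fp else 0.0
    tdr = tp / (tp + fn) if tp + fn else 0.0
    return tar, tdr


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either side has no variance
    return sxy ** 2 / (sxx * syy)


# Placeholder data: an even 10-member mock and made-up observed abundances.
expected_abund = {f"Expected_{i}": 0.1 for i in range(1, 11)}
observed_abund = dict(zip(
    expected_abund,
    [0.12, 0.10, 0.09, 0.08, 0.11, 0.07, 0.09, 0.10, 0.08, 0.06],
))
observed_abund.update({
    "Unexpected_A": 0.04, "Unexpected_B": 0.03, "Unexpected_C": 0.02,
    "Unexpected_D": 0.005, "Unexpected_E": 0.005,
})

tar, tdr = detection_rates(set(observed_abund), set(expected_abund))
taxa = sorted(set(expected_abund) | set(observed_abund))
xs = [expected_abund.get(t, 0.0) for t in taxa]
ys = [observed_abund.get(t, 0.0) for t in taxa]
r2 = r_squared(xs, ys)
print(f"TAR={tar:.2f} TDR={tdr:.2f} R2={r2:.2f}")
```

In this toy case TDR stays at 1.0 because every expected taxon is found, while TAR drops as the tail grows, so TAR is the metric to watch when judging whether a filtering step shortens the tail.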

So the problem is, as you say, a long tail of low-abundance taxa.

I would bet cross-contamination and index hopping are the main causes in your case; possibly also some library prep/reagent contamination.

Should you filter the actual samples based on the mock results? No! The false positives are very likely from cross-contamination and index hopping, so they are organisms from your real samples. Filtering them out would severely skew your real samples.
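
A quick way to probe this hypothesis is to check how many of the unexpected mock taxa also occur in the real samples from the same run. The Python sketch below assumes you have exported genus names as plain sets; the function name split_false_positives and all genus names are hypothetical placeholders.

```python
def split_false_positives(mock_genera, expected_genera, real_genera):
    """Split the unexpected mock genera into those also seen in real
    samples and those seen only in the mock."""
    false_positives = set(mock_genera) - set(expected_genera)
    shared = sorted(false_positives & set(real_genera))
    mock_only = sorted(false_positives - set(real_genera))
    return shared, mock_only


# Placeholder data (made-up genus names).
expected = {f"Expected_{i}" for i in range(1, 11)}
mock = expected | {"Lake_genus_A", "Lake_genus_B", "Reagent_genus_X"}
real = {"Lake_genus_A", "Lake_genus_B", "Lake_genus_C"}

shared, mock_only = split_false_positives(mock, expected, real)
print("Also in real samples:", shared)
print("Only in the mock:", mock_only)
```

Taxa that show up in both lists of samples are consistent with cross-contamination or index hopping, whereas taxa seen only in the mock may point to reagent contamination or misclassification instead. This is only a rough screen, not a statistical test.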

QIIME 2 does not have any methods currently implemented to handle index hopping, but there are some methods out there, e.g., in R, that attempt to discover index hoppers — give those a look! We have a nice discussion of some of these on the forum:

You could also try the R package decontam (not yet implemented in QIIME 2, but will be one day soon). This will help identify any exogenous contaminants, though not cross-contaminants.

I am very glad to see that you are using q2-quality-control for this… this will give you a good comparison to demonstrate magnitude of improvement after applying other filtering techniques.

Give those a spin and please do share your results!