Reduced Number of Taxa Identified?

I've been processing a subset of demultiplexed paired-end sequences through QIIME2 and DADA2. The subset comes from a larger set of sequences that was previously processed into an OTU table via QIIME1.

I imported the sequences, joined their paired ends, denoised using DADA2, and ended up with a feature frequency table as well as a representative sequences file. From here I wanted to produce a biom table with taxonomic labels at the species level, so I applied one of the pretrained Naive Bayes classifiers to my representative sequences file.

qiime feature-classifier classify \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

I tested the classifier: classifiertest.tsv (110.7 KB)

Then I collapsed the taxa to level 7:

qiime taxa collapse \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 7 \
  --o-collapsed-table table-l7.qza

Then I exported the collapsed table and converted it to TSV:

qiime tools export table-l7.qza --output-dir exported-l7-table
biom convert -i exported-l7-table/feature-table.biom -o exported-l7-table/feature-table.tsv --to-tsv

This gave me the resulting TSV file: feature-table.tsv (15.0 KB)

It looks like only about 100 distinct classifications were made from my sequences, which is really low compared to what was generated when these sequences were run with QIIME1. I believe de novo OTU clustering via QIIME1 identified ~300 taxa. Did I do something incorrect in my workflow, so that fewer taxa were identified, or is this a product of the higher-resolution identification that comes with DADA2 and ASVs?
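To double-check the count of distinct classifications, the exported table can be tallied directly. The following is a minimal sketch, assuming the TSV has the usual `biom convert --to-tsv` layout (comment and header lines starting with `#`, then one row per collapsed taxon with the label in the first column) and Greengenes-style rank prefixes such as `s__`. The function names `count_taxa` and `count_named_species` are hypothetical helpers, not part of QIIME 2 or biom.

```shell
# Count taxon rows in a TSV exported by `biom convert --to-tsv`.
# Lines starting with '#' (the biom comment line and the "#OTU ID" header) are skipped.
count_taxa() {
    awk -F '\t' '!/^#/ && NF > 1 { n++ } END { print n + 0 }' "$1"
}

# Count rows whose last rank carries an actual species name,
# i.e. the label does not end in a bare "s__" or "__" placeholder.
# ASSUMPTION: ranks are separated by ';' and species labels start with "s__".
count_named_species() {
    awk -F '\t' '!/^#/ && NF > 1 {
        n_ranks = split($1, ranks, ";")
        last = ranks[n_ranks]
        gsub(/^[ \t]+|[ \t]+$/, "", last)   # trim surrounding whitespace
        if (last ~ /^s__.+/) named++
    } END { print named + 0 }' "$1"
}
```

For example, `count_taxa exported-l7-table/feature-table.tsv` prints the number of distinct level-7 labels, and `count_named_species` on the same file shows how many of those are resolved to a named species rather than stopping at a higher rank.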


Hi @Pauline_Trinh,

Yes! This is a very typical result, and as they show in the DADA2 article itself, ASVs are much better than OTUs at replicating the expected number of sequences in a sample (at least in simple mock communities). OTUs tend to be rather noisy, especially depending on what other processing steps precede/follow OTU picking (e.g., quality filtering, chimera filtering).

Note that the published results I am citing concern expected sequences, not taxa, but the same trend would be expected, since many of those noisy OTUs, and especially any chimeras in the mix, would probably classify to unique taxa.

100 distinct taxa down from 300 does not sound too extreme, but you could check the numbers of sequences going into and coming out of DADA2 and review your trimming procedures. That will tell you whether you are losing too many sequences, and with them rare taxa, and need to adjust your parameters.
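As a rough way to run that input/output check, here is a sketch that computes the percentage of reads retained per sample. It assumes the per-sample counts are already in a tab-separated file whose header row names an `input` column and a `non-chimeric` column; that layout, the file name `dada2-stats.tsv`, and the function name `dada2_retention` are assumptions for illustration, so adjust the column names to whatever your DADA2 run actually reports.

```shell
# Print per-sample read retention: sample, input reads, output reads, % retained.
# ASSUMPTION: the header row names columns "input" and "non-chimeric";
# lines starting with '#' (e.g. a "#q2:types" row) are ignored.
dada2_retention() {
    awk -F '\t' '
        /^#/ { next }
        !have_header {
            # Locate the two columns by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "input") in_col = i
                if ($i == "non-chimeric") out_col = i
            }
            have_header = 1
            next
        }
        in_col && out_col && $in_col > 0 {
            printf "%s\t%d\t%d\t%.1f\n", $1, $in_col, $out_col, 100 * $out_col / $in_col
        }
    ' "$1"
}
```

Running `dada2_retention dada2-stats.tsv` prints one line per sample. Samples that keep only a small fraction of their input reads are the ones where trimming or truncation settings are worth revisiting.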

You could also look at the abundance distribution of those ~300 taxa that were identified from OTUs. You will probably see a long tail of low-abundance taxa, and it is these taxa that are most likely to be missing from your current analysis (since spurious OTUs are going to fall into this low-abundance zone).
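One way to look at that abundance distribution is to sum each taxon's counts across samples and see how many fall below a cutoff. The sketch below assumes the OTU-based taxa table has been exported to the same `biom convert --to-tsv` layout (lines starting with `#` are headers, taxon label in the first column, one count column per sample). The helper names `rank_taxa` and `count_rare_taxa`, and the default cutoff of 10 reads, are hypothetical choices rather than recommended values.

```shell
# Print "total<TAB>taxon" for every taxon, most abundant first.
rank_taxa() {
    awk -F '\t' '!/^#/ && NF > 1 {
        total = 0
        for (i = 2; i <= NF; i++) total += $i   # sum counts across samples
        printf "%s\t%s\n", total, $1
    }' "$1" | sort -t "$(printf '\t')" -k1,1nr
}

# Report how many taxa have a total count below a cutoff.
# $1 = TSV file, $2 = minimum total count (ASSUMPTION: defaults to 10).
count_rare_taxa() {
    awk -F '\t' -v min="${2:-10}" '!/^#/ && NF > 1 {
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        n++
        if (total < min) rare++
    } END {
        printf "%d of %d taxa have fewer than %d total reads\n", rare, n, min
    }' "$1"
}
```

If most of the roughly 200 taxa that disappeared sit at the bottom of the `rank_taxa` output for the OTU table, that is consistent with them being low-abundance, possibly spurious, OTUs rather than a problem in the DADA2 workflow.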

I hope that helps! :sun_with_face:
