Taxonomic Classification Only Reaching Genus Level (6) – Need Help with Deeper Classification

Dear QIIME 2 Community,

We are currently analyzing paired-end 16S sequencing data from the V3–V4 region using QIIME 2 (version 2024.10). We performed taxonomic classification using both the SILVA 138.1 NR99 and GTDB_220 databases, for which we trained region-specific classifiers with qiime feature-classifier fit-classifier-naive-bayes.

Here are the steps we followed in our QIIME 2 workflow:

!qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ../FASTQ-Dateien \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path paired-end-demux.qza

!qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-error-rate 0.05 \
  --p-cores 30 \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux-trimmed-v3v4.qza

!qiime demux summarize \
  --i-data demux-trimmed-v3v4.qza \
  --o-visualization demux-trimmed-v3v4-summary.qzv

!qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed-v3v4.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 200 \
  --p-n-threads 30 \
  --o-representative-sequences rep-seqs-dada2-v3v4.qza \
  --o-table table-dada2-v3v4.qza \
  --o-denoising-stats stats-dada2-v3v4.qza

!qiime metadata tabulate \
  --m-input-file stats-dada2-v3v4.qza \
  --o-visualization stats-dada2-v3v4.qzv

!qiime feature-classifier classify-sklearn \
  --i-classifier ../Classifier/silva-138.1-ssu-nr99-classifier-v3v4.qza \
  --i-reads 16S-rep-seqs-v3v4.qza \
  --o-classification 16S-taxonomy-v3v4.qza

The Problem

When we inspect our taxonomic classification results, we mostly reach only genus level (Level 6) and occasionally species level (Level 7).

As a reference, we tested the Nextflow-based AmpliSeq pipeline, and using the same dataset, we were able to classify taxa down to subspecies (Level 9).

Our Suspicions

  1. Read quality & truncation in DADA2
    Could the read length trimming in DADA2 be affecting the classification depth?
    What is the best way to determine the optimal truncation length for DADA2? Is there an automated way to trim based on Phred scores instead of manual truncation?
  2. Classifier training & database limitations
    Are there any best practices for improving classifier accuracy to reach deeper taxonomic levels?
    Would it help to use a different feature-classifier approach?
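
For context on the first point, here is a minimal sketch of the merge-overlap arithmetic behind the truncation settings above. The 425 nt primer-free amplicon length is an assumed, illustrative value (V3–V4 lengths vary between taxa and should be checked against the actual data), and `merge_overlap` is a hypothetical helper, not part of QIIME 2 or DADA2:

```python
def merge_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Number of bases shared by the forward and reverse reads after truncation."""
    return trunc_len_f + trunc_len_r - amplicon_len


# ASSUMPTION: illustrative primer-free V3-V4 amplicon length, not measured
# from this dataset.
ASSUMED_AMPLICON_LEN = 425
# DADA2's default minimum overlap required to merge a read pair.
MIN_OVERLAP = 12

overlap = merge_overlap(250, 200, ASSUMED_AMPLICON_LEN)
print(f"overlap = {overlap} nt, mergeable = {overlap >= MIN_OVERLAP}")
```

Under that assumption the settings leave 25 nt of overlap, so truncating much further would mainly risk losing reads at the merge step; the "merged" column of the denoising stats is the place to confirm this.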

We have tested multiple parameter adjustments, but we are still struggling to improve our classification depth in QIIME 2 compared to AmpliSeq. Any suggestions or insights would be greatly appreciated!

Many thanks in advance!

Best regards

Hi @nico,

Just a quick drive-by comment here...

It is generally not possible to obtain species- and subspecies-level classification from 16S rRNA amplicons, especially from short variable regions such as V4. For some taxa this is difficult even with the full-length 16S rRNA gene sequence.

As far as I am aware, SILVA (currently at version 138.2) still curates taxonomy only to the genus level, even though it provides species-level labels, mainly for the reasons I mentioned above.

:bar_chart: Here are some links on the topic that discuss (in part or in full) how over-, under-, and mis-classification of taxa is a real issue encountered by many tools and curation practices (there are many more articles on this topic):

I am not familiar with AmpliSeq, or how their reference database was curated. My off-the-cuff response would be that many of the species and subspecies classifications you are observing are likely over-classifications and not real. Just because a classifier returns a species-level classification does not mean it is necessarily correct.

:thinking: My 2 cents: I'd caution against relying on species-specific taxonomic classification from short amplicon reads. If you truly require species- and strain-level identification, then I'd suggest looking into shotgun metagenomic sequencing.
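
If it helps, here is a minimal sketch of one way to avoid reporting over-classified labels: cut each taxonomy string back to genus (Level 6) before summarizing. The function name `truncate_taxonomy` and the example lineage are illustrative assumptions, and the rank prefixes simply mimic the semicolon-delimited style of SILVA taxonomy strings:

```python
def truncate_taxonomy(taxon: str, level: int = 6, sep: str = ";") -> str:
    """Keep only the first `level` ranks of a delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxon.split(sep) if rank.strip()]
    return "; ".join(ranks[:level])


# Illustrative lineage only; not taken from the dataset in this thread.
example = (
    "d__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; "
    "f__Lachnospiraceae; g__Blautia; s__Blautia_obeum"
)
print(truncate_taxonomy(example))           # drops the species rank
print(truncate_taxonomy(example, level=5))  # family-level label
```

Within QIIME 2 itself, `qiime taxa collapse` with `--p-level 6` does the equivalent on a feature table.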

I'm sure others will have thoughts to add. :slight_smile:

Hi @nico and @SoilRotifer,

Mike has already said most of what I'd say, but more nicely!

That is a fantastic list of references and I need to add a few to my TBR pile! :books:

If I can offer one more classic about 16S rRNA classification, I also still go back to Wang et al., 2007, which discusses taxonomic resolution based on fragment length.

I think the only piece of insight for improving classification I might add beyond what @SoilRotifer has suggested is that I sometimes use environment-specific databases when species really matters. (Vaginal microbiome is my most frequent application.) You may want to take them with a grain of salt because, as Mike keeps reminding me, having outgroups is super important and environment-specific databases don't always do a good job of including them.

It's been a long week, I have a :tea:, and so let me ask a philosophical question that would make my literature teacher's eyes roll so far back in their head they'd come around and miss the kids doing their math homework in the back of the classroom. Shakespeare asks if a rose by any other name would smell as sweet. If you'll let me paraphrase, is there a reason that an ASV with a species label is inherently better than one without? The fact that you can slap a name on something doesn't mean the name is helpful, useful, or accurate. I would submit that precision without accuracy is more dangerous than accuracy without precision.

You can analyze your data at a community level with relatively imprecise taxonomic inference. A family-level plot is often good enough to confirm that you actually sequenced what you expected to sequence. (I'd even posit that having more precision here makes the graph less readable.) You don't need to decorate the sequences with names to get alpha or beta diversity. You can do differential abundance testing without names as well, although I do agree that having some kind of taxonomic information improves inference. But, again, depending on your system of study, the precision of that inference may be varying degrees of helpful. Very few microbes are E. coli, and there was a fantastic preprint a couple of months ago about how most microbes are understudied in the literature. I suggest reporting your ASV with an identifier (I tend to use a modified hex code), providing the externally valid ASV sequence, and describing it in the text as "An ASV mapped to an unclassified genus in Lachnospiraceae (Lachno-a132f)" or something.
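
As a rough illustration of that kind of label, here is a minimal sketch. It is only an assumption about what a "modified hex code" could look like: a short taxon hint plus the first few hex characters of the sequence's MD5 digest, similar in spirit to the hashed feature IDs that q2-dada2 produces by default. The function name, the prefix, and the example sequence are all made up for illustration:

```python
import hashlib


def asv_label(sequence: str, taxon_hint: str, n_hex: int = 5) -> str:
    """Build a short, reproducible ASV label such as 'Lachno-xxxxx'."""
    # Hash the upper-cased sequence so the label does not depend on case.
    digest = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
    return f"{taxon_hint}-{digest[:n_hex]}"


# Made-up sequence fragment, not a real ASV from any dataset.
example_seq = "TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGG"
print(asv_label(example_seq, "Lachno"))
```

Because the label is derived from the sequence itself, anyone with the ASV sequence can regenerate it, which keeps the short name tied to the externally valid identifier.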

If you have specific organisms you need to identify because they're critical to your hypothesis, I think techniques like qPCR are great options. If your community-level hypothesis is species-focused, I agree shotgun is the way to go. But, otherwise, embrace the ambiguity.

Best,
Justine

Dear @SoilRotifer and @Justine,

Thank you both so much for your detailed and insightful responses! Your explanations and references have been incredibly helpful in putting our expectations into perspective.

One of the most important takeaways for us is the clear limitation of short-read 16S sequencing in achieving species-level classification. We initially found it surprising that nf-core AmpliSeq provided results down to level 9, but based on your explanations, it seems likely that this represents overclassification rather than truly reliable species and subspecies assignments. This is an important realization for our analysis.

Ultimately, given that we are currently working with a MiSeq system and do not have access to shotgun metagenomics, we will likely have to accept a lower level of taxonomic resolution. However, your responses have really helped us see that this is not necessarily a major limitation—community-level analysis and family/genus-level classifications still provide meaningful biological insights.

The links and discussions you've shared have been extremely valuable in reframing our approach to taxonomy assignment. We truly appreciate your time and expertise in guiding us through these considerations. Thanks again for helping us see things from a different angle!

Best,
Nico
