Many unassigned features using SILVA trained classifier for V1-V3 hypervariable region


I'm currently running the QIIME2-DADA2 (v. 2020.8) pipeline to process 16S paired-end sequences from different hypervariable regions separately. Each set was trimmed with Cutadapt, then quality-filtered and denoised with DADA2.

When I tried to train my own classifiers for each region, I was able to get very good feature classifications down to the species level for the vast majority of features for all regions except V1-V3. For the V1-V3 region, the vast majority are unassigned.

Just to give an idea of how badly classified: 1492/2723 are unassigned, 1225/2723 are classified to domain level, and only 6 total are down to the species level. So, essentially almost all are either unassigned or classified to domain level only.
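For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of how the counts per taxonomic depth could be tallied. It assumes SILVA-style taxonomy strings with rank prefixes (`d__`, `p__`, ...) separated by semicolons and the literal label `Unassigned`; the function names and example strings are made up for illustration and are not from my actual data.

```python
from collections import Counter

# Rank order assumed for SILVA-style taxonomy strings (d__; p__; c__; ...).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def deepest_rank(taxon: str) -> str:
    """Return the deepest rank that carries a name in a taxonomy string."""
    if taxon.strip() == "Unassigned":
        return "unassigned"
    depth = 0
    for level in taxon.split(";"):
        level = level.strip()
        # Stop at the first empty level or bare placeholder such as "g__".
        if not level or level.endswith("__"):
            break
        depth += 1
    if depth == 0:
        return "unassigned"
    return RANKS[min(depth, len(RANKS)) - 1]


def tally_depths(taxa):
    """Count how many features reach each taxonomic depth."""
    return Counter(deepest_rank(t) for t in taxa)


# Hypothetical example strings, for illustration only.
example = [
    "Unassigned",
    "d__Bacteria",
    "d__Bacteria; p__Firmicutes; c__Bacilli",
    "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Streptococcaceae; g__Streptococcus; s__Streptococcus_mitis",
]
print(tally_depths(example))
```

In practice the list of strings would come from the Taxon column of the exported taxonomy table.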

I used the SILVA full length sequences & taxonomy files below from the Q2 data resources to train the classifier:

I followed the Q2 feature classifier tutorial to train the classifiers using primers 27FYM (5′-AGAGTTTGATCMTGGCTCAG-3′) and 519R (5′-GWATTACCGCGGCKGCTG-3′)

I've searched for issues related to unclassified features in this forum. I've tried/considered solutions from these similar posts, but they did not seem to resolve the issue I am having.




Specifically, I double-checked that I was passing the correct primers to the qiime feature-classifier extract-reads command and using the DADA2 paired-end output. I have not yet tried going back to rerun DADA2 denoising with different parameters, which this post mentioned, and I am wondering whether it would be worth trying in my case.

I did truncate the sequences from this region using the following parameters, chosen based on FastQC reports (screenshot below): --p-trunc-len-f 235 --p-trunc-len-r 190

Forward sequences:

Reverse sequences:

Providing this summary as well in case it is helpful: 16S human oral samples from public database, MiSeq 2x300 paired-end, QIIME2 version 2020.8, V1-V3 primers (27FYM & 519R)
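As a rough check on the truncation lengths above, the sketch below computes how much the truncated forward and reverse reads would overlap for a given amplicon length. The amplicon lengths in the loop are hypothetical placeholders, not values measured from this dataset, and the 12 nt threshold is the default `minOverlap` of DADA2's `mergePairs`; the real post-trimming amplicon length would need to be checked against the data.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Overlap in nt between truncated forward and reverse reads.

    A negative value means the reads leave a gap and cannot be merged.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


MIN_OVERLAP = 12  # default minOverlap in DADA2's mergePairs

# Hypothetical amplicon lengths (after primer removal), for illustration only.
for amplicon_len in (400, 450, 500):
    overlap = expected_overlap(235, 190, amplicon_len)
    status = "can merge" if overlap >= MIN_OVERLAP else "too short to merge"
    print(f"amplicon {amplicon_len} nt: overlap {overlap} nt ({status})")
```

If most real amplicons turn out to be longer than the summed truncation lengths minus the minimum overlap, longer truncation lengths would be worth testing.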

I am not sure what steps I should take to resolve this. Any advice/suggestions would be greatly appreciated. Thanks very much.


This may be a real signal, as not all variable regions provide equivalent resolution for all microbial taxa across all environments. That is, some researchers will validate which gene region best resolves microbial communities found in the environment under study.

Here are some related papers:

Another thought... :thinking: Many "full-length" sequences in the various reference databases were generated using the 27F primer, which means many of the reference sequences may not contain that primer sequence: it was likely removed before the sequence was deposited, or it was simply not part of the sequencing output. Thus a primer that sits at the extreme ends of the SSU rRNA gene will likely not be found in many references. Depending on the tool you use to search for and trim the sequences, those references may be discarded when the primer is not found. That is, your V1V3 reference database may be very small and contain much less reference sequence data than the other regions.
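One way to gauge this is to count how many reference sequences actually contain both primer sites. Below is a minimal sketch, assuming exact matching of the degenerate primers from the original post via IUPAC-to-regex expansion. It is stricter than a mismatch-tolerant primer search, so it gives only a rough lower bound, and the toy sequences and function names are invented for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> re.Pattern:
    """Compile a degenerate primer into an exact-match regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def has_both_primers(seq: str, fwd: str, rev: str) -> bool:
    """True if fwd is found and the reverse primer site lies downstream of it."""
    seq = seq.upper()
    fwd_hit = primer_regex(fwd).search(seq)
    if fwd_hit is None:
        return False
    # The reverse primer anneals to the opposite strand, so look for its
    # reverse complement on the reference strand, after the forward site.
    rev_hit = primer_regex(reverse_complement(rev)).search(seq, fwd_hit.end())
    return rev_hit is not None


FWD = "AGAGTTTGATCMTGGCTCAG"  # 27FYM, from the original post
REV = "GWATTACCGCGGCKGCTG"    # 519R, from the original post

# Toy reference sequences, for illustration only.
with_primers = "AGAGTTTGATCCTGGCTCAG" + "ACGTACGTAC" + "CAGCAGCCGCGGTAATAC"
missing_27f = "ACGTACGTAC" + "CAGCAGCCGCGGTAATAC"

print(has_both_primers(with_primers, FWD, REV))
print(has_both_primers(missing_27f, FWD, REV))
```

Running `has_both_primers` over every sequence in the full-length reference FASTA and comparing the hit count with the total would show how much of the database can survive extraction for this primer pair.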

Have you used the full-length classifier on the V1V3 or other regions? This would be a good sanity-check to make sure that the lack of classification is not due to the poor resolution of V1V3 for your data. That is, if you observe reasonable output with the full-length classifier, then you'll know there is an issue with your V1V3 database curation. By the way, using the full-length classifier for amplicon regions is valid.


Thanks very much for your prompt response and suggestions.

To clarify, do you mean not extracting reads and instead using the full-length sequences for classification? If yes, I did try the full-length classifier for the V1V3 region but not for the others. In that attempt, I used the naive Bayes classifier trained on the SILVA 138 99% OTUs full-length sequences available in the Q2 data resources, and the classifications did not improve; they were equally poor. Would you recommend I try something different here?


Assuming your reads are not in a mixed orientation (you can search the forum for more threads on this issue), I'm not sure what else to suggest here.

But as I mentioned earlier, if all the other regions look okay, then it may be that V1V3 region does not contain the resolution needed for your study system. Perhaps manually BLAST a few sequences to be sure? :man_shrugging:

Just curious, have you tried BLASTing some of those unassigned reads to see what they hit? The point is just to do a sanity check that the reads are truly bacterial and not host contaminants, which are common in low-biomass samples.
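If it helps, here is a minimal sketch for pulling a small random subset of unassigned representative sequences into FASTA text that can be pasted into a BLAST search. It assumes the representative sequences are available as FASTA and the taxonomy as a feature-ID-to-taxon mapping with the literal label `Unassigned`; the toy data and function names are hypothetical.

```python
import random


def read_fasta(text: str) -> dict:
    """Parse FASTA text into a {sequence_id: sequence} dict."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def sample_unassigned(fasta_text: str, taxonomy: dict, n: int = 5, seed: int = 0) -> str:
    """Return up to n randomly chosen unassigned sequences as FASTA text."""
    records = read_fasta(fasta_text)
    unassigned = sorted(fid for fid in records if taxonomy.get(fid) == "Unassigned")
    chosen = random.Random(seed).sample(unassigned, min(n, len(unassigned)))
    return "".join(f">{fid}\n{records[fid]}\n" for fid in chosen)


# Toy data, for illustration only.
toy_fasta = ">feat1\nACGTACGT\n>feat2\nTTGGCCAA\n>feat3\nGGGGCCCC\n"
toy_taxonomy = {"feat1": "Unassigned", "feat2": "d__Bacteria", "feat3": "Unassigned"}

print(sample_unassigned(toy_fasta, toy_taxonomy, n=2))
```

A handful of sequences is usually enough to see whether the hits are bacterial 16S or host DNA.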


Thanks for your question and the suggestion. I tried BLASTing some of the unassigned reads and unfortunately I only got host DNA matches. I also did a comparison with another region I trained classifiers for just to double check and had bacteria matches with that region.

Thanks! So they match well to some other target? Or are they just non-bacteria?
I don't have anything more useful to offer than @SoilRotifer unfortunately, but is it possible that this set just has a lot more host contamination? So, is the source DNA from all of your regions the same? Or did each of these regions you sequenced go through their own separate DNA extraction, amplification etc? I just wonder if something happened with this run specifically that caused excessive host DNA to be released.


Great questions.

Yes, they seem to match well to host DNA, mostly. I have attached the rep-seqs files for the region in question and another from V3V4 in case you're curious to look.

rep-seqsV1V3.qzv (845.8 KB)
rep-seqs-V3V4.qzv (1.6 MB)

Yes, this is what I'm thinking too.

The source DNA is not exactly the same across the regions: each region contains sequences from studies that collected some kind of oral sample (e.g. buccal, tongue, saliva, etc.). I hope I've understood the question correctly. All of the sequences were pulled from NCBI's SRA and grouped by hypervariable region to be run through the QIIME 2 pipeline. The sequences grouped under V1V3 happen to all come from one study, whereas some other regions include samples from multiple studies. But yes, for the most part each region's sequences went through separate DNA extraction and amplification. I hope this makes sense.

Thanks @el502,
That's interesting. The fact that there are a lot of host DNA contaminants in this one particular study makes me think perhaps there was something subpar with their DNA extraction/amplification process. Maybe it was just a super low biomass sample and their primer picked up a lot more human targets.
What is the tissue type being analyzed in this case? I've seen similar issues with mouse intestinal tissues when really strong bead-beating steps are used.

Apologies for the late reply. Thanks for your interest.

These are oral mucosa tissues. I think there may have been something subpar, as you mentioned. That said, the authors used blastn searches against other databases to find the best match for assigning taxonomy, which is a different approach from the one we've used for the data from other regions.


Got it!
Depending on their collection/extraction methods, oral mucosa could potentially have lots of host contaminants. So, my best guess here is that this particular set just has lots of non-bacterial reads. But of course there's no way to be sure and everything @SoilRotifer mentioned earlier could just as easily be the real culprit too. :man_shrugging: