Greengenes2 unclassified reads resolved by degenerate base change (H→N) in cutadapt 806R primer — why would this affect classification but not rep-seq statistics?

Hello! I am running the classification step on 16S rRNA data (V4 region) on some frog skin swab data using qiime2-amplicon-2024.10 (conda). I used the Greengenes2 pretrained classifier from the qiime2 website (2024.09.backbone.v4.nb.sklearn-1.4.2.qza), which I have successfully used on other datasets (tadpole gut microbiome, freshwater, etc) using the same version of qiime2. However, most of the reads showed up as unclassified or only classified to the kingdom level. Here is the code I ran:

qiime feature-classifier classify-sklearn
--i-classifier 2024.09.backbone.v4.nb.sklearn-1.4.2.qza
--i-reads qiime2_outputs/rep-seqs.qza
--o-classification qiime2_outputs/greengenes.taxonomy.qza

And a screenshot of the taxabarplot (set to level 5)

For context, SILVA (silva-138-99-515-806-nb-classifier.qza) successfully classified the same data just fine. After looking into it more, there were two things that changed between this analysis and previous ones I have done (aside from it being a different sample type). (1) We switched recently to the newest miseq i100 for sequencing. However, my labmate just used the same classifier on some bird data from this machine without issue. (2) I adjusted my cutadapt step to have H instead of N for the reverse primer:

Cutadapt code that resulted in poor classification with greengenes2, but worked with SILVA:

qiime cutadapt trim-paired
--i-demultiplexed-sequences qiime2_outputs/paired-end-demux.qza
--p-front-f GTGCCAGCMGCCGCGGTAA
--p-front-r GGACTACHVGGGTWTCTAAT
--p-adapter-f ATTAGAWACCCBDGTAGTCC
--p-adapter-r TTACCGCGGCMGCTGGCAC
--p-discard-untrimmed
--o-trimmed-sequences qiime2_outputs/paired-demux-trimmed.qza

I changed this step back to the old primer inputs and suddenly the classification worked using the same qiime2 version and pre-trained greengenes2 classifier.

Cutadapt code that worked with greengenes (reverting to N in --p-front-r and V in --p-adapter-f):

qiime cutadapt trim-paired
--i-demultiplexed-sequences qiime2_outputs/paired-end-demux.qza
--p-front-f GTGCCAGCMGCCGCGGTAA
--p-front-r GGACTACNVGGGTWTCTAAT
--p-adapter-f ATTAGAWACCCVNGTAGTCC
--p-adapter-r TTACCGCGGCMGCTGGCAC
--p-discard-untrimmed
--o-trimmed-sequences qiime2_outputs/paired-demux-trimmed_v2.qza

Screenshot of the taxabarplot (set to level 5)

What is confusing is that the denoising stats and rep-seqs files are fairly similar between the two runs. Sequence count, read lengths, and all percentiles are essentially the same (134,319 vs 134,360 total sequences; similar length percentiles; same min of 220 bp and max of 428 bp) and primers appear to be trimmed in both cases. The denoising stats are all within 5% of the number of reads retained. The feature sequences also appear to be similar when comparing the two rep-seqs.qzv files directly.

Really confused as to what went wrong – pretty new to microbiome analysis and would appreciate any insights into what might have happened.

2 Likes

Hello Emily,

Welcome to the forums! :qiime2:

I changed this step back to the old primer inputs and suddenly the classification worked using the same qiime2 version and pre-trained greengenes2 classifier.
...
denoising stats and rep-seqs files are fairly similar between the two runs

Spooky! Well, let's start here.

Can you use https://view.qiime2.org/ to open up the rep-seqs.qza from both the successful N run and the failed H run and view the top 2 or 3 most abundant ASVs side by side?

I suspect that while the dada2 stats may be similar, the actual ASVs will be just a little bit different and that's causing the problem for the classifier.

Let me see what you find inside those two files!

Hi Colin,

Thank you so much for the helping with this!

I checked rep-seqs.qza files and realized that the forward primer was still present in some of the forward reads, but not all of them. Re-ran cutadapt and realized that my forward reads have a lot of read through and when that is happening, the reverse complement was being trimmed from the end, but the forward primer was left on. So the forward primer was not trimmed in all of the reads. The reverse reads didn’t have much read through, so it doesn’t appear to have affected them. I re-ran cutadapt in two steps (1) primers, (2) reverse complements and that fixed things. Not sure why the classification was better with changing the degenerate base in the reverse primer before, but now it is working great with either version of the primer. Our newer sequencing results have longer reads (2 x 300 bp) than previous analyses, so it’s an issue we hadn’t encountered before.

Thanks again for your help!

1 Like