QIIME2-amplicon-2025.7 16S PacBio classification issues/discussion

Hi,

Apologies in advance for the long post. So I recently tried the GTDB226 classifier for the PacBio 16S full-length data. I am using the qiime2-amplicon-2025.7 version. Based on this reference - How to train a GTDB SSU classifier using RESCRIPt - Here is the script I used to generate the GTDB 226 classifier that is compatible with qiime2-amplicon-2025.7.

Generate_GTDB226_Classifier.txt (1.3 KB)

After I generated the data, I noticed around 697 ASVs that were only classified as “d__Bacteria”. I am not very surprised but there is this one ASV 8559b4a519962a4a022d270379a16656 that had an overall abundance of around 72,000 reads across all the samples. Here is the representative sequence for this ASV. Its length is 1,434 bases.

ASV RepSeq.txt (1.4 KB)

Now if I simply search or BLAST this ASV Representative sequence against the GTDB sequences, it matches perfectly with two sequences

  1. >RS_GCF_002885135.1~NZ_PJKM01000003.1 d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia massiliensis
  2. GB_GCA_937980385.1~CALJPV010000086.1 d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 [location=5203..6711] [ssu_len=1509] [contig_len=7155]

Here is its details along with the sequence. Length for both the sequences is 1,509 bases.

GTDB226_Ref_Seq.txt (3.4 KB)

I have a very basic question or topic of discussion here,

If we have a representative sequence that matches almost perfectly with one of the sequences in the classifier database, and that sequence has a proper annotation within the database, why does the sequence get annotated as “d_Bacteria”? And that too with a confidence score of 1. Is that because there are two sequences in the reference database that are exactly the same and they match with this ASV and that is causing the confusion within feature-classifier? The command I used to classify is -

qiime feature-classifier classify-sklearn
--i-classifier ~/qiime2_2025_7/gtdb_classifier_r226.qza
--i-reads dada2-ccs_rep_filtered.qza
--o-classification dada2-ccs_classified_filtered_GTDB226.qza
--verbose
&> DADA2_classifier.log

How should I interpret this and move forward? Any help or suggestions will be greatly appreciated. Thank you.

That seems like it to me. My suspicion is that the classifier is seeing equivalent hits to two different phyla for this sequence, so assigning it one level up which in this case is d__Bacteria. It might be worth reaching out to GTDB on their forum to see if they think one of these might be a misannotation. It seems like it would have to be. (For what it's worth, NCBI BLAST hits Akkermansia.)

4 Likes

Hi Greg,

Thank you for responding. I had the same thing when I BLASTed the representative sequence. I removed the d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__Collinsella sp937980385 sequence, re-generated the classifier and now it annotates the sequence as Akkermansia. I raised the issue on the GTDB forum as well - Misannotation in the GTDB226 Ref Database - Bacterial Taxonomy - GTDB Forum

Thank you

2 Likes

Sounds good @ankurnaqib. My one word of caution is that we don't know at this point that the Collinsella assignment is wrong - it seems highly likely that it is, but it's also possible that GTDB inherited a misannotation from NCBI, ... So just be cautious about proceeding with your modified reference database until you have some conclusive evidence (or ideally a new version of GTDB, so you don't have to justify your modification of the database to manuscript reviewers/etc). If you think of it, we'd love to have a follow-up when you hear back from GTDB, just to hear their perspective on it and so anyone who finds this post in the future knows how it was resolved. Good luck!

1 Like

Hi Greg. Yes, we had the same concern or atleast warning bells in our mind. The reason (other than the NCBI Blast and checking with other sources) we decided on deleting the “misannotation” from the database was:-

  1. another group in Amsterdam saw the same issue in a different dataset. They had sequences from an isolate so were surprised on seeing so much being annotated not even to the genus level.
  2. Akkermansia is pretty common and expected in the environment we are working in. So biologically, it made sense.

I did receive the response from the GTDB forum. Their reponse says that they expect such errors and cant do much about it. Here is a link to the topic page - Misannotation in the GTDB226 Ref Database - #3 by ankurnaqib - Bacterial Taxonomy - GTDB Forum

I am also sharing the response here:-

“Hi,

Thank you for your email. Unfortunately, the 16S rRNA sequence provided on the GTDB FTP site are not QC’ed in any manner. This means one should expect such errors to exist as incorrect binning of 16S sequence is not uncommon. We are working on a tool to QC large marker gene DBs that will address this issue, but this is several months away from being completed. In the meantime, you might consider using a 3rd party 16S DB with GTDB annotations such as provided by IDTAXA: DECIPHER - Downloads .

Cheers,
Donovan”

Thank you.

2 Likes

That's good to know - thanks for sharing @ankurnaqib!

And there might be more such sequences for sure. What I do on just a basic level is extract representative sequences of all the ASVs that are not annotated to even phylum level, I BLAST them against the core_nt database. That gives me a slight understanding of whether we are missing something big. Most of the times, the unannotated sequences are uncultured bacterium or host sequences. Taking this one sequence at a time. :grinning_face: Thank you for your support.

1 Like