Annotation and Classifier question

Hi,

I am trying to understand something here. I have a sequence in the ref database that has this sequence:-

Seq.txt (1.7 KB)

In the taxonomy, its labeled as

RS_GCF_041226625.1~NZ_JBCLUQ010000046.1 d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila

I ran this command:-
qiime feature-classifier classify-sklearn --i-classifier /home/qiime2_2025_7/gtdb_classifier_r226-edited_v2.qza --i-reads dada2-ccs_rep_filtered_v2.qza --o-classification dada2-ccs_classified_filtered_v2.qza --verbose &> DADA2_classifier.log

I was just looking at the output and I have an ASV that I am confused about. The ASV is is d5ebbbf0f104adc463156fca557bb6fc with annotation as d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Muribaculaceae. When i extracted the representative sequence, it matches the RS_GCF_041226625.1~NZ_JBCLUQ010000046.1 perfectly (barring the primer regions). Here is the file that contains those details.

ASV_Taxa_RepSeq.txt (1.6 KB)

I am not able to understand that if the representative sequence of the ASV matches with the sequence in the database, and the database annotation says Akkermansia, why is it getting annotated as d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Muribaculaceae? Should I be changing any parameters? Any help or advice would be appreciated. Thank you.

Hi @ankurnaqib,

Although there is an exact match to that representative sequence, the key question to ask is how many other reference sequences exactly match? Often the case is, multiple other reference sequences, with differing taxonomy, have the same match. Thus, the classifier is retuning the Lowest Common Ancestor (LCA) taxonomy as it is unable to disambiguate among several exact matches.

I was able to determine that you likely used the --p-db-type All when downloading the GTDB database to make your classifier. I downloaded the corresponding file from GTDB here. If you open a text editor and have it search for exact matches to your ASV, you will find that there are 223 exact sequence matches to different taxa. These exact matches are:

d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;...
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Muribaculaceae;...
d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiia;o__Verrucomicrobiales;f__Akkermansiaceae;...
d__Bacteria;p__Bacillota;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;...
d__Bacteria;p__Bacillota;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;...
d__Bacteria;p__Actinomycetota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;...
... etc ...

Thus, the resulting LCA taxonomy of common hits is being returned.

-Cheers!

2 Likes

Thank you for responding. I thought of the same as well. But when I searched within the database, no other sequence matched with that sequence. There is only one sequence in the database that matched with this sequence. I am attaching the dna-sequences.fasta and the taxonomy file here.

Row 83093 and 83094 are an exact match to this sequence. And this representative sequence does not match any other sequence in the file. Or am I thinking it all wrong? Any advice would be really helpful. Thank you.

I am unable to download the files at the moment. What are these files you are sharing? Are they the GTDB files? If so, I've already looked at them via the GTDB site directly. I think there were 2 sequences with exact matches in the combined Archaea and Bacteria SpeciesReps files (from here), but there are 223 matches in the All file, linked in previous reply.

How did you construct the reference database? What commands did you use? Please provide all commands and their options that resulted in the gtdb_classifier_r226-edited_v2.qza file.

Finally, the file name makes me suspect that there were edits to the files prior to making the classifier.

Okay, I was able to grab the files you shared. Where is this file from?

I contains substantially less data than the ALL file I can pull from GTDB, and much more than the combined SpeciesReps files from Archaea and Bacteria.

Were these files filtered / edited?

-Mike

Hi Mike,

Yes, the database was edited. We first removed one sequence that was an exact replicate of another sequence. I wrote about that on the QIIME Forum as well as the GTDB forum. We then removed 56 more ASVs that we found to be misannotated based on our isolates data and also based on blast results. The script that I used to generate the database is here:-

GtDB_Classifier_Script.txt (3.3 KB)

Do you think this process that we undertook is wrong? Thank you so much for looking into this.

Ankur

Thank you for sharing this @ankurnaqib

I think you can skip the filter-seqs-length-by-taxon step. Or you can simply use filter-seqs-length to select the smallest size you expect from your amplicon. Also, there are no Eukaryotes in the GTDB files. I'd try skipping this and see what happens. But definitely keep the cull-seqs step. Anyway, I think the length filtering, with the specified options, might be removing too much reference data.

Another option, is to try using the qiime rescript extract-seq-segments approach. That is, use your data, or quality sequence data that you trust, to align and extract similar reads from the GTDB reference sequences. This will obviate the need for length filtering.

There is no need to export and re-import data into QIIME 2. You can simply use the built-in functions qiime taxa filter-seqs ... and qiime rescript filter-taxa ... to remove unwanted sequences and taxonomy from each of the files. Also, you can make use of qiime rescript edit-taxonomy ... to edit problematic taxon labels. This way, everything you do is recorded in provenance and it is clear to others what you've done. Heck if you are able to confidently assert that a taxonomic label should be something else, then you should be able to simply replace the label.

Keep in mind, much of the SILVA database tutorial is not meant to be an SOP. Just a series of examples that were organized for simplicity. As mentioned throughout the tutorial, alter the curation steps as needed for your use case.

I'd strongly suggest that you also try assigning your taxonomy to SILVA 138.2. Just as a matter of sanity checking some taxonomic assignments. I will say that, for some projects, I will run my data through SILVA first, to identify and remove host, mitochondria, and chloroplast reads. Remove these reads from the data, then re-classify using GTDB. GTDB, in my experience, will incorrectly assign bacterial/archaeal taxonomy to things that are very clearly mitochondria, etc...

2 Likes

Hi Mike,

Thank you for your response. I think till the time i use the ‘uniq’ method to dereplicate, the ASV I am trying to investigate, will always annotate as Muribaculaceae, even though I know it is Akkermansia because I did a NCBI BLAST on the representative sequence. Since last night, I even tried generating the classifier using the ‘majority’ method. That solved that issue of ASVs being annotated as Muribaculaceae even though they are Akkermansia. But it brought another issue, a few ASVs that were previously with the ‘uniq’ method being annotated as d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Phocaeicola;s__Phocaeicola vulgatus are just being annotated as d__Bacteria. I am attaching the script I used to generate this version of the classifier. I wasnt able to incorporate all the changes you had mentioned but that should change the annotation.

GtDB_Classifier_Script_majority.txt (1.3 KB)

We have tried SILVA before but the amount of data that is unclassified or unassigned at the species level is very high. I should mention that the data that we are working with is PacBio full-length 16S (27F-1492R primer set) so there is a higher expectation of getting more species level resolution in comparison to SILVA. Do you think I should still can skip the filter-seqs-length-by-taxon step? Or should I just use the filter-seqs-length wherein in can use the --p-global-min as 1200 and --p-global-max as 1600. Do you think that is a feasible idea? Thank you so much for your help.

Ankur

But how many top hits are equivalent and hit other taxa? Remember BLAST will arbitrarily sort top hits that are equivalent. But in this case, it appears that most of the top 500 hits are to Akkermansia using general megablast search, with some other different near neighbors. But using the 16S/ITS db w/ megablast close hits from other taxa are observed too. But again, as I've referenced above there are over 200 exact matches to other taxa in GTDB. So, the tools is telling you that it cannot disambiguate these differences. Perhaps these are database annotation issues? I do not know. You could try constructing (clawback) / downloading environment specific classifiers from SILVA.

Again, it depends on how often close matches are hit and contribute to the taxonomic assignment. Also, many databases might suffer from taxonomic and/or other curation biases, which can alter the assignments. See Figure 5 of the RESCRIPt paper. Which is why, it's a good idea to try different approaches.

I'd avoid the majority method as it does not work like most people assume. See here. At least make sure here are no hybrid taxonomy strings. This method assumes that the annotations are of high quality and rigorously curated to avoid hybrid taxonomy string outputs. But as you set --p-perc-identity 0.99, that might be okay, and mitigate hybrid taxonomy strings. Though, there are many different taxa that might share exactly the same sequence but have differing taxonomies (as we are observing with your query against GTDB above) . Hence the uniq mode. You could try altering the --p-confidence of classify-sklearn and see if that helps. Probably, also try classify-consensus-vsearch or classify-consensus-blast too?

There are many clades that cannot be disambiguated at the genus / species level even if you have full-length 16S rRNA gene sequence. This is expected for the 16S rRNA gene regardless of reference database used. This is why SILVA explicitly does not curate to the species level taxonomy. In fact, we only provide "species" labels as an option in RESCRIPt, and we warn about using this option in the SILVA tutorial. In a nutshell, if you do observe many species-level assignments, this is likely a problem of over-classification. At least I'de be wary... even with full-length 16S rRNA gene sequences. Remember, the 16S rRNA gene is actually a very conserved gene overall. If you'd like to obtain high quality species / strain level assignments, then the use of other markers and/or shotgun metagenomics should be used.

Finally, do not base the apparent quality of a reference database simply because it returns more or less genus and/or species level annotations, these could be signs of under-/over-classification problems. Also, this is why RESCRIPt exists, you can fetch reference data from any database and curate to your needs in order to reduce the assignment problems you are referring to.

Yeah, taxonomy assignment is not as easy as many think. :slight_smile:

I am not sure I have any further ideas at the moment. Perhaps other people smarter that I have insights? But please do keep us posted on what works.

1 Like

Hi Mike,

So we have a good news and one more question. I removed the filter-seqs-length-by-taxon step and that seems to be doing the trick. I did not realize this and its interesting to see that so much good sequences come from these low length reads in the reference database. I then dereplicated using the majority method while keeping the percent identity as 0.99. The classifier we have now is far more accurate than whatever we have had so far. So thank you so much for helping with that.

I have one other question. If we want to change the annotation of a particular ASV or a few ASVs based on specific ASV id in the database itself so that when it is called the next time, it is called with the updated annotation, how can we achieve that? I am aware that I cannot change the classifier itself. I have to change the annotation in the taxonomy qza file and the corresponding annotation in the sequences qza file and then re-train the classifier. But is there a way to do that using rescript? Thank you.

Ankur

Yeah, this is why I tried to make it clear for users to process they way that works best. Ultimately, I really should update that tutorial. For example, I follow the approach as outlined in this figure for my Snakemake pipeline, I also have a Nextflow variant here. I need to clean those up too. :grimacing: Perhaps I should reference these pipelines in the SILVA tutorial? :thinking:

Yep, I linked this earlier in the thread, see qiime rescript edit-taxonomy .... You only need to edit the taxonomy file. :slight_smile:

Hi Mike,

Thank you for sharing that. Yes, I think these should be referenced in the SILVA tutorial.

Also, thank you for sharing the qiime rescript edit-taxonomy link as well. But I believe this is changing basically all instances of a taxa to a different name, like for example that example will change all instances of g__Salmonella; to g__Escherichia-Shigell.

Is there a way I can edit taxonomy of just a few specific ASVs based on their ASV ids? And I would also need to train the classifier again then? I hope my understanding is correct. Thank you.

Ankur

1 Like

:+1:

Sadly no... I should add an option to specify an feature-IDs. One round-about way would be to:

  1. run qiime rescript filter-taxa --m-ids-to-keep-file ... on the taxonomy file twice.... Once to remove the feature-IDs (read the --p-include and --p-exclude help text), from the main taxonomy file and write out a new taxonomy file without those reads. Let's call this taxonomy-minus.qza.

  2. Then run qiime rescript filter-taxa --m-ids-to-keep-file ... again, to write a file that only contains the taxonomy of the given feature-IDs that you'd like to edit. We'll call this, taxa-to-edit.qza. You'd edit this latter taxonomy file as you normally would with qiime rescript edit-taxonomy, and call this output edited-taxa.qza.

  3. Then you can merge the edited taxonomy file back to the taxonomy file that you moved them from by using:

    qiime feature-table merge-taxa \ 
     --i-data taxonomy-minus.qza edited-taxa.qza \
     --o-merged-data updated-taxonomy.qza
    
  4. Then make your classifier with this taxonomy file.

Note, we split and merge so that we do not have duplicate feature-ids with differing taxonomy. Again, this is a bit onerous.. but it will keep everything in provenance.

-Cheers!

1 Like

Oh wow. That is great. I didnt think that we could use include and exclude that way. Thank you so much for that. But there is one other thing I noticed. The feature table (Table_Features_Taxa.xlsx) has the regular MD5 ASV ids. But the taxonomy file (gtdb-226-nonspecies-rep-tax-dereplicated.qza) and the sequences (gtdb-226-nonspecies-rep-seqs-cleaned-filtered-dereplicated.qza – not able to attach this due to size restrictions) file that I used to create the classifier do not contain those ASV ids. Now I have no way to make a crosswalk or map the ASV ids from the feature table to the taxonomy or sequence file.

Table_Features_Taxa.zip (1.5 MB)

gtdb-226-nonspecies-rep-tax-dereplicated.qza (2.5 MB)

I am sorry that I keep digging and keep getting stuck every 10ft. I hope I will finally reach where the water is!

Here is the script I used to generate the classifier.

GtDB_Classifier_Script_majority_updated.txt (998 Bytes)

Thank you.

Ankur

The trick is to perform the editing just after you download the GTDB sequence and taxonomy files.

qiime rescript get-gtdb-data \
    --p-db-type All \
    --p-url-type Mirror \
    --output-dir ./gtdb-226-all \
    --verbose

The original IDs from GTDB are still there even after you dereplicate and/or cull the sequences. Thus, you should be able to export any of those files and double check:

cd ./gtdb-226-all

qiime tools export \
    --input-path ./gtdb_taxonomy.qza \
    --output-path ./gtdb_taxonomy_export

qiime tools export \
    --input-path ./gtdb_sequences.qza \
    --output-path ./gtdb_sequences_export

You can also make a visualization that combines the information of both of these files as shown below. Note, it is easier if the data are dereplicated first. Otherwise it might take a while to load the QZV, if at all, using the full data.

# using default settings for dereplication
qiime rescript dereplicate \
    --i-sequences ./gtdb_sequences.qza \
    --i-taxa ./gtdb_taxonomy.qza \
    --o-dereplicated-sequences ./gtdb_sequences_derep.qza \
    --o-dereplicated-taxa ./gtdb_taxonomy_derep.qza \
    --p-threads 8 \
    --verbose

qiime metadata tabulate \
    --m-input-file ./gtdb_taxonomy_derep.qza ./gtdb_sequences_derep.qza \
    --o-visualization ./gtdb_tax_seq_derep_refdata.qzv 

Then you can carry out the edit-taxonomy operations I mentioned above prior to doing any other curation. Then you can make your classifier.

1 Like

Hi Mike,

I am so sorry that I wasnt able to explain properly in my previous post. So for example, I have this ASV id 13a49b93b59ab38bc6aa1da8e3d370a5 that I get once I classify the data and generate a feature table. This is the ASV that I want to edit the taxonomy for. The issue is that I am not able to match this ASV id with the GTDB classifier because the ids in the GTDB classifier are labelled like this - GB_GCA_000712235.1~KL544024.1. How do I match these ASV ids with the GTDB ids? Or is there a way to create a crosswalk that shows that a particular ASV id 13a49b93b59ab38bc6aa1da8e3d370a5 is matched with the GB_GCA_000712235.1~KL544024.1 in the GTDB classifier for example? I tried searching or matching the representative sequences with the sequences in the GTDB sequences file but could not match them. Thank you so much for all the help.

Ankur

Your sequence ID will never match the reference sequence ID, weather or not they are your own labels or ASV labels. I'd not recommend manually changing the taxonomy assigned to your ASV, as this would not represent what the curated GTDB classifier represents. If you do change the taxonomy, then you need to make sure that is explicitly stated in any manuscript.

It appears that you are leaning towards NCBI BLAST taxonomy assignments. If that is the case, then why not use the BLAST for classification instead of GTDB or SILVA? Remember when you are using a particular reference database, that database will contain different references and taxonomy, due to variation in curation. So you should view the output as "GTDB" says X, "SILVA" says "Y", and NCBI says "Z", etc...

I am not sure how your classifier does not contain any exact matches to your ASV? When I make the GTDB classifier using 'All', I observe them. Unless you are referring to a different ASV this time? If you are continuing to cluster the reference sequences to 0.99, that could be why those references do not appear in your reference set. Note, we do not recomend clustering reference sequences. The more you cluster your reference sequences, the less refined your taxonomic assignment. We discus this in our RESCRIPt paper. The historical reason for clustering reference sequences, was to save memory and hard drive space.

If the reference database was curated like we discussed earlier there should be many exact matches in the GTDB reference, at least for the ASV you provided previously. Which again... does not appear to me that you could get the classification you want using GTDB references, due to the myriad exact matches, for the reasons I outlined above about reference databases and the reliability of 16S rRNA gene sequences to provide genus/species level assignments.

If too many reference sequences and taxonomy are removed due to general filtering, then you could add them back and retrain the classifier.

I hope this helps.

5 Likes

Thank you so much for all the help Mike. I will work on this and keep all your suggestions in consideration.

Best,

Ankur

1 Like