Combining taxonomy from different classifiers to eliminate unidentified sequences? [newbie question]

sarsaparella · April 23, 2024, 12:23pm

Hello!

I'm trying to classify my fungal ITS amplicon data.

I classified the DADA2 representative sequences with qiime feature-classifier classify-sklearn using colinbrislawn's classifier (unite_ver10_dynamic_04.04.2024-Q2-2024.2.qza). After that, a lot of sequences were still unidentified (5785 out of 7635).

I manually extracted these sequences (k__Fungi;;;;;;;__), and classified them using qiime feature-classifier classify-consensus-blast against the same UNITE database:

qiime feature-classifier classify-consensus-blast \
--i-query unknown_sequences.qza \
--i-reference-reads unite_24/sh_refs_qiime_ver10_dynamic_04.04.2024.qza \
--i-reference-taxonomy unite_24/sh_taxonomy_qiime_ver10_dynamic_04.04.2024.qza \
--p-maxaccepts 10 \
--p-evalue 0.0001 \
--p-perc-identity 0.8 \
--p-query-cov 0.8 \
--o-classification blast_taxonomy.qza \
--o-search-results blast_search_results.qza \
--verbose >blast_taxonomy_paired.log

Then I modified (in R) the taxonomy table created by classify-sklearn: for sequences that were previously unidentified, I filled in the taxonomy from classify-consensus-blast wherever it produced one (412 sequences out of 5785).
My results are better now, but is it really okay to modify taxonomy files like this? Maybe I did everything wrong and there is a proper way to combine taxonomies?
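For readers who want to reproduce the fill-in step described above without R, here is a minimal shell/awk sketch. It assumes both taxonomies were exported to tab-separated files with a header row and the columns Feature ID, Taxon, and a score. The function name `fill_unidentified`, the file names, and the labels treated as "unidentified" (`k__Fungi`, `Unassigned`, empty) are hypothetical and should be adapted to the actual tables.

```shell
#!/bin/sh
# Sketch only: file names, column layout, and the "unresolved" labels are
# assumptions for illustration, not part of QIIME 2 or RESCRIPt.

fill_unidentified() {
    # $1 = primary taxonomy TSV (e.g. exported from classify-sklearn)
    # $2 = fallback taxonomy TSV (e.g. exported from classify-consensus-blast)
    # Both files: header row, then Feature ID <tab> Taxon <tab> score.
    awk -F '\t' -v OFS='\t' '
        # Assumed definition of "unidentified": empty, unassigned, or kingdom only.
        function unresolved(t) {
            return (t == "" || t == "Unassigned" || t == "k__Fungi")
        }
        # Pass 1 (fallback file): remember every resolved fallback taxon.
        NR == FNR {
            if (FNR > 1 && !unresolved($2)) fallback[$1] = $2
            next
        }
        # Pass 2 (primary file): write a new header, then one row per feature.
        FNR == 1 { print "Feature ID", "Taxon", "Source"; next }
        unresolved($2) && ($1 in fallback) { print $1, fallback[$1], "blast"; next }
        { print $1, $2, "sklearn" }
    ' "$2" "$1"
}

# Small demo with made-up rows.
demo_dir=$(mktemp -d)
printf 'Feature ID\tTaxon\tConfidence\n'           >  "$demo_dir/sklearn.tsv"
printf 'asv1\tk__Fungi;p__Ascomycota\t0.99\n'      >> "$demo_dir/sklearn.tsv"
printf 'asv2\tk__Fungi\t0.80\n'                    >> "$demo_dir/sklearn.tsv"
printf 'Feature ID\tTaxon\tConsensus\n'            >  "$demo_dir/blast.tsv"
printf 'asv2\tk__Fungi;p__Basidiomycota\t1.0\n'    >> "$demo_dir/blast.tsv"
fill_unidentified "$demo_dir/sklearn.tsv" "$demo_dir/blast.tsv"
rm -r "$demo_dir"
```

The extra Source column records which classifier each label came from, which makes it easier to audit the combined table later. The original score column is dropped on purpose, because sklearn confidence and BLAST consensus are not comparable values.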

Nicholas_Bokulich · April 23, 2024, 12:56pm

Hi @sarsaparella ,

Welcome to the forum!

Great question! We developed a method in the RESCRIPt plugin precisely for this type of use case, where you might want to merge multiple taxonomies to find the "better" classification. See the description of the merge-taxa action in the RESCRIPt tutorial (the rest of that tutorial might be of interest as well, for other RESCRIPt actions that you can use to modify taxonomies).

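You will not want to merge using the lca mode, since the least common ancestor of a shallow and a deep classification is just the shallow one. The length mode is probably a better fit, so that you take the longest taxonomy classification (i.e., the one with the most ranks) as the best result.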

But you should definitely approach this with caution, as longest does not necessarily mean best. I recommend inspecting the BLAST results carefully to make sure that the hits look reasonable and that you are not seeing any unexpected clades.
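To make the "longest classification wins" idea concrete, here is a rough shell/awk sketch of that selection rule applied to two exported taxonomy tables. It is not the RESCRIPt implementation. The function name `merge_longest`, the tie-breaking rule (keep the first table's label), and the choice to ignore empty ranks such as `g__` when counting depth are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: illustrates a length-based merge on exported TSV files.
# The function name, tie-breaking, and depth rule are illustrative assumptions.

merge_longest() {
    # $1, $2 = taxonomy TSVs with a header row: Feature ID <tab> Taxon [...]
    awk -F '\t' -v OFS='\t' '
        # Count ranks that carry a name, skipping empty ranks such as "g__".
        function depth(taxon,   parts, n, i, d) {
            n = split(taxon, parts, ";")
            d = 0
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", parts[i])
                if (parts[i] != "" && parts[i] != "Unassigned" && parts[i] !~ /^[a-z]__$/) d++
            }
            return d
        }
        FNR == 1 { next }                                  # skip both headers
        !($1 in best) { order[++count] = $1; best[$1] = $2; next }
        depth($2) > depth(best[$1]) { best[$1] = $2 }      # ties keep the first label
        END {
            print "Feature ID", "Taxon"
            for (k = 1; k <= count; k++) print order[k], best[order[k]]
        }
    ' "$1" "$2"
}

# Small demo with made-up rows.
demo_dir=$(mktemp -d)
printf 'Feature ID\tTaxon\nasv1\tk__Fungi;p__Ascomycota\nasv2\tk__Fungi\n' > "$demo_dir/a.tsv"
printf 'Feature ID\tTaxon\nasv1\tk__Fungi\nasv2\tk__Fungi;p__Basidiomycota\n' > "$demo_dir/b.tsv"
merge_longest "$demo_dir/a.tsv" "$demo_dir/b.tsv"
rm -r "$demo_dir"
```

Because the rule only looks at depth, a deep but wrong BLAST hit would still win over a shallow but correct label, which is exactly why the hits need to be inspected before trusting the merged table.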

Usually, unclassified ITS sequences are non-target hits, e.g., to host plant/animal or other non-fungal eukaryotes, so you might want to use NCBI BLAST to spot-check a few. If this is the case, I suggest filtering these out and moving on. But sometimes sequences fail to classify for other reasons, e.g., because they are in mixed orientations (the classify-sklearn method assumes that all sequences are in the same orientation, whereas the classify-consensus-* actions can classify in both orientations by default). This does not appear to be the issue here, as BLAST is still leaving many sequences unclassified, so I suspect non-target DNA.
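As a small illustration of what "mixed orientation" means, the sketch below reverse-complements a single DNA sequence in plain shell. A read stored in the opposite orientation is the reverse complement of the reference strand, so a classifier that compares sequences in one orientation only will see almost no shared subsequences. The function name `revcomp` is hypothetical, and the sketch handles only A, C, G, T (other symbols such as N pass through unchanged).

```shell
#!/bin/sh
# Sketch only: reverse-complement one DNA sequence passed as the first argument.
# Handles A/C/G/T in either case; any other character is left as it is.

revcomp() {
    printf '%s\n' "$1" | awk '{
        out = ""
        # Walk the sequence from the last base to the first.
        for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
        print out
    }' | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp "ATGC"
```

Running `revcomp "ATGC"` prints `GCAT`. Reverse-complementing a handful of unclassified sequences and pasting both versions into NCBI BLAST is a quick way to check whether orientation is a plausible cause.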

Good luck!

sarsaparella · May 2, 2024, 2:00pm

Thank you for answering my question and for confirming my approach!

I checked unclassified sequences and there were indeed a lot of non-targets BUT!! You were so right about suspecting mixed orientation sequences!

I ran qiime rescript orient-seqs on my representative sequences and got these results:

Forward oriented sequences: 413 (30.43%)
Reverse oriented sequences: 437 (32.20%)
All oriented sequences:     850 (62.64%)
Not oriented sequences:     507 (37.36%)
Total number of sequences:  1357

Then I classified the oriented sequences with classify-sklearn again and merged the resulting taxonomy with the taxonomy from the previous run, and this is what I got:

A lot fewer unassigned sequences! I am very happy!! Next I will try to classify the remaining unassigned sequences, plus those classified only as Fungi or Ascomycota, with BLAST to squeeze everything out of my data, and then I will delete the non-targets.
Thank you so much for your advice, expertise and for creating such an amazing tool with community around it!