Combining taxonomy from different classifiers to eliminate unidentified sequences? [newbie question]

Nicholas_Bokulich · April 23, 2024, 12:56pm

Welcome to the forum!

Great question — we developed a method in the RESCRIPt plugin precisely for this type of use case, where you might want to merge multiple taxonomies to find the "better" classification. See the description of the merge-taxa action here (and the rest of that tutorial might be of interest as well for other actions in RESCRIPt that you can use to modify taxonomies):

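You will not want to merge using the lca mode, since a least-common-ancestor merge keeps only the ranks that the classifiers agree on. Instead, try the len (length) mode, so that you take the longest taxonomy classification (i.e., the one with the most ranks) as the best result.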

But you should definitely approach this with caution, as longest does not necessarily mean best. I recommend inspecting the BLAST results carefully to make sure that the hits look reasonable and that you are not seeing any unexpected clades.
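To make the "longest classification wins" idea concrete, here is a minimal Python sketch of that selection rule. It is only an illustration and not RESCRIPt's actual implementation: the function names, the feature IDs, and the assumption that taxonomies are semicolon-delimited strings with rank prefixes like "g__" are all hypothetical.

```python
from __future__ import annotations


def rank_depth(taxon: str) -> int:
    """Count the informative ranks in a semicolon-delimited taxonomy string."""
    depth = 0
    for rank in taxon.split(";"):
        rank = rank.strip()
        # Assumption: empty ranks, bare prefixes such as "g__", and
        # "Unassigned" carry no information and do not count.
        if rank and not rank.endswith("__") and rank.lower() != "unassigned":
            depth += 1
    return depth


def merge_longest(*taxonomies: dict[str, str]) -> dict[str, str]:
    """For each feature ID, keep the classification with the most ranks.

    Ties keep the classification from the earlier taxonomy.
    """
    merged: dict[str, str] = {}
    for taxonomy in taxonomies:
        for feature_id, taxon in taxonomy.items():
            current = merged.get(feature_id)
            if current is None or rank_depth(taxon) > rank_depth(current):
                merged[feature_id] = taxon
    return merged


# Hypothetical classifications from two classifiers.
sklearn_taxa = {
    "asv1": "k__Fungi;p__Ascomycota",
    "asv2": "Unassigned",
}
blast_taxa = {
    "asv1": "k__Fungi",
    "asv2": "k__Fungi;p__Basidiomycota;c__Agaricomycetes",
    "asv3": "Unassigned",
}

merged_taxa = merge_longest(sklearn_taxa, blast_taxa)
for feature_id, taxon in sorted(merged_taxa.items()):
    print(feature_id, taxon)
```

Note that a feature left unassigned by both classifiers (asv3 here) stays unassigned, and that this rule says nothing about whether the deeper classification is actually correct, which is why the spot-checking matters.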

Usually unclassified ITS sequences are non-target hits, e.g., to host plant/animal or other non-fungal eukaryotes. So you might want to use NCBI BLAST to spot-check a few. If this is the case, I suggest filtering these out and moving on. But sometimes sequences fail to classify for other reasons, e.g., because they are in mixed orientations (the classify-sklearn method assumes that all sequences are in the same orientation, whereas the classify-consensus-* actions can classify in both orientations by default). This does not appear to be the issue here, as BLAST is still leaving many sequences unclassified, so I suspect non-target DNA.

Good luck!