Using merge-taxa before building classifier

SoilRotifer · May 8, 2023, 3:36pm

When it comes to certain marker genes like COI, particularly for eukaryotes, I like to include more intermediate taxonomic ranks. This often makes it easier to discriminate among groups, but the trade-off is that it takes more resources to build and use a classifier, if you go that route.

For example, if you look at the taxonomic information for Hypselodoris zephyra, you'll see the full taxonomy. If you mouse over each taxon, a tooltip will appear displaying its rank. You should be able to select most of these ranks via the plugin (see the help text to determine which ranks are allowed; for example, anything simply labeled as "clade" is currently not allowed). I'd suspect this would mitigate some classification issues.

But off-hand I see nothing wrong with your command.

This is my mistake. Some databases use different nomenclatural rules, annotations, etc. I think I confused myself in my initial response about "Metazoa" as a taxonomic annotation. It appears that NCBI Taxonomy does slot "Metazoa" within the rank of "Kingdom", though we can debate whether or not this is a "real" taxonomic designation.

Do you have a specific example? I quickly made the QZVs and scrolled through these files, and I did not notice anything out of the ordinary. Just an FYI, --p-rank-propagation is enabled by default, and is generally explained within our SILVA tutorial under the Getting SILVA data the easy way section. Just click on the expandable menu "Rank Propagation".
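To make the idea behind rank propagation concrete, here is a minimal Python sketch. It is only an illustration of the general concept (an empty rank inherits the nearest named rank above it), not RESCRIPt's actual implementation, and the function name `propagate_ranks` and the example lineage are made up for this post.

```python
def propagate_ranks(names):
    """Fill empty ranks with the nearest named rank above them.

    `names` is a lineage ordered from highest to lowest rank, with ""
    standing in for a rank that has no annotation.
    """
    filled = []
    last_known = ""
    for name in names:
        if name:
            last_known = name
        # An empty rank inherits whatever was last seen above it.
        filled.append(last_known)
    return filled


if __name__ == "__main__":
    # Hypothetical lineage with a missing class-level annotation.
    lineage = ["Metazoa", "Mollusca", "", "Nudibranchia"]
    print(propagate_ranks(lineage))
```

With propagation switched off, the empty slot would simply stay empty, which is what you would want to look for when comparing the two QZVs.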

There is no need to do this, but I prefer to, as it helps keep my resource use low.

Nope. I'd simply run feature-classifier fit-classifier-naive-bayes and then rescript evaluate-classifications.

This should not happen. Do you have QZVs of the barplots you can share? It is unclear whether this is a "problem" with the classifier or just poor taxonomic resolution of COI, either due to poor annotation within the reference database or simply because the amplicons themselves are not good at resolving certain clades. Both are common issues.

Sadly, in my own experience with metabarcoding, including COI and other genes, I often find quite a few issues with incorrectly annotated sequence data, especially with molluscs and arthropods. That is, I've come across many instances in which a submitted sequence was not what the submitters thought it was, whether through contamination or poor isolation of the organism that was intended to be sequenced. For example, many arthropod sequences are in fact molluscs, and vice versa. AFAIK, these records are still only updated / corrected by the initial submitters of the data, and are often not curated post hoc by NCBI.

If you run rescript dereplicate with the --p-mode lca option (either on the full sequence, or the extracted amplicon sequence), and then visualize the taxonomy like so:

qiime metadata tabulate \
    --m-input-file ncbi-derep-taxonomy.qza \
    --o-visualization ncbi-derep-taxonomy.qzv

and compare it to the output of rescript dereplicate --p-mode uniq, you should be able to tease apart which sets of sequences are being collapsed, particularly those sequences that have been completely mis-annotated. This is indeed the most frustrating part of curating a reference database.
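If scrolling through two QZVs side by side gets tedious, the comparison can also be scripted. Below is a rough Python sketch, assuming you have exported both taxonomy artifacts to TSV files with "Feature ID" and "Taxon" columns, and assuming that the representative sequence of a collapsed group keeps the same feature ID in both outputs (I have not verified that this always holds). The helper names and the toy data are invented for illustration.

```python
import csv
import io


def read_taxonomy(handle):
    """Read a two-column taxonomy TSV (Feature ID, Taxon) into a dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Feature ID"]: row["Taxon"] for row in reader}


def resolved_depth(taxon):
    """Count ranks that carry a name, skipping empty handles such as 'g__'."""
    depth = 0
    for rank in taxon.split(";"):
        rank = rank.strip()
        name = rank.split("__", 1)[1] if "__" in rank else rank
        if name:
            depth += 1
    return depth


def collapsed_features(uniq, lca):
    """List (feature_id, uniq_taxon, lca_taxon) where the two modes disagree.

    Rows are sorted so that the shallowest LCA labels come first, since
    those are the most likely to reflect badly mis-annotated sequences.
    """
    shared = sorted(uniq.keys() & lca.keys())
    rows = [(fid, uniq[fid], lca[fid]) for fid in shared if uniq[fid] != lca[fid]]
    return sorted(rows, key=lambda row: resolved_depth(row[2]))


# Toy data, invented for illustration only.
UNIQ_TSV = (
    "Feature ID\tTaxon\n"
    "seqA\tk__Metazoa; p__Mollusca; c__Gastropoda\n"
    "seqB\tk__Metazoa; p__Arthropoda; c__Insecta\n"
    "seqC\tk__Metazoa; p__Mollusca; c__Bivalvia\n"
)
LCA_TSV = (
    "Feature ID\tTaxon\n"
    "seqA\tk__Metazoa; p__Mollusca; c__Gastropoda\n"
    "seqB\tk__Metazoa\n"
    "seqC\tk__Metazoa; p__Mollusca\n"
)

if __name__ == "__main__":
    uniq = read_taxonomy(io.StringIO(UNIQ_TSV))
    lca = read_taxonomy(io.StringIO(LCA_TSV))
    for fid, before, after in collapsed_features(uniq, lca):
        print(f"{fid}\t{before}\t->\t{after}")
```

For real data, replace the io.StringIO(...) handles with open(path, newline="") on your own exported files. Anything that collapses all the way up to kingdom or phylum is a good candidate for a closer look at the original record.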

Keep us posted! I am quite interested in what you find!