Using merge-taxa before building classifier

We have downloaded marine invertebrate sequence and taxonomy data from NCBI (loosely following this Building a COI database from NCBI references) and then used feature-table merging (as per Merge separate sequence and taxonomy artifacts output from RESCRIPt) as we had to download the data in small chunks to avoid downloads failing.

The way we downloaded was to try downloading from a high taxonomic rank and if that was too big, we downloaded the separate taxa at the next rank down. This means that the downloads are at various taxonomic levels, depending on the size of the taxa.

We then constructed our classifier using feature-classifier fit-classifier-naive-bayes and used it with feature-classifier classify-sklearn with our sequence data.

This results in our ASVs being identified either to the sub-taxa where they belong or to the top rank included in the taxonomy (metazoa). This is not as such unexpected, but it would be nice to have unidentified things end up in the nearest higher level rank. For example, we found that mollusca was too big to download in one group, so it was split at the next taxonomic rank. Within that, gastropoda was also to large, and was split up into its lower groups. Now when we have an unidentified gastropod, this does not become assigned to gastropoda or even mollusca, but is assigned to metazoa (we think). This as far as I can tell, is by necessity, as the information needed for it to classified as unidentified gastropoda or mollusca is not included in our classifier (as it only includes lower taxonomic ranks).

Is there a way to construct a classifier from piecemeal NCBI downloads that would "know" which higher taxon to assign an unidentified ASV to? Should we download unclassified data for the taxonomic ranks that are too large in addition to the identified lower ranks?

To explain why we want all of these unclassified; the vast majority our reads are currently being classified as "metzoa", so we believe it would at least help us determine approximately what we have in our data.

Hi @asajoh,

It is difficult to determine what the issue is based on your description.

Downloading in separate batches should not matter as long as you are pulling the same taxonomic ranks with the --p-ranks option of rescript get-ncbi-data.

"metazoa" is not a true taxonomic rank and is not pulled using rescript get-ncbi-data. Thus, I am not sure how you are obtaining "unclassified metazoa" as a result. Without looking at your files, there is a chance you are mixing different formats of taxonomic strings from different sources. That is, the taxonomic strings are not consistent. You can look into rescript edit-taxonomy to try and make the taxonomy strings consistent.

Can you share your commands, and/or your taxonomy file(s)? You can share these with me via private message.

Well, that is certainly interesting. In the middle of something else right now, but we'll take a closer look and get back to you. As always, thanks for the help and useful feedback :slight_smile:

1 Like

Hi @asajoh,

It is difficult to determine what the issue is based on your description.

Downloading in separate batches should not matter as long as you are pulling the same taxonomic ranks with the --p-ranks option of rescript get-ncbi-data.

We did not specify ranks in the download, so our code was something like this (depending on the taxa we were downloading):

!qiime rescript get-ncbi-data \
--p-query 'txid31265[ORGN] OR txid6340[ORGN] OR txid7563[ORGN] OR txid10205[ORGN] OR txid10229[ORGN] AND (cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title])' \
--p-n-jobs 16 \
--o-sequences ncbi-group1-refseqs.qza \
--o-taxonomy ncbi-group1-taxonomy.qza

"metazoa" is not a true taxonomic rank and is not pulled using rescript get-ncbi-data. Thus, I am not sure how you are obtaining "unclassified metazoa" as a result. Without looking at your files, there is a chance you are mixing different formats of taxonomic strings from different sources. That is, the taxonomic strings are not consistent. You can look into rescript edit-taxonomy to try and make the taxonomy strings consistent.

We are unsure how this has happened, but the top level in the taxonomies is definitely "Metazoa" (though I did not previously capitalise it). The taxonomies also seem to look the same in the different subgroups. I have attached the taxonomy from a couple of the subgroups that we downloaded as an example.

At the moment, we are trying a couple of things as part of troubleshooting. We had a look at the merged taxonomy, and it has fewer lines than the total of the groups. Perhaps this is due to some overlap at higher levels? There are not a lot of lines missing, perhaps 1% of the total expected number of lines given that the individual taxonomies were simply stitched together. For clarity, this is the code we used to merge:

!qiime feature-table merge-taxa \
--i-data \ncbi-group{1..44}-derep-taxonomy.qza \
--o-merged-data ncbi-marine-inv-taxonomy.qza

We've tried it with 2 and 3 files as well as naming two files individually instead of using curly brackets to denote the groups that we want. The results are the same for both methods and even when merging fewer groups, we are losing a few lines.

Another thing is that we ran

!qiime feature-classifier extract-reads \
--i-sequences ncbi-group1-filtered-refseqs.qza \
--p-f-primer CGWACWGGWTGAACWGTWTAYCCYCC \
--p-r-primer TANACYTCNGGRTGNCCRAARAAYCA \
--p-n-jobs 24 \
--o-reads ncbi-group1-primer-refseqs.qza

on the sequences before training the classifier, so we are wondering about a couple of things:

  • Should we have run "dereplicate" again AFTER extracting reads? Are we causing a mismatch between the taxonomy and the sequences?
  • We wanted to run rescript evaluate-fit-classifier, but we ran out of disk space, so we've run feature-classifier fit-classifier-naive-bayes. Is there a "correct" way of doing this and how can we tell if the classifier is ok? I'm googling at the moment, but I'm having trouble finding a good way to evaluate the resulting classifier. However, that's next on the agenda.

Running qiime tools validate ncbi-marine-inv-classifier.qza returns "appears to be valid at level=max".
ncbi-group1-taxonomy.qza (408.7 KB)
ncbi-group39-taxonomy.qza (7.1 KB)

Hi @asajoh,

When it comes to certain marker genes, like COI, particularly for eukaryotes, I like to include more intermediate taxonomic ranks. This often has the advantage of being better able to discriminate among groups, but also has the disadvantage of using up more resources to create and use a classifier, if you go that route.

For example, if you look at the taxonomic information for Hypselodoris zephyra, you'll see the full taxonomy. If you mouse-over each taxonomy, a tool-tip will appear displaying the rank. You should be able to select most of these ranks via the plugin (see the help text to determine which ranks are allowed; for example anything simply labeled as "clade" is currently not allowed). But I'd suspect this would mitigate some classification issues.

But off-hand I see nothing wrong with your command. :+1:

This is my mistake. Some databases use different nomenclatural rules / annotations, etc... I think I confused myself in my initial response about "Metazoa" as a taxonomic annotation. It appears that NCBI Taxonomy does slot "Metazoa" within the rank of "Kingdom". Though we can debate on weather or not this is a "real" taxonomic designation. :thinking:

Do you have a specific example? I quickly made the QZVs and scrolled through these files, and I did not notice anything out of the ordinary. Just an FYI, --p-rank-propagation is enabled by default, and is generally explained within our SILVA tutorial under the Getting SILVA data the easy way section. Just click on the expandable menu "Rank Propagation".

There is no ned to do this, but I prefer to do this in order to help keep my resource use low.

Nope. I'd simply just run feature-classifier fit-classifier-naive-bayes and then rescript evaluate-classifications.

This should not happen. Do you have QZVs of the barplots you can share? It is unclear whether this is a "problem" with the classifier or just poor taxonomic resolution of COI, either due to poor annotation within the reference database or, simply that the amplicons themselves are not good at resolving certain clades. Both are common issues. :weary:

Sadly, in my own experience with metabardoing, including COI and other genes. I often find that there are quite a few issues with incorrectly annotated sequence data. Especially with moluscs and arthrpods. That is, I've come across many instances in which a submitted sequence was not what the submitters thought it was... that is contamination, or poor isolation of the organism that was intended to be sequenced. For example many arthropod sequences are in fact molluscs, and vice versa. AFAIK, these are still only updated / corrected by the initial submitters of the data, often not post-hoc curated by NCBI.

If you run rescript dereplicate with the --p-mode lca option (either on the full sequence, or the extracted amplicon sequence), and then visualize the taxonomy like so:

qiime metadata tabulate \
    --m-input-file ncbi-derep-taxonomy.qza \
    --o-visualization ncbi-derep-taxonomy.qzv

and compare to the rescript dereplicate --p-mode uniq you should be able to tease apart which sets of sequences are being collapsed. Particularly those sequences that have been completely mis-annotated. This is indeed the most frustrating part of curating a reference database. :tired_face:

Keep us posted! I am quite interested in what you find! :postbox:

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.