Classifier tool discards sequences that aren't found in the reference database

przemekiljan · March 9, 2021, 1:53pm

Hello fellow qiimers! This is my first post here, so I hope it follows all the community guidelines.

I'm using QIIME 2 to perform 16S metagenomic analysis. Recently I wanted to review all the available classification tools (I had been using the vsearch classifier until now), and when I tried to incorporate other commands from feature-classifier into my workflow, I received the following error message from the taxa barplot tool:

Feature IDs found in the table are missing from the taxonomy: {'b7d56027f65de48259ecde66b95a2247', '452c2a0920b0890c86a02f3079c2f756'}

Upon further inspection, I found that the number of features doesn't match between the classifier's input (1.8 MB) and output (754.2 KB). Indeed, both missing ASVs belong to species that can't be found in my curated silva database: they come from two separate fungal organisms, and my database targets the V3-V4 region of the bacterial 16S gene.

Previously, I found that the vsearch classifier treats such sequences as unassigned and leaves them in the output (example (646.3 KB), which lists the aforementioned IDs as "Unassigned"; it is the output of the classify-consensus-blast method, except that the unmodified silva database was used). However, the classify-hybrid-vsearch-sklearn classifier, as well as the classify-consensus-blast method, discards them, which can later lead to consistency issues such as the taxa barplot error shown above.

I believe there is some way around this using various filtering methods, but I haven't been able to come up with anything elegant yet.

What I find most interesting is that running the two commands that make up the hybrid classification method separately doesn't cause this problem, but I suppose they were heavily modified in order to work in combination.

Nicholas_Bokulich · March 9, 2021, 3:38pm

Welcome to the community, @przemekiljan !

This is actually expected behavior, but that might not be very apparent. An initial pre-filter step is performed to remove query sequences that very poorly match the reference sequences. To quote the help documentation:

First performs rough positive filter to remove artifact and
low-coverage sequences (use "prefilter" parameter to toggle this step on
or off).

So you can switch the prefilter parameter off to keep those two sequences from being filtered out.
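As a rough sketch of what turning the pre-filter off might look like: the small wrapper below re-runs the hybrid classifier with the flag negated. The function name and all file names are hypothetical placeholders, and the `--p-no-prefilter` spelling assumes QIIME 2's usual convention for boolean parameters, so it is worth confirming against the `--help` output of your release.

```shell
# Sketch: re-run the hybrid classifier with the pre-filter switched off.
# Every file name passed in is a placeholder for your own artifacts.
classify_without_prefilter() {
    # $1 query reads, $2 reference reads, $3 reference taxonomy,
    # $4 trained sklearn classifier, $5 output taxonomy
    qiime feature-classifier classify-hybrid-vsearch-sklearn \
        --i-query "$1" \
        --i-reference-reads "$2" \
        --i-reference-taxonomy "$3" \
        --i-classifier "$4" \
        --p-no-prefilter \
        --o-classification "$5"
}
```

It could then be called as `classify_without_prefilter rep-seqs.qza ref-seqs.qza ref-taxonomy.qza classifier.qza taxonomy.qza`. With the pre-filter off, poorly matching queries should reach the classification steps instead of being dropped beforehand.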

Alternatively, to remove the features that are missing from the taxonomy from your table, do something along these lines:

qiime feature-table filter-table \
    --i-table your-table-that-has-all-features.qza \
    --m-metadata-file your-taxonomy-output-from-classify-hybrid-that-is-missing-features.qza \
    --o-filtered-table filtered-table.qza
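Before filtering, it can be useful to see exactly which feature IDs were dropped. The helper below is only a sketch under a few assumptions: that the table and the taxonomy have already been exported to tab-separated text (for example with `qiime tools export`, plus `biom convert --to-tsv` for the table), that the feature ID sits in the first column of both files, and that the function and file names are hypothetical.

```shell
# Print feature IDs present in a table TSV but absent from a taxonomy TSV.
# Assumes tab-separated files with the feature ID in the first column;
# lines starting with "#" and a "Feature ID" header row are skipped.
missing_ids() {
    table_tsv=$1
    taxonomy_tsv=$2
    awk -F '\t' '
        /^#/ || $1 == "Feature ID" { next }          # skip comments and headers
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # taxonomy file: remember IDs
        !($1 in seen) { print $1 }                   # table file: report unseen IDs
    ' "$taxonomy_tsv" "$table_tsv"
}
```

Running `missing_ids feature-table.tsv taxonomy.tsv` should list the same IDs that the taxa barplot error reports, which gives a quick way to check that only the expected features are about to be removed.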

This hybrid pipeline is actually quite simple: it just adds that initial pre-filter, which is why it intentionally filters out those sequences while the individual classification methods do not.

I hope that helps!
