Database for OTU clustering NCBI

Nicholas_Bokulich · February 3, 2020, 10:03pm

This is really quite typical — short amplicon reads (e.g., of 16S) typically cannot be resolved to species level because they match more than one species.

NCBI BLAST results are misleading — of course you get species-level classifications, because you are performing local alignment against other (usually longer) sequences that typically have species annotations. Just because a short read aligns to a reference, even perfectly, does not mean that is a correct match.

It is important to assess (1) how good the match is, e.g., how much of the query is covered and how many mismatches there are, and (2) how many other taxa have equally or similarly good hits.
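To make those two checks concrete, here is a minimal Python sketch that works on BLAST tabular output (the default 12-column `-outfmt 6` layout). It reports coverage and mismatches for the best hit and lists every taxon whose hit scores as well as the best one. The function names, the `tolerance` parameter, the example hits, and the `example_taxonomy` lookup are all invented for illustration; you would need to supply your own mapping from reference IDs to taxa.

```python
from collections import defaultdict

# Column order of default BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(lines):
    """Group tabular BLAST hits by query ID."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in ("length", "mismatch", "gapopen", "qstart", "qend"):
            row[key] = int(row[key])
        row["pident"] = float(row["pident"])
        row["bitscore"] = float(row["bitscore"])
        hits[row["qseqid"]].append(row)
    return hits


def summarize(query_hits, query_length, taxonomy, tolerance=0.0):
    """Describe the best hit and list the taxa whose hits score (nearly) as well.

    `tolerance` is a hypothetical knob: hits within this many bit-score
    units of the best hit count as "similarly good".
    """
    best = max(query_hits, key=lambda h: h["bitscore"])
    # Fraction of the query that takes part in the alignment.
    coverage = (abs(best["qend"] - best["qstart"]) + 1) / query_length
    tied = [h for h in query_hits
            if h["bitscore"] >= best["bitscore"] - tolerance]
    tied_taxa = sorted({taxonomy.get(h["sseqid"], "unknown") for h in tied})
    return {"coverage": coverage,
            "pident": best["pident"],
            "mismatches": best["mismatch"],
            "tied_taxa": tied_taxa}


# Made-up hits for one 250 nt query: two perfect matches to different species.
example_lines = [
    "q1\tref1\t100.0\t250\t0\t0\t1\t250\t10\t259\t1e-130\t462",
    "q1\tref2\t100.0\t250\t0\t0\t1\t250\t5\t254\t1e-130\t462",
    "q1\tref3\t98.0\t250\t5\t0\t1\t250\t5\t254\t1e-120\t435",
]
example_taxonomy = {"ref1": "Genus species_A",
                    "ref2": "Genus species_B",
                    "ref3": "Genus species_C"}

summary = summarize(parse_hits(example_lines)["q1"], 250, example_taxonomy)
print(summary)
```

In this invented example the query matches two species perfectly over its full length, so the top BLAST hit alone would suggest a species-level call that the data cannot actually support.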

Assessing the quality of matches by hand can be an arduous process, especially if you have hundreds or thousands of sequences. This is why QIIME 2 (and similar platforms) use other methods for taxonomic classification: to automate the process.

All that said, seeing unclassified sequences classify to bacteria with NCBI BLAST can sometimes indicate an issue with the database/classifier you are using, or with the query sequences. I recommend checking out the following troubleshooting steps just to make sure:

Different databases can often give different results, but not always better ones, and getting NCBI sequences into a QIIME 2-ready format can be a bit difficult (since NCBI does not release QIIME 2-formatted files). The link you provided is to QIIME 1 files, so it is probably woefully out of date even if it is formatted correctly.

A few options:

See the link above.
try training a classifier to your specific amplicon region (see the tutorial at qiime2.org for details)
try a different taxonomic classification method in q2-feature-classifier, like classify-consensus-vsearch
check out q2-clawback: Using q2-clawback to assemble taxonomic weights
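
To illustrate why a consensus-style method avoids the BLAST top-hit problem, here is a rough Python sketch of the general idea: collect the taxonomies of the top hits and keep only the ranks that a sufficient fraction of them agree on. This is not the q2-feature-classifier implementation; the function name, the 0.51 threshold used here, and the taxonomy strings are assumptions made for illustration.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage shared by at least `min_consensus` of the hits.

    Assumes semicolon-delimited lineages and a threshold above 0.5, so that
    at most one lineage can win at each rank.
    """
    if not hit_taxonomies:
        return "Unassigned"
    lineages = [[rank.strip() for rank in tax.split(";")]
                for tax in hit_taxonomies]
    consensus = []
    for depth in range(max(len(lineage) for lineage in lineages)):
        # Count full lineage prefixes down to this rank; hits whose
        # annotation stops above this rank cannot vote here.
        prefixes = Counter(tuple(lineage[:depth + 1])
                           for lineage in lineages if len(lineage) > depth)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(lineages) < min_consensus:
            break  # not enough agreement; stop at the previous rank
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"


# Made-up top hits: all agree on the genus, but the species are split 2 vs 2.
top_hits = [
    "k__Bacteria; p__PhylumA; g__GenusA; s__species1",
    "k__Bacteria; p__PhylumA; g__GenusA; s__species1",
    "k__Bacteria; p__PhylumA; g__GenusA; s__species2",
    "k__Bacteria; p__PhylumA; g__GenusA; s__species2",
]
print(consensus_taxonomy(top_hits))
```

With the hits split evenly between two species, the sketch stops at genus level, which is the behavior you want when a short read matches more than one species equally well.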

Good luck!