I am finding that many of my results do not reach species-level classification, at least when examined by relative percentage.
I am using open-reference clustering and the default database in the QIIME 2 release.
In a recent exercise with public data, I found that running NCBI BLAST on, let's say, unclassified sequences let me identify something I was missing in my results.
This is really quite typical — short amplicon reads (e.g., of 16S) typically cannot be resolved to species level because they match more than one species.
NCBI BLAST results are misleading — of course you get species-level classifications, because you are performing local alignment against other (usually longer) sequences that typically have species annotations. Just because a short read aligns to a reference, even perfectly, does not mean that is a correct match.
It is important to assess (1) how good that match is, e.g., how much of the read is covered and how many mismatches there are, and (2) how many other taxa have equally or similarly good hits.
Assessing the quality of matches can be an arduous process, especially if you have hundreds or thousands of sequences, and this is why QIIME 2 (and similar platforms) use other methods for taxonomic classification: they automate the process.
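The two checks above (match quality, and how many taxa tie for the best hit) can be sketched in a few lines of Python. This is only an illustration, not what QIIME 2 does internally: the function name `ambiguous_top_hits`, the dictionary keys, the thresholds, and the example hits are all assumptions made up for the sketch, and coverage is approximated as alignment length divided by query length.

```python
from collections import defaultdict


def ambiguous_top_hits(hits, min_identity=97.0, min_coverage=0.90):
    """For each query, list every taxon that ties for the best bit score.

    `hits` is an iterable of dicts with the (assumed) keys:
    query, taxon, pident, aln_len, query_len, bitscore.
    Hits below the identity or coverage thresholds are ignored.
    """
    by_query = defaultdict(list)
    for hit in hits:
        by_query[hit["query"]].append(hit)

    report = {}
    for query, query_hits in by_query.items():
        # Check (1): keep only hits that are good enough.
        kept = [
            h for h in query_hits
            if h["pident"] >= min_identity
            and h["aln_len"] / h["query_len"] >= min_coverage
        ]
        if not kept:
            report[query] = []  # no trustworthy match at all
            continue
        # Check (2): collect every taxon sharing the best score.
        best = max(h["bitscore"] for h in kept)
        report[query] = sorted({h["taxon"] for h in kept if h["bitscore"] == best})
    return report


# Made-up example hits, for illustration only.
hits = [
    {"query": "otu1", "taxon": "Lactobacillus casei", "pident": 100.0,
     "aln_len": 250, "query_len": 250, "bitscore": 462.0},
    {"query": "otu1", "taxon": "Lactobacillus paracasei", "pident": 100.0,
     "aln_len": 250, "query_len": 250, "bitscore": 462.0},
    {"query": "otu1", "taxon": "Lactobacillus rhamnosus", "pident": 99.2,
     "aln_len": 250, "query_len": 250, "bitscore": 451.0},
    {"query": "otu2", "taxon": "Escherichia coli", "pident": 100.0,
     "aln_len": 120, "query_len": 250, "bitscore": 222.0},
]

report = ambiguous_top_hits(hits)
for query, taxa in report.items():
    print(query, taxa)
```

A query that comes back with more than one taxon cannot be called at species level from the top hit alone, and a query with an empty list has no hit that passes the quality thresholds, even though BLAST still reported a "best" match for it.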
All that said, seeing unclassified sequences classify to bacteria with NCBI BLAST can sometimes indicate an issue with the database/classifier you are using, or with the query sequences. I recommend checking out the following troubleshooting steps just to make sure:
Different databases can often give different results, but not always better ones, and getting NCBI sequences into a QIIME 2-ready format can be a bit difficult (since NCBI does not release QIIME 2-formatted files). The link you provided is to QIIME 1 files, so it is probably woefully out of date even if it is formatted correctly.
A few options:
See the link above.
Try training a classifier on your specific amplicon region (see the tutorial at qiime2.org for details).
Try a different taxonomic classification method in q2-feature-classifier, like classify-consensus-vsearch.
Thanks a lot for your answer and suggestions. I would only like to add that the amplicon used is fairly long (500 nt); however, I imagine that your comments about the difficulty of reaching species level still hold, while for shorter amplicons even the order level is not reached.
Following your reasoning, I would just like to clarify a few points:
Regarding the database: the link I posted offers a QIIME 1-formatted database from NCBI, with sequences and taxonomic assignments. Now suppose we investigate how many different species are theoretically present in that database and in Greengenes, and find that one is enriched in species-level assignments (for example, different Lactobacillus species) and the other is not. May we conclude that the former is more appropriate to “map” those Lactobacillus species?
If this is correct, I would perhaps handle the amplicon region with your suggested step, but switch to closed-reference clustering because, if I understood correctly, it would not be good to, for example, use a BLAST search on NCBI to assign taxonomy to an unassigned OTU. Is this reasoning correct?
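The database comparison proposed above (counting how many species labels each reference taxonomy carries for a genus such as Lactobacillus) could be sketched roughly as follows. The sketch assumes a Greengenes-style taxonomy file: one tab-separated line per sequence, with ranks written as `g__...; s__...` and an empty `s__` when the species is unknown. The function name and the example lines are invented for illustration; an NCBI-derived file may use a different layout and would need its own parsing.

```python
def species_labels_for_genus(taxonomy_lines, genus):
    """Return the set of non-empty species labels annotated under `genus`.

    Each line is assumed to look like:
    <sequence id><TAB>k__...; p__...; ...; g__Genus; s__species
    """
    species = set()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue
        _seq_id, lineage = line.split("\t", 1)
        ranks = [rank.strip() for rank in lineage.split(";")]
        # Strip the three-character rank prefix ("g__" or "s__").
        found_genus = next((r[3:] for r in ranks if r.startswith("g__")), "")
        found_species = next((r[3:] for r in ranks if r.startswith("s__")), "")
        if found_genus == genus and found_species:
            species.add(found_species)
    return species


# Made-up taxonomy lines, for illustration only.
prefix = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
example_taxonomy = [
    "1001\t" + prefix + "f__Lactobacillaceae; g__Lactobacillus; s__reuteri",
    "1002\t" + prefix + "f__Lactobacillaceae; g__Lactobacillus; s__",
    "1003\t" + prefix + "f__Lactobacillaceae; g__Lactobacillus; s__helveticus",
    "1004\t" + prefix + "f__Streptococcaceae; g__Streptococcus; s__",
]

lactobacillus_species = species_labels_for_genus(example_taxonomy, "Lactobacillus")
print(len(lactobacillus_species), sorted(lactobacillus_species))
```

A higher count only shows that a database carries more species labels for that genus. It does not show that the amplicon region can actually tell those species apart.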
Length is not everything: different regions have more or less resolving power for different taxonomic groups, so 500 nt amplicons may still have trouble reaching species-level classification, and may not do much better than 300 nt amplicons (which can also generally get down to genus level, but it all depends on the region).
The issue with your proposed approach is that closed-reference OTU clustering can give misleading results: you are aligning against reference sequences and just picking the top hit, which for the purposes of taxonomy classification is equivalent to BLASTing and taking the top hit, and that is problematic as I described. Greengenes attempted to fix this issue by creating a consensus taxonomy among the sequences in each cluster, which is why so many Greengenes taxonomies are incomplete: the ambiguous levels indicate that a sequence in that cluster could belong to any number of unknowns. NCBI sequences would need to be processed accordingly, or else you risk a very high rate of false-positive classifications.
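To make the consensus idea concrete, here is a minimal sketch of collapsing the lineages in one cluster to the deepest rank on which they all agree. It is not the actual Greengenes procedure, only an illustration of the principle: the function name and the example lineages are made up, and the sketch truncates the lineage at the first disagreement rather than leaving an empty `s__` placeholder.

```python
def consensus_lineage(lineages):
    """Keep each rank only while every lineage in the cluster agrees on it.

    `lineages` is a list of semicolon-separated taxonomy strings.
    Comparison stops at the first rank where members disagree
    (or at the end of the shortest lineage).
    """
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    consensus = []
    for ranks_at_level in zip(*split):
        if len(set(ranks_at_level)) != 1:
            break  # members disagree here, so deeper ranks are ambiguous
        consensus.append(ranks_at_level[0])
    return "; ".join(consensus)


# Made-up cluster of near-identical reference sequences, for illustration only.
shared = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
          "f__Lactobacillaceae; g__Lactobacillus")
cluster = [
    shared + "; s__casei",
    shared + "; s__paracasei",
    shared + "; s__",
]

consensus = consensus_lineage(cluster)
print(consensus)
```

Here the cluster ends up annotated only to genus, because its members carry conflicting species labels. Taking the top hit instead would report whichever species label happened to come first.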