Yes. In fact, any FASTA sequence data can be used (with accompanying taxonomy in the same format used by Greengenes) once it is imported as a QIIME 2 FeatureData[Sequence] artifact.
The reason: classify-consensus-blast is NOT the same as NCBI BLAST. Read the paper that @Mehrbod_Estaki linked to. The classify-consensus-* methods wrap an alignment algorithm (blast+ or vsearch) for database searching, but then use an LCA (lowest common ancestor) method to find consensus taxonomy. So the underlying database search algorithm is the same, but with some extra code to determine taxonomic consensus among the hits.
The database is searched for matches to a query sequence. The top maxaccepts hits that have ≥ perc-identity to the query are retained; consensus taxonomy is then assigned by finding the deepest taxonomic rank at which at least a min-consensus fraction of the hits share the same assignment. So the default parameters find the 10 best hits in the database with ≥ 80% identity to the query, and taxonomy is assigned to the deepest rank where more than half of these hits share the same lineage.
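To make the consensus step concrete, here is a minimal Python sketch of the idea. This is not the plugin's actual code; the function name consensus_taxonomy and the example lineages are made up for illustration, and it assumes the hits have already been filtered by maxaccepts and perc-identity.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus=0.51):
    """Return the deepest lineage shared by >= min_consensus of the hits.

    hits: list of semicolon-delimited lineage strings (hypothetical input,
    standing in for the taxonomy of the retained database hits).
    """
    if not hits:
        return "Unassigned"
    lineages = [[rank.strip() for rank in hit.split(";")] for hit in hits]
    consensus = []
    depth = max(len(lineage) for lineage in lineages)
    for level in range(depth):
        # Count each lineage prefix down to this rank (hits that are
        # annotated less deeply than this rank do not vote).
        prefixes = Counter(
            tuple(lineage[: level + 1])
            for lineage in lineages
            if len(lineage) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        # Stop at the first rank where agreement drops below the threshold.
        if count / len(lineages) < min_consensus:
            break
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"


base = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
        "f__Lactobacillaceae; g__Lactobacillus")
hits = [base + "; s__acidophilus", base + "; s__crispatus", base + "; s__gasseri"]

print(consensus_taxonomy(hits))      # stops at g__Lactobacillus
print(consensus_taxonomy(hits[:1]))  # a single hit keeps its species label
```

With three equally good hits that disagree at species level, the sketch stops at genus; with only one hit there is nothing to disagree with, so the species label passes through unchanged.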
Those parameter names (obviously excluding those used for LCA consensus assignment) are the same as those used by blastn, so you can check out the blastn documentation for more details on the underlying algorithm.
The default parameters are based on the results of that paper that @Mehrbod_Estaki linked to. You can alter these to effectively skip the LCA consensus assignment by setting maxaccepts to 1: with only one hit retained, the consensus is simply that hit's taxonomy.
So that is the intended behavior of this method: to prevent "overclassification". Short DNA segments (e.g., V3-V4) can only contain so much information, and it is very difficult to reliably classify them to species level (read that paper @Mehrbod_Estaki linked to). Taking the top BLAST hit is a bad idea, because that is just the closest match, not necessarily the right match (or even necessarily better than other hits that are equally close). It will give you a species name even though that name is probably not correct, and other similar species may be equally close to the query. That is what we call "overclassification". To prevent that, we use methods that incorporate prediction confidence (e.g., classify-sklearn) or LCA consensus assignment to figure out the most specific lineage that a sequence can reliably be assigned to. For 16S rRNA gene amplicons, this is most frequently genus level (when classified correctly!), which is at times unsatisfying but technically correct.
If you are not worried about overclassifying, check out that paper to see parameter settings that will allow you to overclassify with some degree of safety.
For 16S I would just recommend using the Greengenes or SILVA databases, because otherwise you will need to format your own taxonomy strings from NCBI sequences.
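If you do go the NCBI route, the formatting step looks roughly like this. This is only a sketch: the function name to_greengenes_string and the input dictionary layout are hypothetical, and it assumes you have already looked up the rank names for each sequence yourself.

```python
# Greengenes-style lineages use a one-letter rank prefix followed by two
# underscores, with ranks separated by "; ".
RANK_PREFIXES = [
    ("kingdom", "k"),
    ("phylum", "p"),
    ("class", "c"),
    ("order", "o"),
    ("family", "f"),
    ("genus", "g"),
    ("species", "s"),
]


def to_greengenes_string(ranks):
    """Build a Greengenes-style taxonomy string from a rank -> name dict.

    ranks: hypothetical input such as {"kingdom": "Bacteria", ...}.
    Missing ranks are left empty (e.g. "s__"), as in Greengenes itself.
    """
    return "; ".join(
        f"{prefix}__{ranks.get(rank, '')}" for rank, prefix in RANK_PREFIXES
    )


example = {
    "kingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacteriales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
}
print(to_greengenes_string(example))
```

Every sequence ID in your FASTA file then needs one such string in the accompanying taxonomy file, which is why using a pre-formatted database is much less work.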
I would also recommend just downloading the full-length 16S sequences and then trimming them to V3-V4 (e.g., with qiime feature-classifier extract-reads).
Good luck!