BLAST consensus question

Hi there,

I used the Midori reference database which contains all metazoan species based on Genbank for my diet study. The database was trimmed with my forward and reverse primers. I used BLAST + for my classification with the following parameters:

qiime feature-classifier classify-consensus-blast
--i-query dada2-rep-seqs.qza
--i-reference-reads midori-ref-seqs.qza
--i-reference-taxonomy midori-ref-taxa.qza
--p-maxaccepts 1000
--p-perc-identity 0.97
--p-query-cov 0.89
--p-strand both
--o-classification EMR-diet-taxonomy-blast.qza
--verbose

After all of my filtering steps, I took some of the sequences identified to species and put them into NCBI blast. For example, this sequence:

TTTATCCAGAAACATTGCGCATGCTGGACCCTCTGTAGATCTAGCAATTTTCTCTCTTCATTTAGCTGGAGCATCATCAATTCTTGGTGCCATTAACTTTATTACAACAGTTATTAATATACGATGAAGGGGCCTACGTCTAGAACGTATTCCCTTATTTGTATGAGCAGTATTAATTACTGTAGTGTTACTTCTTCTCTCTTTACCAGTTCTTGCTGGTGCAATTACTATACTTCTTACAGACCGAAACCTAAACACCTCATTCTTTGATCCTGCAGGGGGCGGAGACCCAATTCTATATCAGCATTTATTC

was identified as Dendrodrilus rubidus in my output with a consensus of 1, but NCBI blast identified this sequence as Dendrodrilus sp., Lumbricidae sp., and Bimastos rubidus with 100% percent identity.

My question is, how does the BLAST+ classification method deal with classifying a taxa if the sequences are the same for multiple taxa? I'm a little concerned about using a species level classification if NCBI blast is identifying multiple species with 100% identity. I'm wondering if I should move classification down to a different level if the sequence belongs to multiple species.

Please let me know if anymore information is required. Thank you!

Hi @Newt,

I suspect, that part of the reason why classify-consensus-blast is ultimately choosing Dendrodrilus rubidus, has to do with the --p-min-consensus setting. If you read the help documentation by running:

qiime feature-classifier classify-consensus-blast --help

You'll see that the help text for --p-min-consensus says:

--p-min-consensus NUMBER Range(0.5, 1.0, inclusive_start=False,
    inclusive_end=True)   Minimum fraction of assignments must match top hit
    to be accepted as consensus assignment. [default: 0.51]

This setting, as well as what sequence data is ultimately included within your curated reference database, can alter the taxonomy consensus assignment.

This is quite common with short-read amplicon data. Which is why there are a variety of consensus approaches (vsearch & BLAST), and other taxonomy assignment tools (scikit-learn, RDP,...) out there.

Also, I recommend trying this new RESCRIPt approach for generating a reference database from GenBank data.

2 Likes

Thank you. So If I'm understanding this correctly, more than a 0.51 fraction of assignments belonged to Dendrodrilus rubidus, therefore this species was ultimately chosen to belong to that sequence? So even if the other species were included in my database, the consensus would still belong to Dendrodrilus rubidus?

That makes sense. I decided to use the BLAST consensus approach because the trained scikit-learn classifier resulted in many inaccurate assignments of species that did not occur in the geographic area and low confidence scores.

I chose the MIDORI database because the diet of this species is not well known, so we wanted to pick up on all potential prey items since the database contains all metazoans from GenBank. Is using the RESCRIPt approach more accurate for something general like this?

Thanks for the help.

1 Like

Correct. At least up to the best LCA it can resolve to.

Where they inaccurate or unresolved? Generally speaking, this often results from a reference database not being as expansive as it could be, i.e. not having enough target reference sequences in the database. Additionally, the marker gene region may not provide enough resolution for your target organisms of interest. Or a combination of these two issues or something else.

I would classify your sequences against the MIDORI database w/o trimming out the amplicon region. Sometimes, in silico extraction of amplicon sequences via PCR primer matches may fail, due to the reasons I've outlined in the RESCRIPt tutorial I linked earlier. Potentially, leading to issues in taxonomy assignment. If you are better able to identify taxa with the full-length MIDORI database, then that would suggests that the primer-based extraction did not work as well.

If you obtain similar results, then it could mean that the marker gene itself may not provide the needed resolution. I've had this problem with various metazoan eDNA studies myself. In which case the next step is to try another marker gene like 12S rRNA, or something else.

I honestly do not know. I only mentioned it for purposes of providing another option. At least try a step that does not depend too heavily upon PCR primer searches to extract the amplicon region. :man_shrugging:

In fact, you can take the amplicon region you've extracted from the MIDORI database, and use that as a reference to extract even more amplicon regions from MIDORI, at least those that might have been missed with the initial PCR primer search and extraction. That is, this will ensure that you've extracted as much of your amplicon region of interest from MIDORI. I guess I am saying try using both MIDORI and RESCRIPt. :handshake:

Besides, there are many approaches to constructing and curating a reference databases. The best thing to do is use the approach that best suites your needs.

Perhaps others on the forum have suggestions or approaches I'm not aware of?

-Cheers!
-Mike

2 Likes