Feature-classifier with Blast reference dataset

Benedict · July 5, 2018, 2:18am

Dear Qiime2 users,

Is there a way to assign taxonomy to query sequences against the NCBI database? I have run taxonomy assignment with the Greengenes classifier exactly as described in the tutorial, but I would like to compare the results against other reputable databases to get more reliable data.

As far as I know, each unique feature is annotated with a link for BLASTing it directly on the NCBI website, but that is very time-consuming. Is there a faster way to BLAST them all together? For instance, could I download the 16S database from NCBI, trim the reference sequences to the V3-V4 region, and train a usable classifier from them? If so, which NCBI database would you recommend? Training a classifier also requires a robust reference taxonomy; where can I obtain one?

I tried to train a SILVA classifier targeting the V3-V4 region of 16S rRNA, but it failed with a memory error. I see there is a pre-trained full-length SILVA classifier on the QIIME 2 resources page; can this classifier be used to assign taxonomy to query sequences from the V3-V4 region?

Mehrbod_Estaki · July 5, 2018, 3:13am

Hi @Benedict,

Have you had a chance to look through this new awesome tutorial that discusses the various taxonomic classification methods available in QIIME 2, including classifying with BLAST using the classify-consensus-blast method? This should cover a lot of the questions you have.

The pre-trained full-length SILVA classifier can certainly be used, but it may take a bit longer since it covers the full-length sequences. There may be some small improvements if you use a classifier trained on that specific region, but I think those differences will be minor, especially with a larger region like V3-V4. The tutorial linked above discusses this a bit more. Have a read through it and let us know if anything doesn't make sense.

Benedict · July 10, 2018, 9:10am

Hi,
Thank you for your response. I have just gone through the new tutorial mentioned above. However, I still have a few questions, listed below:

What is meant by the maximum number of hits to keep for each query? Should I keep the default setting of 10?
What is meant by "minimum fraction of assignments must match top hit to be accepted as consensus assignment"? The default setting is 0.51; what does this figure imply?
Is there a website where I can download the reference sequences and reference taxonomy? I am studying plant-associated bacterial communities, targeting the V3-V4 region of 16S rRNA. I tried the naive Bayes classifier with the Greengenes database, but too many assignments came back without a specific name.

Mehrbod_Estaki · July 11, 2018, 3:46am

Hi @Benedict,
You can read about the default parameters, their descriptions, and their implications in this paper.
In this method you are basically taking the top BLAST hits for your query in the reference database, comparing the taxonomy assignments of those hits, and arriving at a 'consensus' level at which taxonomy is assigned to that feature, starting at the Kingdom level and descending until there is no longer a consensus.

Maximum number of hits: this is the number of top BLAST hits in your reference database that are considered for each query.

Minimum fraction of assignments: this is the minimum fraction of those hits that must share the same taxonomy assignment before you reach 'consensus'.

Have a look through this section for some popular reference databases.

Can you clarify what you mean by "without specific name"? Do you mean that there is no assignment at all (Unassigned), or perhaps an assignment only at Kingdom level? Or are you simply not getting species-level assignments? If the former, there might be some more concerning problems with your feature table, for example if your primers/barcodes were not removed from your reads prior to denoising. If the latter, consider that not reaching species-level resolution is very common for short-target 16S amplicon sequencing, as the reads may simply not be long enough to differentiate taxa at the species level.

Benedict · July 11, 2018, 4:04am

Hi,

Does that mean the Greengenes reference sequences can be used with classify-consensus-blast? I thought the dataset could only be obtained from NCBI > Download > FTP. Sorry to go off on a tangent, but do you have any suggestions for downloading the V3-V4 region of 16S rRNA from the NCBI website? Any suggested keywords? (I'm very new to the bioinformatics world...)
Regarding the assignments without a specific name, it is the latter case: not getting down to species level. I will take your explanation into consideration, thanks!

Nicholas_Bokulich · July 11, 2018, 6:19pm

Yes. In fact, any FASTA sequence data can be used (with accompanying taxonomy in the same format used by Greengenes), once it is imported as a QIIME 2 FeatureData[Sequence] artifact.

The reason: classify-consensus-blast is NOT the same as NCBI BLAST. Read the paper that @Mehrbod_Estaki linked to: the classify-consensus-* methods wrap an alignment algorithm (blast+ or vsearch) for database searching, but then use an LCA (lowest common ancestor) method to find consensus taxonomy. So it is the same underlying algorithm for database searching, but with some code to determine taxonomic consensus among hits.

The database is searched for matches to a query sequence. The top maxaccepts hits in the database that have ≥ perc-identity to the query are retained; consensus taxonomy is assigned by finding the deepest taxonomic rank at which at least a min-consensus fraction of the hits share the same assignment. So the default parameters find the 10 best hits in the database with ≥ 80% identity to the query, and taxonomy is assigned down to the rank where more than half of these hits share the same lineage.
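To make the consensus step concrete, here is a minimal Python sketch of the idea. It is not the plugin's actual code: the function name `consensus_taxonomy`, the "Unassigned" label, the example lineages, and the use of `>=` against the threshold are all assumptions made for illustration. It also assumes the hits have already been found, filtered by identity, and limited to the top maxaccepts.

```python
from collections import Counter


def consensus_taxonomy(hit_lineages, min_consensus=0.51, unassigned="Unassigned"):
    """Illustrative LCA-style consensus over the lineages of the retained hits.

    hit_lineages: semicolon-delimited taxonomy strings, one per retained hit.
    Descends rank by rank and stops at the first rank where the most common
    label is shared by fewer than min_consensus of all hits.
    """
    if not hit_lineages:
        return unassigned
    lineages = [[rank.strip() for rank in lineage.split(";")]
                for lineage in hit_lineages]
    total = len(lineages)
    consensus = []
    while True:
        depth = len(consensus)
        # Only hits that agree with the consensus so far can extend it.
        candidates = [lin[depth] for lin in lineages
                      if len(lin) > depth and lin[:depth] == consensus]
        if not candidates:
            break
        label, count = Counter(candidates).most_common(1)[0]
        if count / total < min_consensus:
            break
        consensus.append(label)
    return "; ".join(consensus) if consensus else unassigned


# Made-up hits: 3 of 4 agree on genus, but only 2 of 4 agree on species.
hits = [
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__brevis",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__plantarum",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__brevis",
    "k__Bacteria; p__Firmicutes; g__Pediococcus; s__acidilactici",
]
print(consensus_taxonomy(hits))
# k__Bacteria; p__Firmicutes; g__Lactobacillus
```

In this toy case the species rank is dropped because 2/4 = 0.5 falls below 0.51. With a single retained hit, the sketch simply returns that hit's full lineage, which is the "top hit" behavior.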

Those parameter names (obviously excluding those used for LCA consensus assignment) are the same as those used by blastn, so you can check out the blastn documentation for more details on the underlying algorithm.

The default parameters are based on the results of that paper that @Mehrbod_Estaki linked to. You can alter these to exclude the LCA consensus assignment by setting maxaccepts to 1.

Not getting down to species level is the intended behavior of this method, to prevent "overclassification". Short DNA segments (e.g., V3-V4) can only contain so much information, and it is very difficult to reliably classify to species level (read that paper @Mehrbod_Estaki linked to). Doing something like taking the top BLAST hit is a bad idea, because that is just the closest match, not necessarily the right match (or even necessarily better than other hits that are equally close). It will give you a species name even though that name is probably not correct, and other similar species may be equally close to the query. That is what we call "overclassification". To prevent that, we use methods that incorporate prediction confidence (e.g., classify-sklearn) or LCA consensus assignment to figure out the most specific lineage that a sequence may reliably belong to. For 16S rRNA gene amplicons, this is most frequently genus level (when classified correctly!), which is at times unsatisfying but technically correct.

If you are not worried about overclassifying, check out that paper to see parameter settings that will allow you to overclassify with some degree of safety.

For 16S I would just recommend using the Greengenes or SILVA databases, because otherwise you will need to format your own taxonomy strings from NCBI sequences.

I would also recommend just downloading the full-length 16S sequences and then trimming them to V3-V4 (e.g., with qiime feature-classifier extract-reads).
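As a rough illustration of what trimming a reference to the amplified region means, here is a small Python sketch that cuts out the stretch between a forward and a reverse primer. This is a toy under stated assumptions, not how extract-reads is implemented: the function names are hypothetical, the primers and sequence in the example are made up (they are not real V3-V4 primers), and primers are matched exactly apart from IUPAC degenerate bases, with no tolerance for mismatches.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string, including degenerate bases."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(sequence, f_primer, r_primer):
    """Return the sequence between the primers, or None if either is missing.

    The reverse primer is given 5'->3' as ordered, so it is searched for
    as its reverse complement on the forward strand.
    """
    sequence = sequence.upper()
    fwd = re.search(primer_regex(f_primer), sequence)
    if fwd is None:
        return None
    downstream = sequence[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(r_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Toy example with made-up primers and a made-up reference sequence.
reference = "GG" + "ACGTC" + "AAATTTCCC" + "TGCAA" + "GG"
print(extract_region(reference, f_primer="ACGTN", r_primer="TTGCA"))
# AAATTTCCC
```

For real data I would still use the QIIME 2 command rather than something like this, since it is designed for whole reference databases and produces an artifact you can pass straight to the classifiers.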

Good luck!

Benedict · July 12, 2018, 1:53am

Thank you @Nicholas_Bokulich. That is a really informative explanation, and it makes the operation of classify-consensus crystal clear to me!
