VSearch classifier output: perc-identity information

Sanni_Hintikka · September 2, 2020, 9:28am

Hi!

I am running QIIME 2 (2019.10) in a Linux environment on an HPC and analysing some COI metabarcoding data for eukaryotes.
I have a question about the outputs from the vsearch classifier.

After BLASTing a few of the rep seqs I got from denoising (dada2 plugin), I find interesting "hits" that could point towards real signals given their high frequency in the samples, but their percent identity (pid) is below 95%. Most of these features also return more than 10 copies of the same "hit" in the individual BLAST search. I believe these could be hits at the genus or family level for some organisms.
I run qiime feature-classifier classify-consensus-vsearch, changing only the following parameters:
--p-perc-identity 0.97
--p-min-consensus 0.75
and this obviously just labels the features whose hits are all below 97% as "Unassigned".
I would like to lower the perc-identity so that I can see those lower pid hits as well, but I am unsure if the perc-identity that the hits were found with is recorded in the vsearch outputs anywhere?
So in other words, if I do a classification with e.g. perc-identity 0.94, can I somehow filter or extract the resulting taxonomic assignments based on the perc-identity they were found with?

Many thanks in advance!

Nicholas_Bokulich · September 2, 2020, 4:58pm

Welcome to the forum @Sanni_Hintikka!

Seems likely

VSEARCH records this information during alignment, but then q2-feature-classifier (in the process of finding the consensus taxonomy) chucks out that information once it is done with it.

Short answer: no, not directly. You could classify at multiple thresholds. E.g.,

1. classify at 97%
2. filter out unclassified seqs
3. re-classify unclassified seqs at 94%
4. etc.
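
Once the tiered runs are done, the results can be stitched back together so that each feature carries the strictest threshold at which it was assigned. This is only a minimal sketch: `merge_tiers` is a hypothetical helper (not part of QIIME 2), the taxonomy strings are invented, and it assumes each run's taxonomy has already been loaded into a plain dict of feature ID to taxon. Note that the recorded threshold only brackets the identity of the hits; it is not the per-hit identity itself.

```python
UNASSIGNED = "Unassigned"


def merge_tiers(tiers):
    """Merge per-threshold classifications, strictest threshold first.

    tiers: list of (perc_identity, {feature_id: taxon}) pairs.
    Returns {feature_id: (taxon, perc_identity)}; perc_identity is the
    strictest threshold at which the feature got an assignment, or None.
    """
    merged = {}
    for perc_identity, assignments in tiers:
        for feature_id, taxon in assignments.items():
            current_taxon, _ = merged.get(feature_id, (UNASSIGNED, None))
            if current_taxon != UNASSIGNED:
                continue  # already assigned at a stricter threshold
            tier = perc_identity if taxon != UNASSIGNED else None
            merged[feature_id] = (taxon, tier)
    return merged


# Invented example data: one run at 97%, then a 94% run on the leftovers.
tiers = [
    (0.97, {"asv1": "Chordata;Haemulidae;Haemulon",
            "asv2": UNASSIGNED,
            "asv3": UNASSIGNED}),
    (0.94, {"asv2": "Chordata;Haemulidae",
            "asv3": UNASSIGNED}),
]
merged = merge_tiers(tiers)
for feature_id, (taxon, tier) in sorted(merged.items()):
    print(feature_id, taxon, tier)
```

The merged table can then be filtered on the second element of each tuple, e.g. to keep only features first assigned at the 0.94 tier.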

What is your ultimate goal, though? I may be able to think up a more creative solution.

E.g., if you are concerned that the default settings are leading to misclassification (due to the large number of sub-optimal hits), check out the --p-top-hits-only option for that classifier. You can reduce the % id, increase max-accepts, and use top hits only. That would gather more hits for the consensus assignment, but only if they are top hits.

Sanni_Hintikka · September 3, 2020, 10:03am

Thanks!

That's what I was afraid of.

And this is what I was afraid you'd suggest. Luckily, I have access to an HPC, so I can do other things while these run.

Now, about the "ultimate goal", in a nutshell. The basic idea behind the data is to compare species occurrence between two different habitats (namely, mangroves and reefs). However, some of the sampling was done in lesser-studied areas, where even some of the more commonly found species may not yet be sequenced. If I can say with some level of confidence that features x and z come from e.g. a grunt family (entirely hypothetical scenario), but differ so much from each other that they likely originated from two different species, I could still get an idea of the overall diversity in the samples, with some level of taxonomic insight, and be able to say whether they occur in both habitats or just one.

I will keep this in mind. I am aware that some of the species found in my study areas only have a handful of entries in databases like MIDORI. What you're suggesting here sounds like a good idea, if I understand you correctly: if I set max-accepts to e.g. 20 and choose top-hits-only, it would pick up to 20 hits if they all have the same score (within the perc-identity I set), but would still give me a result if there are only, say, 5 top hits within the set perc-identity?

Thanks!

Nicholas_Bokulich · September 3, 2020, 10:13am

Thanks for the context. I have a few ideas.

1. Focus on ASVs/OTUs to find features that differentiate habitats. You could use a differential abundance method or q2-sample-classifier to find the differentiating features, and then focus on those (e.g., use NCBI BLAST to manually evaluate the results, or re-classify following the exhaustive procedure I outlined, this time on a much more restricted number of features so it would be quick).
2. You could either use VSEARCH directly to get those outputs, or modify the q2-feature-classifier source code to save the vsearch outputs to a local directory before throwing out the alignment results. This would of course require some basic familiarity with Python, so it is not ideal.
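
For the "use VSEARCH directly" route, here is a rough sketch of reading the per-hit identities back in. It assumes VSEARCH was run with `--usearch_global` and `--blast6out`, which writes a tab-separated table whose first three columns are query, target, and percent identity. The function name `read_hits` and the example rows are made up for illustration. Keep in mind that the identity column is on a 0-100 scale, whereas --p-perc-identity is given as a fraction.

```python
import csv
import io


def read_hits(handle, min_identity=0.0):
    """Collect (target, percent_identity) pairs per query from blast6-style rows."""
    hits = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or row[1] == "*":
            continue  # skip malformed rows and "no hit" placeholders
        query, target, identity = row[0], row[1], float(row[2])
        if identity >= min_identity:
            hits.setdefault(query, []).append((target, identity))
    return hits


# Invented rows standing in for a real output file.
example = io.StringIO(
    "asv1\tref_A\t98.7\t313\t4\t0\t1\t313\t1\t313\t-1\t0\n"
    "asv2\tref_B\t94.2\t313\t18\t0\t1\t313\t1\t313\t-1\t0\n"
    "asv2\tref_C\t91.0\t313\t28\t0\t1\t313\t1\t313\t-1\t0\n"
)
hits = read_hits(example, min_identity=94.0)
for query, matches in sorted(hits.items()):
    print(query, matches)
```

With a real file, `open("hits.tsv", newline="")` (a placeholder path) would take the place of the `io.StringIO` object.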

Exactly! This allows you to use a more permissive max-accepts setting without worrying that you would add low-quality hits to the consensus classification. So you could use a higher setting like 20 or 30, and q2-feature-classifier will only look for taxonomic consensus among the top hits (any that tie for top place).
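
To make the "ties for top place" idea concrete, here is a toy illustration. It is a simplified model, not the actual q2-feature-classifier code: `consensus_of_top_hits` is a hypothetical function, the hits are invented, and the rank-by-rank consensus rule is only an approximation of what the plugin does. It keeps the hits that share the best identity, then walks down the ranks for as long as at least `min_consensus` of those hits agree.

```python
from collections import Counter


def consensus_of_top_hits(hits, min_consensus=0.75):
    """hits: list of (percent_identity, [rank names]) for one query."""
    if not hits:
        return ["Unassigned"]
    best = max(identity for identity, _ in hits)
    # Keep only hits tied for the best identity (exact ties, as in this toy data).
    top = [lineage for identity, lineage in hits if identity == best]
    consensus = []
    for level in range(max(len(lineage) for lineage in top)):
        # Count lineage prefixes down to this rank among the top hits.
        prefixes = Counter(
            tuple(lineage[: level + 1]) for lineage in top if len(lineage) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        if count / len(top) < min_consensus:
            break  # not enough agreement at this rank; stop here
        consensus = list(prefix)
    return consensus or ["Unassigned"]


# Invented hits: four tie at 96.0%, one lower-identity hit is ignored.
hits = [
    (96.0, ["Chordata", "Haemulidae", "Haemulon"]),
    (96.0, ["Chordata", "Haemulidae", "Haemulon"]),
    (96.0, ["Chordata", "Haemulidae", "Anisotremus"]),
    (96.0, ["Chordata", "Haemulidae", "Anisotremus"]),
    (94.5, ["Chordata", "Lutjanidae", "Lutjanus"]),
]
result = consensus_of_top_hits(hits)
print(result)
```

In this toy case the four tied hits agree down to family but split evenly at genus, so the consensus stops at the family. The number of tied hits does not need to reach max-accepts for a result to come back.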

I hope that helps!

Sanni_Hintikka · September 3, 2020, 11:27am

I never even thought of finding the "driving" features before assigning taxonomy to them. I will definitely try this!
And you're correct that modifying the source code is a little out of my reach.
I will see what I get from following 1. and using top-hits-only with a higher max-accepts.

Thank you for your help!
