How to assign all non-reference sequence hits a value? (Vsearch)

Micro_Biologist · January 16, 2018, 9:20am

Hello,

I am using Vsearch to (currently closed reference) assign taxonomy to my sequences using:

qiime vsearch cluster-features-closed-reference \
--i-table Table97.qza \
--i-sequences RepSeqs97.qza \
--i-reference-sequences DiatomRefSeqs.qza \
--p-perc-identity 0.95 \
--p-threads 0 \
--o-clustered-table ClosedTable95.qza \
--o-clustered-sequences ClosedRepSeqs95.qza \
--o-unmatched-sequences UnmatchedRepSeqs95.qza \
--verbose

And I am currently taking all the abundance table ClosedTable95.qza into R for data manipulation.

I was wondering if there was a way of assigning all 'non-hits' a default name for further analysis, as my reference database is only a small proportion of the expected sequences (15%) and in order to be correctly formatted for 'downstream analysis' any non hits must be assigned something like "No_blast_hit", is this possible using open-reference sequencing?

EDIT: Adding an identifier tag within non-hit sequences would also be fine, as I could change them all in R later.

EDIT2: I have found a way within R (changing all feature IDs with over 25 characters to No_Hits) but a Qiime alternative if known would be good for methods not using R manipulation.

Nicholas_Bokulich · January 16, 2018, 5:09pm

Hi @Micro_Biologist,

Using closed-reference OTU picking to assign taxonomy is essentially just assigning the top hit taxonomy. I do not recommend that approach as discussed elsewhere, and would instead recommend assigning taxonomy subsequently using the vsearch classifier available in q2-feature-classifier. This classifier has a perc-identity parameter that you can adjust for the same purposes. The consensus taxonomy assignment provided by that classifier will also determine whether other near hits exist, in which case the top hit may not be a reliable classification. However, the consensus assignment can also be disabled by setting maxaccepts to 1, if you prefer to just grab the top hit.

Additionally, that classifier would report unclassified sequences as unclassified... hence no workarounds would be required.

Using the vsearch classifier (rather than OTU picker) would be the correct way to do this in QIIME2, and provides the necessary output.

You may also be interested in exclude-seqs (see also this tutorial). That would allow you to just remove all sequences that do not match a reference database above a specified percent identity. You can use this to split your feature table into two — e.g., to analyze these data in separate batches (e.g., classify diatoms with a diatom database and analyze non-diatoms separately downstream) — or just to discard other sequences (e.g., if you do not want diatom sequences, just get rid of them).

I hope that helps!

Micro_Biologist · January 17, 2018, 7:21am

Thought it was too easy! (Kidding )

Thanks for your help I'll try get that working!

EDIT: If I know there are no duplicate sequences and I do want the top hot is using vsearch in this way okay?

I only ask as using vsearch classifier means all my sequences get no hits whatsoever?

Nicholas_Bokulich · January 17, 2018, 1:47pm

It's not just duplicates — your sequences probably won't be a perfect match so there may be many hits that are near matches within a similar percent identity. But theoretically, yes, you could do things this way if you really just want the top hit. Even then, I'd still recommend using classify-consensus-vsearch as that assigns a proper taxonomy to all sequences without needing to do a workaround.

Really? If none of your sequences are being classified, we should examine the parameters you are using... it sounds like maybe the parameters need to be tuned for your specific data (these methods have been tuned for bacteria and fungi, not diatoms! which may need a bit of tweaking)

Micro_Biologist · January 17, 2018, 3:28pm

Unfortunately most of the problems seem to arise from this methods reliance on it being similar to a previous method (microscopy based).

I will get working on it when I get into work tomorrow! Hopefully with some parameter altering I can get it working, if not I'll come back and ask for more help, thank you very much!

system · February 17, 2018, 9:28pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.