Delays in taxonomic identification using Vsearch

Francesco_Martino · February 13, 2023, 10:23am

Hello everyone,

I would like to thank everyone in this forum for their hard work. I am relatively new to this, and I am currently facing a problem with my analysis.

I am trying to analyze my HiSeq 2x150 paired-end sequences using QIIME 2, and I have run into an issue. I initially tried the sklearn classifier, but the results were not accurate enough because of limitations in my reference database (I am trying to identify some invertebrates). The species identified were similar to the ones I know should be present, but they occur in places far away from the analysis site. So I switched to vsearch to get more reliable results, but I have been waiting for about 6 days now and I think it is taking a bit too long...

Is there something I'm doing wrong in my pipeline? The command I am using is:

qiime feature-classifier classify-consensus-vsearch \
  --i-query representative-seqs.qza \
  --i-reference-reads ncbi-16s-derep.qza \
  --i-reference-taxonomy ncbi-16s-taxa-derep.qza \
  --p-perc-identity 0.97 \
  --p-min-consensus 0.51 \
  --p-maxaccepts 5 \
  --p-threads 20 \
  --o-classification taxonomy-vsearch.qza

I have tried to find a solution by reading all the previous posts in the forum, but unfortunately, I have not been able to reduce the waiting time.

Thank you for your help and understanding.

crusher083 · February 13, 2023, 12:29pm

Hello,

to see whether your job is running smoothly, add the --verbose flag; it will give you a progress log. Unfortunately, with the information at hand I cannot estimate the runtime of vsearch, so it is hard to say whether 6 days is abnormal for this exact case.
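
For a job that runs for days, it can help to send the --verbose output to a file so you can check on it later. Here is a minimal sketch, assuming a POSIX shell; the helper name run_logged and the log file name vsearch.log are made up for illustration and are not part of QIIME 2.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, appending its stdout and stderr to LOGFILE together with
# start and finish timestamps. Hypothetical helper, not a QIIME 2 feature.
run_logged() {
  log_file="$1"
  shift
  printf 'started: %s\n' "$(date)" >> "$log_file"
  "$@" >> "$log_file" 2>&1
  exit_code=$?
  printf 'finished with exit status %s: %s\n' "$exit_code" "$(date)" >> "$log_file"
  return "$exit_code"
}

# Example usage (not executed here), with the command from the first post:
#   run_logged vsearch.log qiime feature-classifier classify-consensus-vsearch \
#     --i-query representative-seqs.qza \
#     --i-reference-reads ncbi-16s-derep.qza \
#     --i-reference-taxonomy ncbi-16s-taxa-derep.qza \
#     --p-perc-identity 0.97 --p-min-consensus 0.51 --p-maxaccepts 5 \
#     --p-threads 20 --o-classification taxonomy-vsearch.qza --verbose
# Then, from another terminal:  tail -f vsearch.log
```

Because everything is appended to the file, the log survives a dropped SSH session as long as the job itself keeps running (for example inside screen, tmux, or a scheduler).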

Cheers,
V

Francesco_Martino · February 13, 2023, 1:53pm

Hi Valentyn,

thank you so much for the response!

Right now, after 6 and a half days, the command has finished, and the result seems much better! Obviously, since the number of sequences obtained from an Illumina HiSeq is very high, vsearch took a long time to perform the alignments.

Thank you so much for the advice about '--verbose', which I will definitely apply in the future, and I apologize for my useless question.

Cheers,
FM

crusher083 · February 13, 2023, 1:55pm

No worries, there are no useless questions!

Nicholas_Bokulich · February 13, 2023, 2:46pm

Hi @Francesco_Martino ,

Just to add, 6 days might be "normal" if you have a very large reference database and/or very large number of query sequences.
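
To get a feel for how large the job is, you can count the query and reference sequences. This is only a rough sketch: it assumes the artifacts have first been exported with qiime tools export, which for sequence artifacts writes a dna-sequences.fasta file. The helper name count_fasta_records and the output directory names are made up for illustration.

```shell
# count_fasta_records FILE
# Prints the number of FASTA records, i.e. lines starting with '>'.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Example usage (not executed here); the exported-* directory names are
# arbitrary choices for this sketch:
#   qiime tools export --input-path representative-seqs.qza --output-path exported-query
#   qiime tools export --input-path ncbi-16s-derep.qza --output-path exported-ref
#   count_fasta_records exported-query/dna-sequences.fasta
#   count_fasta_records exported-ref/dna-sequences.fasta
```

The two counts do not translate directly into a runtime, but they show whether the query set or the reference database is the unusually large part.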

I am not sure that VSEARCH will be more accurate in this case, and VSEARCH can be quite slow, as it performs global alignment.

You might want to increase the confidence threshold in this case. It will likely lead to underclassification, e.g., to genus level instead of species, but it sounds like that may be better in your case than a species identification that is a misclassification. You could also try using the q2-clawback plugin to upweight species that you expect in your environment prior to classification — but this would be quite challenging to accomplish if you do not have pre-existing observation data for weighting species abundances.
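
As a sketch of what raising the threshold could look like, assuming the threshold meant here is the --p-confidence parameter of classify-sklearn: the helper below only prints one command per confidence value, so you can review them before running anything. The helper name, the classifier.qza file name, and the output naming scheme are all hypothetical.

```shell
# print_sklearn_commands CLASSIFIER READS CONF [CONF...]
# Prints (does not run) one classify-sklearn command per confidence value.
# Hypothetical helper for illustration only.
print_sklearn_commands() {
  classifier="$1"
  reads="$2"
  shift 2
  for conf in "$@"; do
    printf 'qiime feature-classifier classify-sklearn --i-classifier %s --i-reads %s --p-confidence %s --o-classification taxonomy-sklearn-conf-%s.qza\n' \
      "$classifier" "$reads" "$conf" "$conf"
  done
}

# Example usage (prints three commands; classifier.qza is a placeholder name):
#   print_sklearn_commands classifier.qza representative-seqs.qza 0.8 0.9 0.95
```

Comparing the resulting taxonomies at a few thresholds shows at which point the implausible species-level assignments fall back to genus level.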

If I were you I would:

1. check the job resources (CPUs/RAM) to make sure that it is actually still running
2. if it is, just keep waiting and accept that you have a larger-than-normal job
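
The resource check can be sketched like this, assuming a Linux or macOS shell where you already know the process ID of the job (for example from pgrep -f vsearch, if pgrep is available). The helper name check_job is made up for illustration.

```shell
# check_job PID
# Prints "running" plus elapsed time, CPU and memory use if the process is
# alive and owned by you, otherwise prints "not running".
# Hypothetical helper for illustration only.
check_job() {
  job_pid="$1"
  if kill -0 "$job_pid" 2>/dev/null; then
    echo "running"
    # Columns: PID, elapsed time, %CPU, %MEM, command name.
    ps -o pid=,etime=,pcpu=,pmem=,comm= -p "$job_pid" 2>/dev/null \
      || echo "(ps details unavailable on this system)"
  else
    echo "not running"
  fi
}

# Example usage (not executed here):
#   check_job 12345
```

A process that is alive but sits near 0% CPU for a long time is worth a closer look; one that keeps all requested threads busy is most likely just working through a large job.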

Francesco_Martino · February 13, 2023, 3:45pm

Hi @Nicholas_Bokulich!

Thanks for helping clear things up.

I might not have explained myself well when I talked about using vsearch being more reliable. My goal was to raise the confidence threshold so we could avoid strange identifications due to the lack of species sequences in my study area (the Venice Lagoon in my case).

In my case, I prefer taxonomic underclassification as long as the results are reliable and believable, and using VSEARCH was a good solution for me.

Thanks again for your help and taking the time to answer my question.

Cheers!
