qiime2 OTU picking with hundreds of samples

Hi,

I am writing because I am having trouble scaling up vsearch OTU clustering. I have already posted in the user support forum, but the problem is not solved. I would like to hear from people who run hundreds of samples in parallel. I do not get any error, since the procedure does not end :slight_smile:

I am running qiime2 vsearch from a conda installation, after dereplicating sequences, in the following way. The command is launched using a variant of make:

joined_import_filter_derep_OTU:
	echo PBS -N star -l select=2:ncpus=12:mem=$(STAR_RAM)gb;\
	condactivate qiime2.1 ;\
	qiime vsearch cluster-features-open-reference \
	--i-table table.qza \
	--i-sequences rep-seqs.qza \
	--i-reference-sequences 85_otus.qza \
	--p-perc-identity 0.85 \
	--p-threads 8 \
	--o-clustered-table table-or-85.qza \
	--o-clustered-sequences rep-seqs-or-85.qza \
	--o-new-reference-sequences new-ref-seqs-or-85.qza \
	--verbose

I succeeded in clustering OTUs using 30 samples and the suggested options for parallelisation. I am now waiting for a job running 200 samples, but it seems to need a lot of time, whereas the same phase (OTU clustering) in qiime1 did not take so long (a couple of days with qiime1 versus more than 2 weeks with qiime2). Is there anybody who has experience with this many samples?
Thanks a lot!

Hoping this would be sufficient for you,
please tell me if you think you need more

The question, however, is very simple: a parallelised run with hundreds of samples has already completed with qiime1 in less time.

Did you have any similar feedback from users?
Thanks a lot

Michela

Hi @MichelaRiba,
We have not forgotten about you — please hang in there.

Your issue is quite similar to one that you reported earlier with a smaller number of samples: qiime2 OTU picking

We have not been able to replicate this issue so far and have not had others report an issue like this. The sort of conclusion we came to on that previous topic was that you are not allocating cluster resources correctly.

How about we pick up there on this topic: please check the resource use of the finished jobs for comparison. You should check with your system admin to see what qsub command will give you a report of total CPU, RAM, etc. used by a finished job (I know there is such a command in Slurm, which I use; I don't know the equivalent for qsub, but I expect it exists).

Something abnormal is occurring here. 200 is not an enormous # of samples at all (though it's the number of sequences that will matter for OTU clustering, not the number of samples, I am just assuming this translates to a "normal" # of sequences per sample). OTU clustering can be a time-consuming process but not weeks — I have run many studies consisting of 200 samples or more on a dusty old laptop within a couple of hours (with QIIME 2). This is why I really suspect something is going wrong with resource allocation.

How many sequences are you attempting to cluster? How many sequences did you have prior to dereplication? What type of sequences are you attempting to cluster? (16S?) and how long are the sequences?

Hi,

thanks a lot!

Meanwhile I did some trials:

3956736 sequences have been aligned in nearly 1 hour,

11327844 sequences in 17 hours

89199952 in nearly 595 hours

(I am reporting sequence counts before quality filtering and dereplication; I imagine that is sufficient for the proportions, right?)
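
One way to compare these three runs is to turn them into throughput (sequences per hour). The sketch below only uses the counts and approximate durations quoted above; the `throughput` helper is a hypothetical name, and since the times are rounded ("nearly 1 hour"), the results are rough.

```shell
# Throughput for the three runs reported above (raw sequence counts and
# approximate wall-clock hours, copied from the post).
throughput() {
  # $1 = number of sequences, $2 = hours; prints sequences per hour
  awk -v n="$1" -v h="$2" 'BEGIN { printf "%.0f\n", n / h }'
}

for run in "3956736 1" "11327844 17" "89199952 595"; do
  # Split "count hours" into $1 and $2.
  set -- $run
  printf '%10s sequences in %3s h -> %8s sequences/hour\n' \
    "$1" "$2" "$(throughput "$1" "$2")"
done
```

With these numbers the rate falls from about 3.96 million to about 150 thousand sequences per hour, so the run time grows much faster than the input size. That pattern fits a step whose cost is not linear in the number of sequences, but it does not by itself say which step that is.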

Thanks a lot

Michela

I additionally found that parallelising vsearch with several threads shows up clearly in CPU usage (e.g. with 12 cores and 12 threads, vsearch ran at 1100% or more, which seems very good). But after vsearch, the command goes back to qiime (I imagine this is the database comparison step) and I always see less than 100%. To me this means that the command runs in parallel for the vsearch part of OTU clustering, but not for the subsequent steps, is that right?

Thank you so much

Michela

That's correct. VSEARCH is performing the alignment, then QIIME 2 maps those sequences to OTUs and builds a feature table from them and that step is not being parallelized. It sounds like that step is probably dragging, which probably indicates an unusually large number of features, i.e., clustering may not be resulting in much dereplication? How many features do you have vs. input dereplicated sequences?

What sort of filtering are you performing prior to dereplication/clustering? If using OTU clustering alone (no denoising), then you should do some fairly aggressive filtering of the raw sequences and that will help reduce runtime.

Could you also check memory usage? I wonder if you are hitting the max memory requested causing that step to lag?

Hi, thanks a lot again for the follow up!

I am sorry, I did not describe the pre-processing steps well.
Prior to vsearch clustering I did:
quality filtering on joined sequences and
vsearch dereplication.
If I extract the data from that step I find a fasta file of sequences. For example, in my project with nearly 30 samples, which took just 17 hours,
that fasta file (the input for vsearch clustering, if I am correct) has
7441021 lines.
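
A line count is only an upper bound on the number of sequences, because each FASTA record has a header line plus one or more sequence lines. A minimal sketch for counting records instead, using a tiny made-up FASTA file so that it runs anywhere (the records and the `count_fasta_records` helper are hypothetical):

```shell
# Build a tiny stand-in FASTA (hypothetical records) so the snippet is self-contained.
fasta="$(mktemp)"
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$fasta"

count_fasta_records() {
  # Each record starts with a header line beginning with '>'.
  grep -c '^>' "$1"
}

# Arithmetic expansion strips the padding some wc implementations add.
lines=$(( $(wc -l < "$fasta") ))
records=$(count_fasta_records "$fasta")
echo "lines=$lines records=$records"

rm -f "$fasta"
```

On real data the same `grep -c '^>'` can be pointed at the FASTA file exported from the artifact, for example with `qiime tools export --input-path rep-seqs.qza --output-path <some-directory>`.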

Regarding the issue of memory I cannot see problems, or perhaps I did not check the right way, however:
I had this report:
Resources:
Limits: mem=128gb,ncpus=36,place=free
cpupercent=995,cput=17:04:44,mem=2528064kb,ncpus=36,vmem=18674140kb,walltime=15:33:29
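
The report above can be reduced to two figures that are easier to read: the average number of cores kept busy (CPU time divided by wall time) and the memory used. This is a rough sketch that parses the line exactly as pasted; the helper names are hypothetical, and it assumes the `kb` suffix means units of 1024 bytes.

```shell
# Resource line copied from the PBS report above.
report='cpupercent=995,cput=17:04:44,mem=2528064kb,ncpus=36,vmem=18674140kb,walltime=15:33:29'

field() {
  # Print the value of key $1 from the comma-separated report.
  printf '%s\n' "$report" | tr ',' '\n' | awk -F= -v k="$1" '$1 == k { print $2 }'
}

to_seconds() {
  # Convert HH:MM:SS to seconds.
  printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

cput_s=$(to_seconds "$(field cput)")
wall_s=$(to_seconds "$(field walltime)")
mem_kb=$(field mem | sed 's/kb$//')

# Average cores busy = total CPU seconds / elapsed seconds.
avg_cores=$(awk -v c="$cput_s" -v w="$wall_s" 'BEGIN { printf "%.2f", c / w }')
# Assumption: "kb" is 1024 bytes, so dividing by 1024 twice gives GiB.
mem_gib=$(awk -v m="$mem_kb" 'BEGIN { printf "%.2f", m / 1024 / 1024 }')

echo "CPU time: ${cput_s} s, wall time: ${wall_s} s"
echo "average cores busy: ${avg_cores} (of $(field ncpus) requested)"
echo "memory used: ${mem_gib} GiB"
```

An average close to one busy core out of 36 requested would suggest that most of the wall time is spent in a single-threaded phase.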

I have exaggerated the requests, perhaps not in the correct way, because in the end the process runs on only one node (with 18 cores, parallelized to 12 threads). Maybe it is better to set cores=threads, as my system manager suggests.

In addition, I found this other discussion about parallelisation and wrote there for feedback.

There it seemed that with large datasets we can expect to wait a long time, is that right?

Michela

Hi, meanwhile I am re-doing the procedure with 126 samples using qiime1 uclust for OTU picking (pick_open_reference_otus.py), and I obtained the results in 4 hours. Consider that it took 15 or so hours for qiime2 vsearch to process 30 samples.
May I conclude that qiime1 is still a good idea for processing samples in less time?

Thanks a lot

Michela

Hi @MichelaRiba!

I'm not sure. QIIME 1 is no longer supported, so I cannot recommend it.

I have a question about the resource report you posted (the Resources lines with cput, mem and walltime).

Is this what you requested, or is this what was used? I ask because the mem=2528064kb value jumps out at me.

This is roughly 2 GB of memory, and I suspect it is a report of how much memory was used, which means requesting 128 GB could be a gross over-allotment.

Also, for your elapsed time reports - are these the total time elapsed from submission, or the total time from start of execution? On my institutional cluster I often have to wait to get a large allocation upon submission, the delay between submission and start of execution can be pretty long (dozens of hours). Usually this delay is impacted by the size of the resources requested - the more requested, the longer the delay.

I'm still not 100% sure what is going on here, but I just did a bit of benchmarking and couldn't find any obvious bottlenecks on this end, with the exception of vsearch (the underlying tool that q2-vsearch uses). I think if I were in your position I would try to sit down with the sysadmin of this cluster/computation resource and verify that the resources are being requested correctly, and that they are also being allocated correctly.

Sorry that's not a super interesting answer, but I hope it helps.

:qiime2:

Hi,
thanks a lot again for reply!

Sorry for not commenting inline:

  • I discussed this with the system admin.

  • I checked, by logging in to the node where the job was launched, that the job started running immediately, and this was the case.

  • About the reports: they refer to the used resources.
    CPU time: I personally checked by logging in to the node where the job was launched by make, and using the top command I could see qiime running. In detail: first qiime with CPU usage of roughly 99%, then vsearch with 1150%, parallelized I think (12 threads on twelve cores),
    then, for most of the time, top again shows qiime with 99% CPU usage.
    For this reason I do not think we did it wrong. Instead, the steps in the command I report below are not all parallelized, since in practice only vsearch really uses 12 CPUs.
    It would be very useful for me to understand what happens after the vsearch part of the command. If this is not clear I will explain again.
    This does not seem to me to be a memory problem (if anything, too much memory is requested: usage is between 0.5% and 7%) but an issue of parts that are not parallelized. Could you dissect the workflow of this command so we can see where the bottleneck is?

  • Could you comment on the post I mentioned, where another user reported problems with low speed in other commands?
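
The top readings in the list above can be combined with the PBS report posted earlier in the thread to estimate how the wall time splits between the parallel vsearch phase and the single-threaded qiime phases. This is only a back-of-the-envelope sketch: it assumes a two-phase model (vsearch at about 11.5 cores, everything else at about 1 core) and that the report and the top readings describe comparable runs, neither of which is confirmed here.

```shell
# Inputs taken from figures quoted in this thread; the two-phase model is an assumption.
cput_s=61484      # cput=17:04:44 from the PBS report
wall_s=56009      # walltime=15:33:29 from the PBS report
par_cores=11.5    # vsearch phase, about 1150 % in top
ser_cores=1.0     # qiime phases, about 99 % in top

# avg = f * par + (1 - f) * ser, solved for f, the share of wall time spent in vsearch.
par_fraction=$(awk -v c="$cput_s" -v w="$wall_s" -v p="$par_cores" -v s="$ser_cores" \
  'BEGIN { avg = c / w; printf "%.4f", (avg - s) / (p - s) }')
par_minutes=$(awk -v f="$par_fraction" -v w="$wall_s" 'BEGIN { printf "%.0f", f * w / 60 }')

echo "estimated share of wall time in the parallel phase: ${par_fraction}"
echo "estimated parallel phase duration: ${par_minutes} min of $((wall_s / 60)) min"
```

Under these assumptions the multi-threaded phase would account for only a small part of the run, which would mean that adding threads cannot shorten the job much.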

qiime vsearch cluster-features-open-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --p-threads 12 \
  --o-clustered-table table-or-85.qza \
  --o-clustered-sequences rep-seqs-or-85.qza \
  --o-new-reference-sequences new-ref-seqs-or-85.qza \
  --verbose

Thanks a lot

Michela

Thanks for the answers, @MichelaRiba. Just curious, how many of your features are not being clustered in the closed-reference step of this command? If you are seeing a significant number miss the closed-ref clustering, then I think this long runtime makes a lot of sense to me, because
a) the closed ref step is long and slow and if many sequences aren't clustered here you wind up with
b) lots and lots of sequences at the de novo clustering step, which is also very long and slow (I suspect slower than the closed ref step).

One option to further diagnose is to manually perform your own open-reference clustering by breaking up this pipeline into its discrete commands:

  1. cluster_features_closed_reference
  2. filter_features on the feature table provided to step 1, keeping only the features that weren't clustered in step 1
  3. cluster_features_de_novo using the unmatched seqs from 1 and the filtered table from 2
  4. merge the results
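
The four steps above could be sketched on the command line roughly as follows. This is an illustration, not the pipeline's own code. The input file names come from the command earlier in this thread, and all intermediate and output names are made up. It assumes that the unmatched-sequences artifact can be passed as feature metadata to `filter-features`, and it uses a plain table and sequence merge for step 4. The script is a dry run by default, so it only prints the commands.

```shell
set -e

# Dry run by default: each command is printed, not executed.
# Set DRY_RUN=0 to run the steps (requires an activated QIIME 2 environment).
DRY_RUN="${DRY_RUN:-1}"
THREADS="${THREADS:-12}"
IDENTITY="${IDENTITY:-0.85}"

run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    start=$(date +%s)
    "$@"
    echo "  took $(( $(date +%s) - start )) s"
  fi
}

open_reference_by_hand() {
  # 1. Closed-reference clustering against the reference OTUs.
  run qiime vsearch cluster-features-closed-reference \
    --i-table table.qza \
    --i-sequences rep-seqs.qza \
    --i-reference-sequences 85_otus.qza \
    --p-perc-identity "$IDENTITY" \
    --p-threads "$THREADS" \
    --o-clustered-table table-cr.qza \
    --o-clustered-sequences rep-seqs-cr.qza \
    --o-unmatched-sequences unmatched.qza

  # 2. Keep only the features that did not match the reference
  #    (assumption: the unmatched sequences are usable as feature metadata).
  run qiime feature-table filter-features \
    --i-table table.qza \
    --m-metadata-file unmatched.qza \
    --o-filtered-table table-unmatched.qza

  # 3. De novo clustering of the unmatched sequences.
  run qiime vsearch cluster-features-de-novo \
    --i-table table-unmatched.qza \
    --i-sequences unmatched.qza \
    --p-perc-identity "$IDENTITY" \
    --p-threads "$THREADS" \
    --o-clustered-table table-dn.qza \
    --o-clustered-sequences rep-seqs-dn.qza

  # 4. Merge the closed-reference and de novo results.
  #    The two tables share samples, so overlapping counts are summed.
  run qiime feature-table merge \
    --i-tables table-cr.qza \
    --i-tables table-dn.qza \
    --p-overlap-method sum \
    --o-merged-table table-or-by-hand.qza
  run qiime feature-table merge-seqs \
    --i-data rep-seqs-cr.qza \
    --i-data rep-seqs-dn.qza \
    --o-merged-data rep-seqs-or-by-hand.qza
}

open_reference_by_hand
```

With `DRY_RUN=0` the elapsed seconds are printed after every step, which shows directly whether the closed-reference or the de novo step dominates. The built-in open-reference pipeline may assemble its final outputs differently, so the merged files are best treated as a diagnostic aid.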

vsearch is the bottleneck in all the tests I have performed (if you wish to send some data to me I can take a closer look).

The post you have linked to isn't specific to this discussion and is a general question asking about parallelization strategies. Not all methods can be parallelized, however this one can (and already is).

Keep us posted!

:qiime2:

Hi,

thanks a lot for your answer.

Good point about separating the de novo step from the rest. Could you be more specific about where I can see it in the scripts called by vsearch?
I suppose that vsearch cluster-features-open-reference does exactly those steps; if you could tell me where I can see the commands, I could dissect the workflow.

Thanks a lot

Michela

joined_import_filter_derep_OTU:
	echo PBS -N star -l select=2:ncpus=12:mem=$(STAR_RAM)gb;\
	condactivate qiime2.1 ;\
	qiime vsearch cluster-features-open-reference \
	--i-table table.qza \
	--i-sequences rep-seqs.qza \
	--i-reference-sequences 85_otus.qza \
	--p-perc-identity 0.85 \
	--p-threads $(CORES) \
	--o-clustered-table table-or-85bis.qza \
	--o-clustered-sequences rep-seqs-or-85bis.qza \
	--o-new-reference-sequences new-ref-seqs-or-85bis.qza \
	--verbose

May I follow for example your page:


Here I can see

  • Closed reference OTU picking with vsearch
    which produces a file of unclustered sequences

  • I would take those unclustered sequences and apply open reference with 99% identity instead of 85%

Supposing (I am going to test this) that the second phase takes a long time,
I could then make some additional decisions.

  • Then how could I merge the results?

Thanks a lot

Michela

The general steps are here:

The specific steps are here:

Hope that helps!

Hi

Thanks a lot!!

I am currently attending the MicrobiotaMi meeting and I will be hands-on again next week.

Michela

Hi,

I am running VSEARCH closed-reference OTU picking, and it again seems to be very slow with the complete dataset, a total of 8,119,151 sequences (after quality filtering and dereplication).
I link this post, where it seems that vsearch classification may be slower than other methods:

"We are currently recommending that users avoid using classify-consensus-vsearch for more than tens of sequences.

Fortunately, classify-consensus-blast gives very similar performance to classify-consensus-vsearch in terms of accuracy, but in our tests runs 50 times faster. If run time is still an issue, classify-sklearn was 500 times faster in our tests. There is a tutorial for how to use classify-sklearn here ."

Could it be that my problem with vsearch OTU picking is related to that?

Thanks for your patience

Michela

Ciao @MichelaRiba,

I don't have much experience with vsearch in QIIME 2, but if I recall correctly, one of the differences between vsearch and blast is that blast keeps searching only until it finds the specified number of matches above the threshold for an OTU, whereas vsearch scores an OTU against all the sequences in the database and then takes the top hits. At least this should be true for classify-consensus-vsearch vs classify-consensus-blast, as per the plugin description.

So it would make sense to me for vsearch to be more time- and memory-consuming for a big dataset (also depending on the size of the database).

Luca

Hi,
thanks a lot for your kind reply.

This helps me a lot to understand why I am experiencing that very long running time

Michela