Runtime Estimation using vsearch cluster-features-open-reference

Hello, I've been running OTU clustering of ~500 bp dereplicated 18S V4-V5 amplicon reads. I have quite a large dataset of 54 samples totaling ~84M reads. In hindsight, there might have been a better way to split my samples into smaller groups for separate, less-intensive analysis, but I have instead been stuck running the clustering on 16 threads for 143 hours and counting.

I would like to gather opinions on runtime estimation and advice on whether or not it is worth waiting for the run to finish in its current state. I hope to finish said run within the next week since this is done on a shared server. If not, I would probably end the run now rather than wait for what could take months.

I have done clustering before on smaller datasets and it took an hour or two, so I suspect that the lack of physical memory on the server and the large allocation in the Swp memory is causing the slowdown. Any opinions will be greatly appreciated.

qiime vsearch cluster-features-open-reference \
  --i-table seqs_derep_table.qza \
  --i-sequences seqs_derep.qza \
  --p-perc-identity 0.97 \
  --i-reference-sequences PR2_enriched_V4V5_ref_seqs.qza \
  --o-clustered-table otu-table.qza \
  --o-clustered-sequences otu-seqs.qza \
  --p-threads 16 \
  --o-new-reference-sequences otu-ref-seqs.qza

Good morning RielAlfonso,

Thank you for providing your full command and resource usage screenshot!

The heavy swap usage you describe is the problem, unfortunately. Your options are limited on this hardware.

Runtime depends on both the size and the complexity of the dataset, so it's hard to estimate.
Here, it does not matter: overflowing memory into swap is causing the slowdown, so normal estimates don't apply.
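If you want to confirm how much swap is in use without a screenshot, here is a minimal sketch. It assumes a Linux server that exposes the standard `SwapTotal` and `SwapFree` fields in `/proc/meminfo`; the function name `swap_used_kb` is just something I made up for illustration.

```shell
# Reads /proc/meminfo-style text on stdin and prints the swap in use, in kB.
# swap_used_kb is a hypothetical helper name, not part of any tool.
swap_used_kb() {
  awk '/^SwapTotal:/ {total = $2}
       /^SwapFree:/  {free = $2}
       END {print total - free}'
}

# Only query the live system when /proc/meminfo is available (Linux).
if [ -r /proc/meminfo ]; then
  swap_used_kb < /proc/meminfo
fi
```

A number that stays large or keeps growing while the job runs suggests the clustering does not fit in physical memory.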

Here are some options, from easy to hard:

  • cancel this job, then rerun it by itself (you already knew this)
  • switch from OTU clustering to ASVs, as you can run those in batches
  • get an account on your institution's HPC
  • rent a bigger computer (from the commercial cloud :money_with_wings: :cloud: )

Let us know what you try next.

I do have access to another server with 256GB of RAM, but it's currently working on its own set of shotgun sequences. This might be leaning towards a different support category, but would it be acceptable to run OTU clustering in batches, instead of ASVs, provided I use the new reference sequences output from each batch as the reference for the subsequent batches? I could then combine the feature tables and representative sequences for taxonomy assignment and core metrics analysis. Would it still be statistically acceptable to do so if my main goal was to compare community structure?
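The batching I have in mind could be sketched as the dry run below, which only prints the commands rather than running them. The batch names (`batch1`, `batch2`, ...) and every file name derived from them are hypothetical placeholders, as is the helper name `print_batch_commands`; I am also assuming each batch has its own dereplicated table and sequences, and that `qiime feature-table merge` and `merge-seqs` are the right tools for combining the results.

```shell
# Dry run: prints one open-reference clustering command per batch, feeding
# each batch's new reference sequences into the next batch, then prints the
# merge commands. Nothing is executed. All file names are placeholders.
print_batch_commands() {
  ref="$1"   # starting reference, e.g. the PR2 artifact
  shift
  merge_tables=""
  merge_seqs=""
  for batch in "$@"; do
    new_ref="${batch}_new_ref_seqs.qza"
    cmd="qiime vsearch cluster-features-open-reference"
    cmd="$cmd --i-table ${batch}_table.qza --i-sequences ${batch}_seqs.qza"
    cmd="$cmd --i-reference-sequences $ref --p-perc-identity 0.97 --p-threads 16"
    cmd="$cmd --o-clustered-table ${batch}_otu_table.qza"
    cmd="$cmd --o-clustered-sequences ${batch}_otu_seqs.qza"
    cmd="$cmd --o-new-reference-sequences $new_ref"
    printf '%s\n' "$cmd"
    ref="$new_ref"   # the next batch clusters against this batch's output
    merge_tables="$merge_tables --i-tables ${batch}_otu_table.qza"
    merge_seqs="$merge_seqs --i-data ${batch}_otu_seqs.qza"
  done
  printf '%s\n' "qiime feature-table merge$merge_tables --o-merged-table merged_otu_table.qza"
  printf '%s\n' "qiime feature-table merge-seqs$merge_seqs --o-merged-data merged_otu_seqs.qza"
}

print_batch_commands PR2_enriched_V4V5_ref_seqs.qza batch1 batch2 batch3
```

Since each sample would sit in exactly one batch, the merged table should have no overlapping samples, though I have not checked how duplicate feature IDs are handled when merging the sequences.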


I think so. This sounds a lot like open-ref, which was a popular approach 10ish years ago. It's still valid, though reviewer three is going to complain :upside_down_face:

If you are considering closed-ref methods, deblur is designed for big data and supports single-nucleotide resolution:

Thanks for the advice, I decided to proceed with the ASV approach as you suggested. I am now assigning taxonomy instead of being caught up in the sunk-cost fallacy of OTU clustering.
