qiime2 OTU picking with hundreds of samples

Nicholas_Bokulich · February 14, 2020, 6:01pm

Thanks for carrying on the discussion @MichelaRiba and @llenzi!

I am not positive this topic is relevant:

- That is the vsearch-based taxonomic classifier, not OTU clustering. Different things are going on under the hood, so we may not be able to compare the two.
- That topic is around 3 years old! I believe the VSEARCH classifier in q2-feature-classifier was still being developed at that time. In any case, the runtime performance of that classifier is now comparable to that of classify-consensus-blast, so that comment is no longer accurate: see here for a runtime comparison of these and other taxonomy classifiers. But again, that is probably not relevant to OTU clustering.

So far nobody else has reported such unusually long runtimes, nor substantial differences in runtime between QIIME 1 (uclust) and QIIME 2 (vsearch). So, as noted before, I still suspect this is due either to characteristics of your dataset or to issues with utilizing resources on your cluster.

But at the end of the day this may just be a fact of life: OTU clustering on this particular dataset you have will take time. More aggressive quality filtering prior to OTU clustering will probably help cut down runtime considerably. Denoising methods will probably also be faster...

Nicholas_Bokulich · February 14, 2020, 6:12pm

@MichelaRiba,
here's another thought: maybe your tmpdir is running out of space?

this forum user mentioned uncovering an error with VSEARCH whereby it does not detect system errors correctly and just keeps running after tmpdir fills up:

This sounds remarkably similar to the symptoms you report.
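
One quick way to test the tmpdir idea is to look at free space on the directory Python would use for temporary files. This is only a sketch: it assumes QIIME 2 picks its temp location the way Python's `tempfile` module does (which honors the `TMPDIR` environment variable), and `tmp_usage` is a name made up for this example.

```python
import shutil
import tempfile


def tmp_usage(path=None):
    """Return (path, free GiB, fraction used) for a temp directory.

    With no argument, uses the directory Python's tempfile module would
    pick (it checks TMPDIR, TEMP and TMP before falling back to defaults).
    """
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return path, usage.free / 2**30, used_fraction


if __name__ == "__main__":
    path, free_gib, used_fraction = tmp_usage()
    print(f"{path}: {free_gib:.1f} GiB free, {used_fraction:.0%} used")
```

Running this on the worker node (not the login node) while the job is active would show whether the temp partition is close to full.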

colinbrislawn · February 14, 2020, 6:15pm

Long thread!

Have you been able to confirm that the worker nodes are still running near 100% CPU usage after 595 hours? This sort of mysterious slowdown has happened to me: I start the process, everything is going full speed... then the database uses up all my RAM and search speed drops to zero.

| reads | hours | reads/h |
|---|---|---|
| 3956736 | 1 | 3.9 M |
| 11327844 | 17 | 0.66 M |
| 89199952 | 595 | 0.14 M |

This is exactly the kind of speed falloff I would expect due to RAM being fully used. I just want to make sure you have plenty of RAM before we move on.
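
For anyone who wants to redo the arithmetic, here is a small sketch that recomputes the reads/h column from the read counts and runtimes reported in this thread. The variable and function names are just for illustration.

```python
# (reads, hours) pairs as reported earlier in this thread
runs = [(3956736, 1), (11327844, 17), (89199952, 595)]


def reads_per_hour(reads, hours):
    """Average throughput over the whole run."""
    return reads / hours


rates = [reads_per_hour(reads, hours) for reads, hours in runs]

for (reads, hours), rate in zip(runs, rates):
    print(f"{reads:>10} reads / {hours:>3} h = {rate / 1e6:.2f} M reads/h")

# How much slower the largest run is, per read, than the smallest one
slowdown = rates[0] / rates[-1]
print(f"slowdown: {slowdown:.1f}x")
```

With rounding instead of truncation, the rates come out as 3.96, 0.67 and 0.15 M reads/h, so the largest run is processing reads roughly 26 times more slowly than the smallest.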

If you ssh into the node that has been running vsearch for 595 hours, I'll bet you that CPU usage is at 4% and you have 0.1 avail Mem.

I think vsearch will throw an error if tmp is used up, but if RAM is used up it will limp along without throwing an error.
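
If reading top by eye is awkward, the memory side of this check can be scripted. The sketch below assumes a Linux node whose `/proc/meminfo` has a `MemAvailable` line (older kernels do not have it); `parse_meminfo` and `available_fraction` are hypothetical helper names.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of physical memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    print(f"available memory: {available_fraction(info):.1%} of total")
```

A value close to zero on the node running the job would support the RAM explanation; a comfortable margin would point elsewhere.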

MichelaRiba · February 17, 2020, 10:33am

Hi, thanks a lot again.
I am going to speak again with our System Administrator,
anyhow I would like to share that regarding tmp directory I can see

4.0K Feb 3 14:33 RtmpUnzDQd
2.0K Feb 13 11:19 tmpBDWTAW
2.0K Feb 13 11:19 tmpCaDSuc
2.0K Feb 13 11:19 tmpHBacsu
2.0K Feb 13 11:21 tmpVgwmMf
1.8K Feb 13 11:21 tmprTdcpB
2.0K Feb 13 11:23 tmp1nb1bV
1.8K Feb 13 11:23 tmp_jNVK8
2.0K Feb 13 11:23 tmpHUpmzB
2.0K Feb 13 11:26 tmpC0mgD5
2.0K Feb 13 11:27 tmpPCds0r
2.0K Feb 13 11:28 tmpWGibzD
2.0K Feb 13 11:29 tmp7BufKa
2.0K Feb 13 11:42 tmpVeL6zt
2.0K Feb 13 13:25 tmp6D2GR7
2.0K Feb 13 13:25 tmpkYASJo
2.0K Feb 13 13:26 tmpt7fF3a
2.0K Feb 13 13:31 tmpbwwhc8
1.8K Feb 13 13:32 tmpW4zVI1
4.0K Feb 13 13:33 qiime2-archive-d1d89k9_
4.0K Feb 13 13:35 qiime2-archive-f8wfe4yx
4.0K Feb 13 13:35 qiime2-provenance-wum623za
4.0K Feb 13 13:35 qiime2-archive-e2xkrnn_
3.9G Feb 13 14:09 tmpb4nhk1c4
715M Feb 13 15:19 tmplp_tfrhn
79M Feb 13 15:19 tmpar7ddpzp
78M Feb 13 15:19 q2-DNAFASTAFormat-grxixobv
0 Feb 13 15:43 q2-DNAFASTAFormat-3a4c6h1g

These are all tmp files produced by vsearch closed-reference clustering (I followed the suggestion to try closed reference).
The last file has been empty for 4 days, which sounds pretty strange to me. I do not see any errors because I only get an error report at the end of the job...
The search/clustering step certainly finished quickly and ran in parallel (as we already discussed), using 12 threads on 12 cores. The following phase, database mapping, shows up as "qiime" when I run top on the node. It does not run in parallel (as discussed before, so that is fine), and top reports the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16270 mriba 20 0 12.8g 12g 31m R 99.9 1.2 5555:13 qiime

The fact that the tmp file is empty may point to a problem with tmp space, as you suggested. Regarding memory, I cannot tell whether the value I see is the available memory, or whether it is sufficient or limiting.
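
As a rough aid for reading that top line: RES is the memory the process currently holds, and %MEM is that amount as a share of the node's physical memory, so dividing one by the other gives an estimate of total RAM. The sketch below assumes the default top column order shown above and that TIME+ is in minutes:seconds; both displayed values are rounded, so the result is only approximate. The helper names are made up for this example.

```python
UNITS = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}


def parse_size(field):
    """Convert a top size field such as '12g' or '31m' to bytes."""
    suffix = field[-1].lower()
    if suffix in UNITS:
        return float(field[:-1]) * UNITS[suffix]
    return float(field) * 2**10  # plain numbers are KiB in top's default view


def implied_total_gib(top_line):
    """Estimate the node's physical memory in GiB from RES and %MEM."""
    fields = top_line.split()
    res_bytes = parse_size(fields[5])  # RES column
    mem_percent = float(fields[9])     # %MEM column
    return res_bytes / (mem_percent / 100) / 2**30


def cpu_hours(time_plus):
    """Convert a TIME+ value, assumed to be minutes:seconds, to hours."""
    minutes, seconds = time_plus.split(":")
    return (int(minutes) * 60 + float(seconds)) / 3600


line = "16270 mriba 20 0 12.8g 12g 31m R 99.9 1.2 5555:13 qiime"
print(f"implied total RAM: about {implied_total_gib(line):.0f} GiB")
print(f"CPU time so far: about {cpu_hours(line.split()[10]):.1f} h")
```

Under those assumptions, the 12g reported in RES is the memory used by the qiime process, not the memory available, and it is a small share of the node's total. It says nothing about what other jobs on the same node are using.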

thanks again

Michela

colinbrislawn · February 17, 2020, 5:03pm

Hello Michela,

I think you are correct; this process might be progressing slowly due to one of the non-parallelized steps. Your System Administrator team might be able to find other clues about where and why this process is stalling.

We may have covered this already, but I want to make sure to mention another solution to your problem: switching to a denoising method, like DADA2.
DADA2 was designed for big data, and might be a better fit for your data set than VSEARCH.

As a bonus, DADA2 produces high-resolution ASVs, which are arguably better than OTUs!

Colin

ismailp · February 18, 2020, 10:24pm

Sorry, I don't want to mix my own business into this thread, but I believe it's exactly the opposite. See xmalloc fails are fatal. Return values of file system functions, however, are not handled in vsearch, except for fopen. That also matches my own experience with a full /tmp partition while running vsearch.

colinbrislawn · February 19, 2020, 2:16pm

Thanks for jumping in! I appreciate your insight into this

To the /tmp/ drive!

Colin

P.S. Does xmalloc fail when allocating on a RAM disk, or would it only fail when there is no memory left of any kind?

MichelaRiba · February 21, 2020, 9:40am

Hi,

thanks a lot, I will check when our cluster is up again

MichelaRiba · February 21, 2020, 9:41am

Thanks a lot for the additional suggestion, I will check when our cluster is up

Michela

ismailp · February 26, 2020, 6:19pm

You're welcome. xmalloc uses a platform-specific aligned allocation function. In both the POSIX and Windows cases, the allocation functions request heap space from the operating system (I am not so familiar with Windows). Allocation fails if the operating system says there is no memory. That often means there is no memory left of any kind. There are many reasons why this might happen: no physical memory is left and swapping is disabled, swapping is enabled but there is no free swap space, the user has exceeded their quota, or memory is fragmented (memory is available, but there is no contiguous block of the requested size).

thermokarst · March 12, 2020, 1:12pm

5 posts were split to a new topic: Performance Investigation of vsearch and q2-vsearch