Performance Investigation of vsearch and q2-vsearch

Hi,

I am still troubleshooting.
I am writing to report some information about closed-reference clustering:
qiime vsearch cluster-features-closed-reference \
--i-table table.qza \
--i-sequences rep-seqs.qza \
--i-reference-sequences 85_otus.qza \
--p-perc-identity 0.85 \
--o-clustered-table table-cr-85.qza \
--o-clustered-sequences rep-seqs-cr-85.qza \
--o-unmatched-sequences unmatched-cr-85.qza \
--verbose

Below I report the output of the --verbose option.
In particular I noticed:
vsearch v2.7.0_linux_x86_64, 70.8GB RAM, 12 cores - which I do not understand, since I did not request those resources. I am running on a single node of the cluster.

7276798 nt in 5088 seqs, min 1258, max 2353, avg 1430

Masking 100%

Counting k-mers 100%

Creating k-mer index 100%

Searching 100%

Matching query sequences: 7958865 of 8119151 (98.03%)

Running external command line application. This may print messages to stdout and/or stderr.

The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --sortbysize /lustre2/scratch/tmp/tmp6p2i6yqj --xsize --output /lustre2/scratch/tmp/q2-DNAFASTAFormat-9k_7183q

vsearch v2.7.0_linux_x86_64, 70.8GB RAM, 12 cores

https://github.com/torognes/vsearch

Reading file /lustre2/scratch/tmp/tmp6p2i6yqj 100%

73054219 nt in 160286 seqs, min 302, max 596, avg 456

Getting sizes 100%

Sorting 100%

Median abundance: 1

Writing output 100%

In addition, I am recording CPU and memory usage with the following command:

while ps -p 31835 --no-headers --format "etime pid %cpu %mem rss"; do sleep 10; done > test.txt

Here 31835 is the PID of the process I want to monitor.
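As a side note, a minimal sketch of an equivalent loop that looks the processes up by name rather than by a hard-coded PID; the process names qiime and vsearch and the log file name monitor.txt are assumptions based on what top shows:

# log CPU/memory every 10 seconds for any running process whose command
# line matches "qiime" or "vsearch"; the loop ends once neither is running
while pids=$(pgrep -d, -f 'qiime|vsearch'); do
    ps -p "$pids" --no-headers --format "etime pid comm %cpu %mem rss"
    sleep 10
done > monitor.txt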

I saw that the command basically starts as a single process, shown as qiime in top (the one I mentioned above).
Afterwards a second process appears, shown as vsearc. That second process terminates,
but the first one is still there: it keeps running while the verbose output stays where it is.

I would be grateful if you could tell me what is keeping the program in that state.

Thanks a lot again

Michela


That vsearch message is just reporting stats about the node the command is currently running on - apparently it has 70.8 GB RAM and 12 cores (a core is different from a node - think "computer" for node and "CPU" for core).
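If you want to double-check what vsearch is reporting, a quick sketch run directly on the compute node (these are standard Linux commands, nothing QIIME-specific):

free -h   # total memory on the node - this is what vsearch reports, not what your job requested
nproc     # number of CPU cores visible on the node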

Which process was that? vsearch?

I'm not sure I understand the question - in what way is the program staying?

Thanks!

Thanks a lot first of all!

Hi, as I mentioned in the previous post, the run apparently splits into two processes:
31835, which starts but does not end, and appears as qiime in the top command;

and another one, which starts at a certain point, is shown as vsearc in top,
and runs to completion while the first one is still running.

I included that output to highlight the number of sequences I am dealing with.

I am somewhat surprised, though, that you ask about behaviour that should surely happen to everyone running the program.
I am referring to the two processes; I imagine this happens to everybody launching that vsearch command.
Could you please comment on that?

Why does one process start,
then another one start,
and the first one not end?

How is this related to how the program is designed?

Thank you very much in advance.
I am convinced that the process that is stalling is the one named qiime,

and also convinced that we should solve this!

In addition, I looked at the temporary files:
-rw------- 1 mriba ctgb 78M Mar 5 20:14 q2-DNAFASTAFormat-9k_7183q
-rw------- 1 mriba ctgb 0 Mar 5 20:29 q2-DNAFASTAFormat-a_u4dtn_
The second one has been at 0 bytes for 3 days; is that the stalling point?


Of course :heart:

Yes, but what is the name of the process? The number "31835" is a number specific to the machine you ran it on, and isn't anything we can use for diagnosis. What was the name of the process, or the command that was being run? I suspect it is the "qiime" CLI tool that you are seeing.

I'm sorry, I don't understand the question. I will try to explain a little bit about how QIIME 2 works; maybe that will answer your questions. QIIME 2 is an interface-agnostic framework that facilitates functionality defined in plugins. Various user interfaces coordinate with plugins through the QIIME 2 framework. When you run a q2-vsearch command via q2cli (the command line interface for QIIME 2), the q2cli tool coordinates file inputs with the framework, which then passes that info on to the plugin. This plugin uses the vsearch tool, so after the framework hands the data inputs off to the plugin, the plugin passes them to vsearch. When vsearch is done, it hands its outputs back to the framework, which validates and packages the results, which are then handed back to q2cli. Check out this diagram for more detail:

https://dev.qiime2.org/latest/architecture/#following-a-command-through-qiime-2

If you are seeing long-running processes, I am not surprised, because the q2cli and framework both have to run at least as long as vsearch.

I am not sure why the process of saving the data would hang for you like this - as far as I know you are the only user to report this particular problem with saving being slow for this particular action. I am wondering if your cluster is somehow cleaning up processes or temp space while your command is running, which might cause things to stall. If you want to share data with us so we can attempt to reproduce (or verify) the problem, please send me a link in a private message.


Hi,

thanks a lot.

Sorry, let me be more specific and summarize:

I am launching the following:
qiime vsearch cluster-features-closed-reference \
--i-table table.qza \
--i-sequences rep-seqs.qza \
--i-reference-sequences 85_otus.qza \
--p-perc-identity 0.85 \
--p-threads $(CORES) \
--o-clustered-table table-cr-85.qza \
--o-clustered-sequences rep-seqs-cr-85.qza \
--o-unmatched-sequences unmatched-cr-85.qza \
--verbose

I am launching on a node of a cluster;
I am not requesting parallelisation this time.

I launch it in a screen session,
open another terminal,
connect to the node, and run the top command.
First I see one process:

31835, named qiime, which I start monitoring:

while ps -p 31835 --no-headers --format "etime pid %cpu %mem rss"; do sleep 10; done > test.txt

Afterwards I see another process, 3834, shown as vsearc, which I also monitor:
while ps -p 3834 --no-headers --format "etime pid %cpu %mem rss"; do sleep 10; done > testvsearc.txt

I launch those monitoring commands in separate screen sessions.
The monitoring of 3834 finished,

but the other one, the qiime process 31835, did not; so, in line with your comment, it keeps running even longer than vsearc.

In addition, if I look at the temporary files, the most recently created one is still at 0 bytes. That seems strange to me; I would expect it to grow.

Regarding the possibility of sharing data, I need to ask for permission, and with the COVID emergency in Italy obtaining it may be slower than usual.
Could I instead process some data you already know and give you feedback on that?

I have to confirm, anyhow, that:

  • on my personal MacBook I can process subsampled samples very fast - so fast that I cannot even start the monitoring process;

  • on the cluster I tried different datasets and, with some delay, I could process 6 and 30 samples.
    I am fairly convinced that if you can run it, I should manage to do the same on our system!

I start the pipeline from a manifest, importing already joined fastq files.
Could you, if needed, give me the same kind of data?
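For clarity, a minimal sketch of the kind of manifest import meant here; the sample IDs, file paths, and output name are hypothetical placeholders:

# manifest.tsv (two columns separated by a tab)
sample-id	absolute-filepath
sample-1	/path/to/sample-1.joined.fastq.gz
sample-2	/path/to/sample-2.joined.fastq.gz

qiime tools import \
  --type 'SampleData[JoinedSequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path joined-demux.qza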

Perhaps I could also use only a subset, say 10 samples from my dataset, and give you the monitoring statistics for those; but if the problem is related to the temporary files, this would not solve it.

How could I give you feedback on the temporary files?
Could you suggest a way to monitor what is happening with them on the cluster?
(Besides asking the system administrator, who did not give the problem much importance when I first raised it; maybe I should push harder if we have evidence that this is the problem.)
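One simple option, as a sketch (the temp path is the one from the verbose output above; the interval and log file name are arbitrary):

# append a timestamped listing of the QIIME 2 temp files every 10 minutes
while true; do
    date
    ls -lh /lustre2/scratch/tmp/q2-DNAFASTAFormat-* /lustre2/scratch/tmp/tmp* 2>/dev/null
    sleep 600
done >> tmp-files-monitor.log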

How are you dealing with the SARS-CoV-2 situation?
Is everything OK for you?

Thanks a lot

Michela

Thanks a lot.

In addition, I cleaned up my entire tmp folder.

I would like to see whether, with more space, this gets solved.

Thanks really

Michela


Great, thank you - this summary makes a lot of sense, and I think it aligns with my last post, too.

Yeah, something is strange here, and is not something I have seen before - is it possible that something related to the cluster queueing system is interfering here?

I'm so sorry to hear that, take your time.

I can take a look - my first thought is to find a public dataset on qiita and compare results. I will let you know.

You know, another thought just popped into my head - you could override the temporary files directory by setting a variable in your job submission script:

export TMPDIR='/some/path/that/is/writable'
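For example, a minimal sketch of a launch script built around that idea - the scratch path here is hypothetical, and the qiime command is the one from earlier in this thread:

#!/bin/bash
# point QIIME 2 (which uses Python's tempfile module, and therefore honors TMPDIR)
# at a roomy, writable scratch area
export TMPDIR=/lustre2/scratch/mriba/qiime2-tmp
mkdir -p "$TMPDIR"

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-cr-85.qza \
  --o-clustered-sequences rep-seqs-cr-85.qza \
  --o-unmatched-sequences unmatched-cr-85.qza \
  --verbose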

Thanks for asking, things are quiet in northern Arizona, where much of the QIIME 2 core team lives - we are seeing a lot of social distancing techniques begin to be used (lots of travel and events cancelled, for example).

I hope all is well with you!

:qiime2:

Hi,

thanks a lot for kind help and discussion.

I'd like to share with you the progress of my test.
This is the verbose output, still from the same run (started on 13th March):

Command: vsearch --sortbysize /lustre2/scratch/tmp/tmpthopu39h --xsize --output /lustre2/scratch/tmp/q2-DNAFASTAFormat-gzqu60t0

vsearch v2.7.0_linux_x86_64, 70.6GB RAM, 12 cores

Reading file /lustre2/scratch/tmp/tmpthopu39h 100%
73054219 nt in 160286 seqs, min 302, max 596, avg 456
Getting sizes 100%
Sorting 100%
Median abundance: 1
Writing output 100%

If I go to the tmp directory:
drwx------ 3 mriba ctgb 4096 Mar 13 16:54 qiime2-archive-giyaeuhw
drwx------ 3 mriba ctgb 4096 Mar 13 16:54 qiime2-archive-rbko_jei
drwx------ 4 mriba ctgb 4096 Mar 13 16:54 qiime2-provenance-v09mc387
drwx------ 3 mriba ctgb 4096 Mar 13 16:54 qiime2-archive-glua0o64
-rw------- 1 mriba ctgb 4119096406 Mar 13 17:15 tmpragxymb1
-rw------- 1 mriba ctgb 81886213 Mar 13 19:21 tmpthopu39h
-rw------- 1 mriba ctgb 748824247 Mar 13 19:21 tmp6mo50z5a
-rw------- 1 mriba ctgb 80763834 Mar 13 19:21 q2-DNAFASTAFormat-gzqu60t0
-rw------- 1 mriba ctgb 0 Mar 13 19:36 q2-DNAFASTAFormat-futhqjlb

The file q2-DNAFASTAFormat-gzqu60t0 does not seem to be growing, and the last one (q2-DNAFASTAFormat-futhqjlb) is still at 0 bytes.

On the positive side, monitoring the processes with the ps command shows the following:
[mriba@b028 test]$ ps x
PID TTY STAT TIME COMMAND
2344 pts/3 S+ 21:17 /lustre1/ctgb-usr/local/miniconda3/envs/qiime2.1/bin/
2448 ? Ss 0:00 SCREEN -S ps
2449 pts/1 Ss 0:00 /bin/bash
4555 pts/3 Sl+ 347:17 vsearch --usearch_global /lustre2/scratch/tmp/tmpragx
4682 ? Ss 0:00 SCREEN -S vsearc
4683 pts/5 Ss 0:00 /bin/bash
11427 ? S 0:00 sshd: mriba@pts/4
11428 pts/4 Ss 0:00 -bash
12112 pts/1 S+ 0:00 sleep 10
12116 pts/5 S+ 0:00 sleep 10
12122 pts/4 R+ 0:00 ps x
29867 ? S 0:00 sshd: mriba@pts/0
29868 pts/0 Ss+ 0:00 -bash
32136 ? Ss 0:00 SCREEN -S Michaela
32137 pts/3 Ss 0:00 /bin/bash

Process 4555 has now ended
(I have recordings of its CPU and RAM usage) and it was parallelised.

Now the command gives:
ps x
PID TTY STAT TIME COMMAND
2344 pts/3 R+ 3762:01 /lustre1/ctgb-usr/local/miniconda3/envs/qiime2.1/bin/python /lustre1/ctgb-
2448 ? Ss 0:00 SCREEN -S ps
2449 pts/1 Ss 0:15 /bin/bash
4682 ? Ss 0:00 SCREEN -S vsearc
4683 pts/5 Ss+ 0:00 /bin/bash
29243 ? S 0:00 sshd: mriba@pts/0
29244 pts/0 Ss 0:00 -bash
30650 pts/1 S+ 0:00 sleep 10
30652 pts/0 R+ 0:00 ps x
32136 ? Ss 0:00 SCREEN -S Michaela
32137 pts/3 Ss 0:00 /bin/bash

Process 2344, the qiime process, is still running, meaning it is doing something.

The top command shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2344 mriba 20 0 12.854g 0.012t 1508 R 100.0 17.4 3768:09 qiime

I am now adding some reports from the ps recordings of the qiime process:
while ps -p 2344 --no-headers --format "etime pid %cpu %mem rss"; do sleep 10; done > testb028qiime.txt
Head of the file with the CPU and RAM recordings for process 2344 (see above):

02:20 2344 99.9 4.2 3172760
02:30 2344 99.9 4.2 3172760
02:40 2344 99.9 4.2 3172760
02:50 2344 99.9 4.2 3172760

And the tail of the file:

5-16:13:58 2344 98.4 17.3 12849844
5-16:14:08 2344 98.4 17.3 12849844
5-16:14:18 2344 98.4 17.3 12849844

In addition, after 6 days the tmp files are still:
-rw------- 1 mriba ctgb 78M Mar 13 19:21 q2-DNAFASTAFormat-gzqu60t0
-rw------- 1 mriba ctgb 0 Mar 13 19:36 q2-DNAFASTAFormat-futhqjlb

That second file is still at 0 bytes.

Thanks a lot again for your patience.

If you see something completely wrong from my side please tell me!

thanks a lot

I should also report that in past months the open-reference procedure completed successfully with this dataset; however, it took 25 days.

Thanks a lot.
In Italy we are locked down at home; right now it is the only way to slow the spread of the infection. Starting those kinds of policies early would certainly help,
apart from isolating infected cases, which should be done and was sometimes not done properly, as some local outbreaks showed.
I am sending you the link to the CDC suggestions:
https://www.cdc.gov/coronavirus/2019-ncov/prepare/checklist-household-ready.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fcommunity%2Fhome%2Fchecklist-household-ready.html

Michela


Hi, do you have news for me? Thanks a lot

Michela

Hi @MichelaRiba - no news yet, this is still on my TODO list. Things have been a bit disrupted around here lately (the Flagstaff QIIME 2 team is now mostly all remote). Hopefully in the next week I will have something for you.

Thanks!
:qiime2:


Thanks a lot for your kind reply; I really appreciate the effort and the gift you are preparing for me, it is of great value! Grazie. Michela

@thermokarst, Hi,
I am writing with good news: I have just found the output of the ps monitoring. Could I attach those files?

For now I am sending just the tail of the ps recording of the qiime process:

11-12:13:02 2344 99.2 25.6 19021720
11-12:13:12 2344 99.2 25.6 19018520
11-12:13:22 2344 99.2 25.6 19018520
11-12:13:32 2344 99.2 25.6 19018520
11-12:13:42 2344 99.2 25.6 18994712
11-12:13:52 2344 99.2 25.5 18947864
11-12:14:02 2344 99.2 25.0 18547480
11-12:14:12 2344 99.2 15.9 11835188
11-12:14:22 2344 99.2 11.8 8745072
11-12:14:32 2344 99.2 11.7 8736744

This refers to the previous command: vsearch closed-reference clustering with 4 threads.

Maybe you could comment on the run time in relation to the number of initial sequences.

Do you think I could reduce the time required?

Do you also think that using open reference would take much more time?

I will wait for your tests, but I wanted to share this small piece of good news with you.

This is the original verbose output:
Command: vsearch --sortbysize /lustre2/scratch/tmp/tmpthopu39h --xsize --output /lustre2/scratch/tmp/q2-DNAFASTAFormat-gzqu60t0
vsearch v2.7.0_linux_x86_64, 70.6GB RAM, 12 cores

Reading file /lustre2/scratch/tmp/tmpthopu39h 100%
73054219 nt in 160286 seqs, min 302, max 596, avg 456
Getting sizes 100%
Sorting 100%
Median abundance: 1
Writing output 100%
Saved FeatureTable[Frequency] to: table-cr-85.qza
Saved FeatureData[Sequence] to: rep-seqs-cr-85.qza
Saved FeatureData[Sequence] to: unmatched-cr-85.qza

I am now adding results regarding the unmatched sequences:

unzip -d unmatched_seq unmatched-cr-85.qza
...
less dna-sequences.fasta | grep ">" | wc -l
160286

unzip -d repseqs rep-seqs-cr-85.qza
...
less dna-sequences.fasta | grep ">" | wc -l
1541
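As an aside, a sketch of getting the same counts without unpacking the whole archive; this assumes the standard .qza layout with data/dna-sequences.fasta inside the zip:

# stream the FASTA out of each artifact and count header lines
unzip -p unmatched-cr-85.qza '*/data/dna-sequences.fasta' | grep -c '>'
unzip -p rep-seqs-cr-85.qza '*/data/dna-sequences.fasta' | grep -c '>'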

If I understand correctly, the number of matched sequences is very low compared with the unmatched file, isn't it?
Thanks a lot
Michela

Just a quick update - I have been working on this a bunch this week - I'll provide more details later, but I think you identified a pretty severe performance bug in the closed reference clustering algorithm. More soon!


Hi,

Thanks a lot for your feedback, very kind of you. I am looking forward to your updates!

I was a bit detached these past days since I was helping with grant application writing, but I'm going to be "more practically" involved again pretty soon.

Have a nice weekend

Michela


Hi @MichelaRiba! Okay, time for that update that I promised. I started working on a simple analysis using the dataset provided here (link). First, I did a de novo clustering - it finished quickly. Then I moved on to closed-reference clustering, and I observed the same issue you found: vsearch completes rather quickly (maybe 10-20 minutes), but then something hangs and runs extremely slowly. I began detailed debugging and identified a huge performance bug in the q2-vsearch code. I owe you an apology - thank you for sticking with it! Basically, the bug only presents itself in certain cases - namely when you have very many unique features before clustering. I have submitted a patch (link), and the current plan is for this to be included in the upcoming QIIME 2 2020.5 release (link). At this point I think there is no need for you to attempt to reproduce it. Thank you again for your perseverance, this was a hard one to diagnose and reproduce! Thanks!

:qiime2:


Hi,
thanks a lot for this!
It was interesting for me because the run did actually finish in the end, so it was working!
I am really enthusiastic about this and I truly thank you for your patience and consideration.

I would just raise one small point:
are you planning to apply that patch to the open-reference vsearch OTU picking as well? That would be great - and if it is already included, even better!


Yes! The open-reference command is what we call a "Pipeline" in QIIME 2 - it is made up of other QIIME 2 actions, so, any changes to closed-ref will automatically be included in the open-ref pipeline!

:qiime2:

Hi,
That's fantastic.

I think I am going to ask our system administrator to install the new version,
QIIME 2 2020.5 (once it is finalised),

and move to using it.

Thanks a lot!
Michela

