It's QUITE large (11 GiB), so I had a lot of memory issues; I finally launched a c4.8xlarge instance. The problem is that even with the --verbose flag I don't know where the problem is. The job just stops running, and since it always runs for a looong time the ssh connection often closes, so I'm not always able to see the verbose printout. I have tried to access the logs but I can't find them!
Looks like you are encountering memory issues due to the large dataset (218 samples, with a lot of sequences in each sample!).
You could try decreasing the number of threads (2-4) and, if possible, increasing the allocated memory (if you use an HPC), or using a more powerful machine.
I don't think you'll be able to recover log files when the ssh connection is interrupted, and it's likely that the interruption of the ssh connection is causing the failure. As a next step, I recommend that you use tmux to allow your job to continue even if the ssh connection is interrupted. This will either allow the job to finish successfully, or allow you to re-connect to the server and see the full error message, which will include a path to the log file, if the job doesn't finish.
This post provides a good discussion of how to use tmux for this. It's likely that tmux is already installed on the system you're using, but if not you can install it or use screen.
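To make that concrete, here is a minimal sketch of the tmux workflow (the session name `qiime2` and the command shown inside it are just placeholders for your own long-running job):

```shell
# Start a named tmux session on the server
tmux new -s qiime2

# Inside the session, launch your long-running command, e.g.:
#   qiime ... --verbose 2>&1 | tee run.log
# Detach with Ctrl-b then d; the job keeps running on the server.

# Later, after a fresh ssh login, re-attach to the same session:
tmux attach -t qiime2
```

Piping the output through `tee` (as sketched in the comment) also leaves a copy of the verbose output in a file, so you can inspect it even after the session ends.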
Do you want to give this a try and let us know how it goes?
(Forum moderators: please correct me if I'm wrong and there is a reliable way to access the error log here.)
Hi, I was able to run it in the background and save the --verbose output to a file.
However, when moving on to feature-classifier I have a memory issue again. I know there's a long discussion about it, but I have tried everything and I can't find a way to run it with SILVA.
Thank you for replying. I will try that.
I was actually wondering whether I could just split my rep-seqs (which is actually a FASTA .fna file) into batches, predict the classification for each batch, and then merge the results. Do you think that would work the same as running the whole file? If I'm not training, just classifying, I don't see how splitting the file into batches would be different from running it all together.
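For what it's worth, the splitting step itself could be sketched like this (a minimal sketch: the filenames, the chunk size of 5000, and the classifier artifact name are all hypothetical, and the commented qiime invocations are only an outline — check the options against your QIIME 2 version):

```shell
# Split rep-seqs.fna into chunks of 5000 sequences each:
# chunk_0.fna, chunk_1.fna, ...
awk 'BEGIN{n=-1} /^>/{c++; if(c%5000==1){n++; f=sprintf("chunk_%d.fna", n)}} {print > f}' rep-seqs.fna

# Sketch of the per-chunk classify-and-merge loop (adjust to your setup):
# for f in chunk_*.fna; do
#   qiime tools import --type 'FeatureData[Sequence]' \
#     --input-path "$f" --output-path "${f%.fna}.qza"
#   qiime feature-classifier classify-sklearn \
#     --i-classifier silva-classifier.qza \
#     --i-reads "${f%.fna}.qza" \
#     --o-classification "${f%.fna}-taxonomy.qza"
# done
# qiime feature-table merge-taxa --i-data chunk_*-taxonomy.qza \
#   --o-merged-data taxonomy.qza
```

The awk one-liner starts a new output file every time the running count of `>` header lines crosses a multiple of the chunk size, so each chunk holds complete records regardless of how the sequence lines are wrapped.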