It's quite large (11 GiB), so I had a lot of memory issues; I finally launched a c4.8xlarge instance. The problem is that even with the --verbose flag I don't know where the problem is. The job just stops running, and since it always runs for a long time, the SSH connection is often closed, so I'm not always able to see the verbose printout. I have tried to access the logs but I cannot find them!
Hi!
Looks like you are encountering memory issues due to your large dataset (218 samples with a lot of sequences in each sample!).
You could try decreasing the number of threads (2-4) and, if possible, increasing the allocated memory (if you use HPC), or use a more powerful machine.
Hi @danielavarelat,
I don't think you'll be able to recover log files when the ssh connection is interrupted, and it's likely that the interruption of the ssh connection is causing the failure. As a next step, I recommend that you use tmux to allow your job to continue even if the ssh connection is interrupted. This will either allow the job to finish successfully or, if the job doesn't finish, allow you to re-connect to the server and see the full error message, which will include a path to the log file.
This post provides a good discussion of how to use tmux for this. It's likely that tmux is already installed on the system you're using, but if not you can install it or use screen.
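As a sketch, a typical tmux workflow might look like the following (the session name "qiime" and the example command are placeholders; substitute your own):

```shell
# Start a new named tmux session (name is arbitrary)
tmux new -s qiime

# Inside the session, launch the long-running job and capture output, e.g.:
#   qiime dada2 denoise-paired ... --verbose > dada2.log 2>&1

# Detach from the session with Ctrl-b then d;
# the job keeps running even if your SSH connection drops.

# Later, SSH back in and re-attach to see where the job is:
tmux attach -t qiime
```

Redirecting stdout and stderr to a log file as shown also gives you a persistent copy of the verbose output, independent of the terminal session.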
Do you want to give this a try and let us know how it goes?
(Forum moderators: please correct me if I'm wrong and there is a reliable way to access the error log here.)
Hi, I was able to run it in the background and save the --verbose output to a file.
However, when moving on to feature-classifier I have a memory issue again. I know there's a long discussion about it, but I have tried everything and I can't find a way to run it with Silva.
Hi @danielavarelat,
Glad to hear that you were able to get past the DADA2 step!
Memory issues when classifying with Silva are a known problem. Some relatively recent discussion of this, and tips, are consolidated in this post by @Nicholas_Bokulich.
@Mehrbod_Estaki also suggested that filtering low-abundance features might help at this stage. You could do that with qiime feature-table filter-features --p-min-samples 2 ... (to include only features/ASVs that are present in at least two samples). You would do that filtering on your feature table, and then filter the features from your repseq.qza file using qiime feature-table filter-seqs --i-table .... This type of filter can sometimes reduce the feature count by as much as half, which can help a bit with memory.
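A minimal sketch of that two-step filter, assuming your artifacts are named table.qza and repseq.qza (adjust the filenames to match your own):

```shell
# Step 1: keep only features present in at least two samples
qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-samples 2 \
  --o-filtered-table table-filtered.qza

# Step 2: filter the representative sequences so they match
# the features remaining in the filtered table
qiime feature-table filter-seqs \
  --i-data repseq.qza \
  --i-table table-filtered.qza \
  --o-filtered-data repseq-filtered.qza
```

You would then pass repseq-filtered.qza to the classifier instead of the original repseq.qza.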
Another alternative would be to use a different reference database for classification, such as Greengenes2 (classifiers available here) or GTDB (see here for details on how to train one of those).
Thank you for replying. I will try that.
I was actually wondering whether I could just split my repseq (which is actually a .fna FASTA file) into batches, classify each batch separately, and then merge the results. Do you think that would give the same result as running the whole file? If I'm not training, just classifying, I don't see how splitting the file into batches would differ from running it all together.