Error following COI database tutorial

Hi there,

I’ve been following this amazing tutorial to build my own COI database for a gut content metabarcoding project (which is in fact my first metabarcoding project, so I’m very new to the field). Up until “Step 3 – Dereplicating”, everything went fine, but then I ran into some trouble.

When trying to run the qiime rescript dereplicate command, I cannot obtain the desired output and only get a final “Killed” message. After running the command with the --verbose option, I got the following output, which didn’t really help me figure out the source of the problem:

qiime rescript dereplicate --i-sequences bold_ambi_hpoly_length_filtd_seqs.qza --i-taxa bold_rawTaxa.qza --p-mode 'super' --p-derep-prefix --o-dereplicated-sequences bold_derep1_seqs.qza --o-dereplicated-taxa bold_derep1_taxa.qza --p-threads 3 --verbose

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_prefix /tmp/qiime2/martindogniez/data/056f2a80-816e-4d12-a1e6-2ad92a9e69d3/data/dna-sequences.fasta --output /tmp/tmpb5l1koss --uc /tmp/tmpznh2_er8 --xsize --threads 5

WARNING: The derep_prefix command does not support multithreading.
Only 1 thread used.
vsearch v2.22.1_linux_x86_64, 15.1GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2/martindogniez/data/056f2a80-816e-4d12-a1e6-2ad92a9e69d3/data/dna-sequences.fasta 100%
6044907994 nt in 9446234 seqs, min 250, max 1600, avg 640
Sorting by length 100%
Dereplicating 100%
Sorting 100%
4032586 unique sequences, avg cluster 2.3, median 1, max 15293
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
Killed

My guess would be that I’m running into some memory issue, as the initial sequence file is really large (9,446,234 sequences), and I’m running the analysis on my own laptop for now (32GB of RAM). However, I’m planning to switch to my university’s cluster for the classification part, so I wanted to know if this would solve the problem, or if the issue lies somewhere else.

Thanks in advance for any help !


Hi @Martin_D,

I could be wrong, but it appears that vsearch clustering is working, but failing at the writing stage? :thinking:

Could it be a storage issue? Otherwise, I agree that it could be a memory issue with post-processing, as it appears that your machine has ~16 GB, not ~32 GB, of RAM.

On another note, I'd likely set --p-mode to uniq, lca, or majority, instead of super. At least, you should be aware of how super works, as outlined here.
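For example, re-running with the mode swapped would look something like this (just a sketch; the file names are copied from your original command, so adjust as needed):

```shell
# Same command as before, only --p-mode changed from 'super' to 'uniq'
qiime rescript dereplicate \
  --i-sequences bold_ambi_hpoly_length_filtd_seqs.qza \
  --i-taxa bold_rawTaxa.qza \
  --p-mode 'uniq' \
  --p-derep-prefix \
  --o-dereplicated-sequences bold_derep1_seqs.qza \
  --o-dereplicated-taxa bold_derep1_taxa.qza \
  --p-threads 3 \
  --verbose
```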

-Mike


I agree with @SoilRotifer - if your step is failing after reading and sorting, it’s not your RAM.
But if it fails at a writing step, that could be a storage issue; that’s easy to check: is your personal computer maxed out on disk storage?
Two other things:

  1. Are you writing to an attached (external) drive, or something mounted internally?
  2. Can you try re-running it? If you do, do you get the same error?
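A quick way to check both the visible RAM and the free disk space from inside your Linux environment (a generic sketch; the paths are assumptions, so point them at wherever your files actually live):

```shell
# How much RAM does the Linux environment actually see?
free -h

# Free space on the filesystem holding /tmp, where vsearch writes its
# temporary output and .uc files
df -h /tmp

# Free space in the current working directory, where the final .qza files land
df -h .
```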

Hi @devonorourke, hi @SoilRotifer,

Thanks so much for your quick suggestions, you're really on the ball! Indeed, after learning more about how super mode works, it wasn't the best tool for me here. I should definitely force myself to go deeper into understanding how all these tools work...

I tried the lca and majority modes, with exactly the same disappointing result as super mode; fortunately, the third try with uniq mode seemed to work properly, reducing my number of sequences from 9,446,234 to 4,105,931! Nevertheless, I'm very curious as to why the lca and majority modes also failed. If you have any other links to descriptions of the various methods available with qiime rescript dereplicate, I'd be happy to browse them!

@SoilRotifer Regarding the storage issue, I still have ~70 GB free, which is not plenty but should be enough for files of this kind, I think? But for the RAM, you're pointing out an interesting issue I never noticed: the installed RAM on my computer is 32 GB, but it would seem that only 16 GB is usable here? It may have to do with the fact that I'm working in the Windows Subsystem for Linux (WSL) on my Windows machine; I should dig into that and any misconfiguration on my side, but that's definitely outside the scope of this forum!

@devonorourke For now I'm not using any external drive, and I avoided linking any of the files or directories used in my Linux work to my cloud storage.

Regardless, many thanks for your precious help!


See this link. Specifically:

Since WSL2 is a virtual machine you'll need to assign it resources. Microsoft provides some sensible defaults. WSL2 is allowed access to all CPU cores and GPU cores if you have WSLg installed. Memory is limited to half of your system's memory.

I'd assume that if you increase the memory available to WSL, these command options should work? :man_shrugging:

But yeah, it looks like storage may not be an issue... assuming that WSL has access to all available disk storage, which I think it should.
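For reference, WSL2 resource limits can be raised with a `.wslconfig` file in your Windows user profile (`%UserProfile%\.wslconfig`). A minimal sketch; the 28GB figure is just an example for a 32GB machine:

```ini
# %UserProfile%\.wslconfig  (lives on the Windows side, not inside Linux)
[wsl2]
# Let WSL2 see most of the host RAM instead of the default half
memory=28GB
```

After saving the file, run `wsl --shutdown` from Windows so the new limit takes effect on the next WSL start.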