I want to mask my very large (>1.5 million sequences) aligned_rep_seqsDNA.qza data. I performed the alignment with SINA 1.7.2 using SILVA as a reference database, and imported the FASTA output into QIIME2. The file is valid according to
qiime tools validate aligned_rep_seqsDNA.qza
I'm using QIIME2-2021.2 on my institution's HPC (miniconda3/4.8.5, SLURM scheduler). I have been attempting to mask the data so that I can go on to build a tree. I ran the following:
qiime alignment mask \
  --i-alignment aligned_rep_seqsDNA.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza
This has required a huge amount of computing power. I get out-of-memory errors whenever I allocate less than 180GB of RAM. My most recent attempt timed out and was killed after a week.
At this point I am unsure if:
• there’s an infinite loop somewhere in what I’m trying to do?
• I need to bite the bullet and assign even more computing time? The SLURM scheduler automatically times out after 48 hours and requires an explicit time request for any job that needs to run longer.
• there’s another approach I should be considering?
Run with 180GB of RAM allocated for 2 days
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 31
CPU Utilized: 1-23:57:19
CPU Efficiency: 3.22% of 62-00:09:49 core-walltime
Job Wall-clock time: 2-00:00:19
Memory Utilized: 160.72 GB
Memory Efficiency: 89.29% of 180.00 GB
Run with 184GB of RAM allocated for 7 days
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 6-23:59:10
CPU Efficiency: 3.12% of 224-00:11:12 core-walltime
Job Wall-clock time: 7-00:00:21
Memory Utilized: 160.72 GB
Memory Efficiency: 87.35% of 184.00 GB
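The two job reports above can be read as simple arithmetic. A minimal sketch, assuming the reported durations use SLURM's [D-]HH:MM:SS layout (the helper names are made up for illustration): dividing CPU time by wall-clock time gives the average number of cores that were actually busy. For both runs this works out to just under one core, which would be consistent with the masking step running on a single thread while the other ~30 allocated cores sat idle. It does not prove that, and it says nothing about whether the job was making progress.

```python
def slurm_seconds(stamp):
    """Convert a SLURM-style [D-]HH:MM:SS duration to seconds."""
    days, _, clock = stamp.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def busy_cores(cpu_utilized, wall_clock):
    """Average number of cores kept busy over the job's wall-clock time."""
    return slurm_seconds(cpu_utilized) / slurm_seconds(wall_clock)


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """Fraction of the allocated core-time that was actually used."""
    return busy_cores(cpu_utilized, wall_clock) / cores


# Values copied from the two job reports: (CPU Utilized, Wall-clock, cores)
runs = {
    "2-day run": ("1-23:57:19", "2-00:00:19", 31),
    "7-day run": ("6-23:59:10", "7-00:00:21", 32),
}

for label, (cpu, wall, cores) in runs.items():
    print(
        f"{label}: {busy_cores(cpu, wall):.3f} cores busy on average, "
        f"{cpu_efficiency(cpu, wall, cores):.2%} CPU efficiency"
    )
```

If that reading is right, requesting one core instead of 31 or 32 would not slow the job down and may make the allocation easier to schedule.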
I’ve successfully completed the masking on a subset of 50 lines from the file, which took 17 seconds and basically 0 memory. I’m currently running it on 200,000 lines to make sure that can be accomplished. (I forgot to increase the memory for that run; it went for 12 hours before exceeding the 5GB I had assigned.)
I’ve taken a look at
• @devonorourke’s response in “phylogenetic analysis” (#3 by devonorourke), which recommends taking a look at the “Building a COI database from BOLD references” tutorial for MAFFT tricks. I realize I’ve already bypassed MAFFT for alignment purposes, but since masking is usually packaged with it, I figured I’d check. The database building tutorial is far enough over my head that I can’t tell whether it is even relevant for what I’m trying to do.
• The documentation for mask. I do not understand the implications of max-gap-frequency and min-conservation, relative to the norms for this kind of analysis, well enough to want to play with them without context/guidance.
• “How long should I expect my QIIME2 jobs to run for?” The poster there had ~6x the number of sequences I have and got feedback that a few hours was likely all that they needed. From that, I guess I need to look at a different parameter to determine how hefty my data are?
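For context on the two mask parameters mentioned above, here is a toy column filter of the kind their names describe. This is a sketch under stated assumptions, not the q2-alignment implementation: the function names are hypothetical, the defaults of 1.0 and 0.4 are what I understand the plugin's documented defaults to be (check `qiime alignment mask --help`), and treating conservation as the share of all sequences that carry the most common non-gap character is my reading of the parameter description.

```python
from collections import Counter

GAP_CHARS = "-."


def column_mask(sequences, max_gap_frequency=1.0, min_conservation=0.4):
    """Return one boolean per alignment column; True means the column is kept."""
    n = len(sequences)
    keep = []
    for column in zip(*sequences):
        # Fraction of sequences that have a gap in this column.
        gap_frequency = sum(ch in GAP_CHARS for ch in column) / n
        # Assumption: conservation is the share of ALL sequences carrying
        # the most common non-gap character in this column.
        residues = Counter(ch for ch in column if ch not in GAP_CHARS)
        conservation = max(residues.values(), default=0) / n
        keep.append(
            gap_frequency <= max_gap_frequency
            and conservation >= min_conservation
        )
    return keep


def apply_mask(sequences, keep):
    """Drop every column whose mask entry is False."""
    return ["".join(ch for ch, k in zip(seq, keep) if k) for seq in sequences]


# Five toy aligned sequences, five columns each.
toy = ["AC-GT", "AC-GA", "AC-GC", "AG-TG", "A--C-"]
keep = column_mask(toy)
print(keep)
print(apply_mask(toy, keep))
```

In the toy data, the all-gap third column and the highly variable last column are dropped at the default settings, and lowering max_gap_frequency to 0.1 would additionally drop the second column because one of its five entries is a gap. Note that every column has to be tallied across every sequence, so the work grows with sequences × columns.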
Note: these data had all been previously analyzed in QIIME1 by someone else, so I do have some other information already (e.g. number of OTUs / sample). Is there anything I can use to get a better estimate of how much resource to throw at this thing?
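One quantity that may be a better gauge of "heftiness" than sequence count alone is the size of the alignment matrix: number of sequences times number of aligned columns. A minimal sketch for measuring it from the FASTA file that was imported (before it became a .qza); the function names are hypothetical, and the one-byte-per-cell figure is only a naive lower bound, since whatever in-memory representation the tool actually uses may be several times larger.

```python
import sys


def alignment_dimensions(fasta_path):
    """Count records and measure the aligned width of the first record."""
    n_sequences = 0
    first_width = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_sequences += 1
            elif n_sequences == 1:
                # Sequence lines may be wrapped, so sum them for record one.
                first_width += len(line.strip())
    return n_sequences, first_width


def naive_gigabytes(n_sequences, width, bytes_per_cell=1):
    """Lower-bound size of the alignment matrix in GB (10**9 bytes)."""
    return n_sequences * width * bytes_per_cell / 1e9


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python alignment_size.py <path to aligned FASTA>
    n, width = alignment_dimensions(sys.argv[1])
    print(f"{n} sequences x {width} columns")
    print(f"at least {naive_gigabytes(n, width):.1f} GB at 1 byte per cell")
```

As an illustration only: if the SINA output is on the order of 50,000 columns wide, which is what I understand SILVA-based alignments to be (measure your own file rather than trusting this number), 1.5 million sequences would give 75 GB at one byte per cell. That would put memory use in the 160 GB range within a small multiple of the raw matrix size, and would also explain why a dataset with more but much narrower sequences could finish far faster.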