Estimating time and memory for masking a large dataset

I want to mask my very large (>1.5 million sequences) aligned_rep_seqsDNA.qza data. I performed the alignment with SINA 1.7.2 using SILVA as a reference database, and imported the FASTA output into QIIME2. The file is valid according to

qiime tools validate aligned_rep_seqsDNA.qza

I'm using QIIME2-2021.2 on my institution's HPC (miniconda3/4.8.5) with a SLURM scheduler. I have been attempting to mask the data so that I can go on to make a tree. Using the following:

qiime alignment mask \
--i-alignment aligned_rep_seqsDNA.qza \
--o-masked-alignment masked-aligned-rep-seqs.qza

has required a huge amount of computing power. I get out of memory errors when I’ve allocated less than 180GB of RAM. My most recent attempt was killed as it timed out after a week.

At this point I am unsure if:
• there’s an infinite loop somewhere in what I’m trying to do?
• I need to bite the bullet and request even more computing time? The SLURM scheduler automatically times jobs out after 48 hours unless a longer limit is explicitly requested.
• there’s another approach I should be considering?

Run with 180GB of RAM allocated for 2 days
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 31
CPU Utilized: 1-23:57:19
CPU Efficiency: 3.22% of 62-00:09:49 core-walltime
Job Wall-clock time: 2-00:00:19
Memory Utilized: 160.72 GB
Memory Efficiency: 89.29% of 180.00 GB

Run with 184GB of RAM allocated for 7 days
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 6-23:59:10
CPU Efficiency: 3.12% of 224-00:11:12 core-walltime
Job Wall-clock time: 7-00:00:21
Memory Utilized: 160.72 GB
Memory Efficiency: 87.35% of 184.00 GB

I’ve successfully completed the masking on a subset of 50 lines from the file, which took 17 seconds and essentially no memory. I’m currently running it on 200,000 lines to make sure that can be accomplished. (I forgot to increase the memory request, though, and it ran for 12 hours before exceeding the 5GB I had assigned.)

I’ve taken a look at:
• @devonorourke ’s response in "phylogenetic analysis - #3 by devonorourke", which recommends the "Building a COI database from BOLD references" tutorial for MAFFT tricks. I realize I’ve already bypassed MAFFT for alignment purposes, but since masking is usually packaged with it, I figured I’d check. The database-building tutorial is sufficiently over my head that I can’t tell whether it’s even relevant to what I’m trying to do.
• The documentation for mask. I do not understand the implications of max-gap-frequency and min-conservation with respect to the norms of these analyses well enough to want to play with them without context/guidance.
• The thread "How long should I expect my QIIME2 jobs to run for?". The poster there had ~6x the number of sequences I have and got feedback that a few hours was likely all they needed. From that I guess I need to look at a different parameter to determine how hefty my data are?

Note: these data had all been previously analyzed in QIIME1 by someone else, so I do have some other information already (e.g. number of OTUs / sample). Is there anything I can use to get a better estimate of how much resource to throw at this thing?

Hello @BeckWehrle,

There's a bunch of things involved with optimizing for HPC, and I'm not an expert. However, I noticed a few clues that might be helpful:

:female_detective: :man_technologist:

Run with 180GB of RAM allocated for 2 days
Cores per node: 31
CPU Efficiency: 3.22% of 62-00:09:49 core-walltime
Memory Utilized: 160.72 GB
Memory Efficiency: 89.29% of 180.00 GB

You are not running out of memory, with about 19 GB left. You are using 1 of the 31 cores assigned (1/31 ≈ 3.2%).

Run with 184GB of RAM allocated for 7 days
Cores per node: 32
CPU Efficiency: 3.12% of 224-00:11:12 core-walltime
Memory Utilized: 160.72 GB
Memory Efficiency: 87.35% of 184.00 GB

You are not running out of memory, with about 23 GB left. You are using 1 of the 32 cores assigned (1/32 ≈ 3.1%).

Looks like you have enough memory, but are only using one core!

Check the SLURM settings to make sure the single Qiime process can use all 32 threads (and that SLURM isn't expecting 32 copies of Qiime to use 1 thread each!)
(This is a real setting. I've never wanted it, but I've got it on occasion :roll_eyes: )
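The distinction here maps onto two standard sbatch directives. A minimal sketch (the directive names are real SLURM options; the layout and values are illustrative, not site-specific advice):

```shell
#!/bin/bash
# One task that may use 32 CPUs: what a single multithreaded
# QIIME process wants.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# The other layout -- 32 tasks with 1 CPU each -- only helps when you
# actually launch 32 independent processes (e.g. one per srun step):
#   #SBATCH --ntasks=32
#   #SBATCH --cpus-per-task=1
```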

Thank you for your detailed search of the forum before asking this question! :smile_cat:

I'm happy to help with your other questions, but let's get these threads sorted first!

2 Likes

This is absolutely excellent feedback (and perfect timing) since I am meeting with our system admin tomorrow! I had no idea that's what CPU efficiency meant and I now have a better sense of what I need to talk to them about. I'll post an update once I learn more.

1 Like

Perfect! They will have much better advice on how to use their system!

EDIT: Thanks for sharing what you learn to help future users! :qiime2:

Late to the party here, and bringing forth questions, not answers. Sorry!

Question: is it possible to parallelize qiime alignment mask? If the mask operation examines each column independently, you might be able to speed up the run time by running it on N chunks of the input --i-alignment artifact.

Say, if each seq in your original alignment file was 1000 characters long, you could split the file into 10 segments, such that each sub-alignment file was only 100 characters long:

  • file1=cols1-100 (of original alignment),
  • file2=cols101-200, ...
  • file10=cols901-1000.

Then you would just run the qiime alignment mask on each of these sub-alignment files:

qiime alignment mask --i-alignment subalignment1.qza --o-masked-alignment subalignment_maskedout1.qza
qiime alignment mask --i-alignment subalignment2.qza --o-masked-alignment subalignment_maskedout2.qza
...
qiime alignment mask --i-alignment subalignment10.qza --o-masked-alignment subalignment_maskedout10.qza

:warning: Admittedly, I'm ignoring the steps of (1) how you would split the alignment up by column in the first place, and (2) how you would join the columns back together afterwards! Nevertheless, I don't think these operations are intractable, though they're perhaps awkward once saved as a .qza object. In other words, you might need to export the file before splitting it up, but perhaps others here have their own recommendations. As usual, this forum is filled with text-reformatting magicians :woman_mage:
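For what it's worth, the split/rejoin steps don't look intractable at the plain-FASTA level. Here's a rough sketch (toy data and file names are mine, and it assumes each sequence sits on a single unwrapped line): headers pass through untouched, sequence lines get cut to a column range, and paste glues the chunks back together afterwards.

```shell
# Toy aligned FASTA: 2 sequences, 10 columns each, one line per sequence
printf '>seq1\nACGTACGTAC\n>seq2\nACG-ACG-AC\n' > aligned.fasta

width=5
for i in 1 2; do
  start=$(( (i - 1) * width + 1 ))
  # Headers (lines starting with '>') pass through unchanged;
  # sequence lines are trimmed to this chunk's column range.
  awk -v s="$start" -v w="$width" \
    '/^>/ { print; next } { print substr($0, s, w) }' \
    aligned.fasta > "chunk_${i}.fasta"
done

# Rejoin column-wise: paste concatenates line N of every chunk, then
# awk collapses the duplicated headers (">seq1>seq1" -> ">seq1";
# this assumes headers contain no extra '>' characters).
paste -d '' chunk_1.fasta chunk_2.fasta \
  | awk '/^>/ { sub(/>.*>/, ">") } { print }' > rejoined.fasta
```

With no masking in between, the rejoined file is byte-identical to the input, which makes the round trip easy to sanity-check.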

If it's the case that the alignment script reads the file into memory in its entirety, splitting the original file into chunks should help on both the runtime and the memory side, since each job only has to compute over, and read in, 10% of the original file.

I doubt you'll need any big resources from your HPC: if you observed ~160 GB of RAM needed for the whole file, then hopefully as you split jobs into smaller chunks, these resource needs will scale down accordingly. My guess is that the number of CPUs and the amount of RAM are actually not the issue - it's the runtime, and as @colinbrislawn observed, you threw 32 cores at the job and only 1 was being used.

Of course, this recommendation/question is all useless if you need to run qiime alignment mask on the full file :man_shrugging: !?

Good luck!

2 Likes

I took a look at the code for qiime alignment mask, available here:

Based on my quick read of this, the filter is per-column in the alignment, so yeah, that should totally work!

The implementation is mostly based on skbio.TabularMSA, which includes some other functions you might find helpful. For example, you could use msa.loc[..., 0:100] to get the first 100 columns (alignment positions) of your MSA object, then run the masking function on each batch of columns.

I'm not sure how best to scale this, but I wanted to point you to some of the code to get you started.
:man_technologist: :man_technologist:

Colin

Here's where I've gotten so far using the first 5 sequences from my larger file. My aligned sequences are 50,001 characters long. I'm starting with small_aligned_seqsDNA.fasta and have broken it into 10 files of 5000 characters (#10 is 5001) via:

cut -c 1-5000 small_aligned_seqsDNA.fasta > small_aligned_1.fasta
cut -c 5001-10000 small_aligned_seqsDNA.fasta > small_aligned_2.fasta
...
cut -c 45001-50001 small_aligned_seqsDNA.fasta > small_aligned_10.fasta

However, this eats the metadata/names for every file after the first (the header lines are shorter than the cut range, so cut leaves an empty row before each sequence). QIIME import didn't like that, so I copied every odd row (the headers) from the first file to get the metadata in order, using code that my CS-literate partner found online:
awk 'NR%2==1{print $0}' < small_aligned_1.fasta > names.fasta

then copied the even rows to get the sequences (in this example from small_aligned_2.fasta, characters 5001-10000), again with code he found for me online:
awk 'NR%2==0{print $0}' < small_aligned_2.fasta > small_aligned_2seqs.fasta

and put them back together, alternating the names then the sequences using this online found code:
awk '{print; getline < "small_aligned_2seqs.fasta"; print}' names.fasta >> small_aligned_2.named.fasta

(Annoyingly, this step would put all the metadata at the top, then the sequences the first time I ran it for a file. Upon deletion of the small_aligned_2.named.fasta file, I would run it again and get the alternating names/ sequences/ names/ sequences I was trying for. This is likely a personal problem.)
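(A simpler route for the interleaving step, with toy stand-in files for names.fasta and small_aligned_2seqs.fasta: paste can merge the two files line-by-line in one go. Using > rather than >> also avoids stale output from an earlier run piling up, which might explain the odd first-run behavior.)

```shell
# Stand-in files: 2 header lines and 2 sequence chunks
printf '>seq1\n>seq2\n' > names.fasta
printf 'AAAAA\nCCCCC\n' > small_aligned_2seqs.fasta

# paste -d '\n' joins line 1 of each file with a newline, then line 2,
# etc., producing name, sequence, name, sequence...
# Note '>' (overwrite), not '>>' (append).
paste -d '\n' names.fasta small_aligned_2seqs.fasta > small_aligned_2.named.fasta
```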

I've imported the 10% width file into QIIME2:

    qiime tools import \
    --input-path small_aligned_2.named.fasta \
    --output-path small_aligned_2.qza \
    --input-format AlignedDNAFASTAFormat \
    --type 'FeatureData[AlignedSequence]'

for each of the 10 files.

I ran:

qiime alignment mask \
--i-alignment small_aligned_2.qza \
--o-masked-alignment small_masked_2.qza

for all 10 files and was successful for the middle sequences. Files 1-2 and 7-10 generated:
Plugin error from alignment:

No alignment positions remain after filtering. The filter thresholds will need to be relaxed. 100.00% of positions were retained by the gap filter, and 0.00% of positions were retained by the conservation filter.
This seems very reasonable to me when only using five sequences. In trying to put this small subset back together, I'm only going to use files 3-6, which had data to mask. I assume this issue won't arise when I'm using the full ~1.5 million reads.

Concurrently, I've been running qiime alignment mask on the first 10% of the full file. I got out of memory errors at 10GB memory and below, so I allocated it 50GB and it's been going for the past 8 hours.

qiime alignment mask \
--verbose \
--i-alignment aligned_1.qza \
--o-masked-alignment masked_1.qza

I'm going to see how this subset of the full file goes.

1 Like

@colinbrislawn I took a look at the skbio.TabularMSA link and scikit-bio and that is significantly over my head :grimacing: Is this a tool I definitely want to know right now, or okay to bookmark for later when I'm understanding things better?

Hi Everyone,

I ran into a similar problem when I was building a tree for the SILVA reference database. That is, qiime alignment mask was too slow to be useful for the roughly 430k sequences in the SILVA db.

My solution was to roll my own mask method. I made it available in this gist.

I am not in a position to lift that gist up to production-level code right now, but if someone else is, please do. I might get time to look at it a bit later in the year, otherwise.

Sorry @BeckWehrle if you're under time pressure and my response is frustrating.

Ben

5 Likes

Well summarized. I was worried that the core challenge of this problem would only appear at scale, and it looks like that's the case.

Heck, I'm not even sure if alignment is an appropriate method for your data. How do you interpret an alignment that is 90% gaps? Is your MSA program designed or parameterized to align such wildly divergent sequences? We are talkin' :apple: to :orangutan: here...

Yikes!

Check out Ben's method, linked above, but also consider your underlying biological question.

What are you trying to solve with this massive, gappy, MSA? :thinking:
Maybe there's a better solution! :nerd_face:

1 Like

TL;DR: I'm heading back to square one and asking my collaborators for the FASTQ files and any other associated information instead of the .fna file they gave me.

also consider your underlying biological question.

I've been quiet as I have been thinking this part out. You're absolutely right, 50k characters for this alignment doesn't make any sense. These are 16S sequences, and I'd totally been ignoring that the whole point of amplicon sequencing in this context is that this region should be conserved.

I'd barreled through without thinking about my MSA, as I'd been having trouble with alignment (see my earlier thread "Having trouble with alignment: mafft, SEPP, & SINA"), and so took getting a valid file as success. There were a couple of other troubleshooting bouts too, including subsampling the rep-seqs and taking that subset through to a rooted phylogenetic tree (which was not happy to play with filtered tables from the full rep-seqs while I was trying to do alpha rarefaction).

After taking a look at my files and code with a microbiome person on my campus, she noticed that my rep-seqs file is really not particularly de-replicated, despite my code looking as expected. However, we couldn't go back much farther because the data I was given access to was already demultiplexed, filtered of chimeras, and the paired-end reads were merged. I don't know what other processes happened before I got the file. I have some data sleuthing to do.

Thank you for all of your help! I've certainly gotten a lot out of understanding what I'm doing from this exercise!

1 Like

Hi @BeckWehrle ,
I don't have a solution to offer, but maybe a clue for you to chase down :mag:

by any chance have you checked whether the reads are all in the same orientation? Some library prep methods — and particularly older data — result in reads that are in mixed orientations (i.e., the Illumina forward and reverse adapters are ligated to random ends, resulting in reads that are in both directions).

Attempting a MSA on mixed-orientation reads could result in very poor alignment if the reads are not re-oriented. It looks like mafft does not adjust the read direction by default (there is an optional flag but I do not think that this is used by q2-alignment).

I am guessing that this is maybe what you mean by this:

i.e., you have replicate sequences because the same sequence appears in each orientation?

The plugin RESCRIPt has a method for re-orienting sequences against a database (e.g., SILVA). Or you could try aligning with mafft outside of QIIME 2, or maybe SEPP can handle mixed orientation reads? In any case, I recommend inspecting for this before trying one of these solutions, as it could just lead to more time/complication if this is not the cause.
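If re-orientation does turn out to be the issue, the RESCRIPt step would look something like this sketch (the file names are placeholders, and the exact option names should be checked against qiime rescript orient-seqs --help for your version):

```shell
qiime rescript orient-seqs \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences silva-ref-seqs.qza \
  --o-oriented-seqs rep-seqs-oriented.qza \
  --o-unmatched-seqs rep-seqs-unmatched.qza
```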

I am not really sure of this diagnosis because I would think that surely someone else would have reported such an issue in the past, since mixed orientation reads are not all that uncommon. It might just be the size of your job together with mixed orientations leading to resource constraints.

2 Likes