@M_F @Nicholas_Bokulich,
You might find some other tricks using MAFFT in my recent post on building a COI database. This is similar to the suggestion here using mafft-add, except the memory-saving portion requires a particular parameter, --keeplength, to be included when adding the additional alignment sequences to your initial reference alignment. It works as follows:
- If you have any information about these reference sequences (say taxonomy, sequence length, or kmer composition), use that to your advantage when building a small “reference” alignment. That is, take a subset of your reference sequences based not on what sample they came from, but on their evolutionary relationship to each other. You want something in the range of 100 to 1000 sequences here, with as few gaps as possible. I’m going to assume you are able to subset your sequences into a file called subset_seqs.fasta. To start, create an alignment of just that small subset of sequences:
mafft --auto --thread -1 subset_seqs.fasta > reference_MSA
- The other thing you need to do is create a file of the remaining sequences that aren’t in that initial subset file. Let’s call those remaining_seqs.fasta. Once you’ve created that reference alignment file (reference_MSA), you’ll then align these remaining sequences to the reference MSA. The key memory saving kicks in with the --keeplength parameter:
export MAFFT_TMPDIR=$(pwd) ## change this path as needed to a directory with enough free disk space to complete the job
mafft --auto --addfull remaining_seqs.fasta --keeplength --thread -1 reference_MSA > complete_MSA
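For the step of building remaining_seqs.fasta, and for checking the final result, here is a minimal sketch of two shell helpers. Both function names are hypothetical, as is the input file name all_seqs.fasta used in the usage comments. The first helper assumes record IDs are unique and that a sequence is "in the subset" when the first word of its header matches a header in subset_seqs.fasta. The second relies on the fact that --keeplength leaves the alignment length unchanged, so reference_MSA and complete_MSA should both report the same single width.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every FASTA record in "$1" whose ID (the first
# word after ">") does NOT appear as a header in the subset file "$2".
remaining_seqs() {
    local all_fasta="$1" subset_fasta="$2"
    awk '
        # While reading the subset file, remember each record ID.
        FILENAME == ARGV[1] { if (/^>/) ids[substr($1, 2)] = 1; next }
        # In the full file, each header decides whether its record is kept.
        /^>/ { keep = !(substr($1, 2) in ids) }
        keep
    ' "$subset_fasta" "$all_fasta"
}

# Hypothetical helper: print the distinct sequence lengths in a FASTA
# alignment "$1". A valid alignment prints exactly one number.
alignment_widths() {
    awk '
        /^>/ { if (n) widths[len] = 1; len = 0; n++; next }
        { len += length($0) }
        END { if (n) widths[len] = 1; for (w in widths) print w }
    ' "$1" | sort -n
}

# Usage (file names are assumptions, adjust to your own):
#   remaining_seqs all_seqs.fasta subset_seqs.fasta > remaining_seqs.fasta
#   alignment_widths reference_MSA
#   alignment_widths complete_MSA
```

If the two alignment_widths calls print different numbers, or more than one number each, something went wrong before or during the --addfull step and is worth checking before using complete_MSA downstream.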
Hopefully this approach will save you some memory. I’m certainly not a mafft expert, but I’ve had great success in writing to the developer and getting very helpful feedback.
Good luck!