phylogenetic analysis

devonorourke · August 25, 2020, 1:24pm

@M_F @Nicholas_Bokulich,
You might find some other tricks using MAFFT in my recent post on building a COI database. This is similar to the suggestion here using mafft-add, except that the memory saving depends on including a particular parameter, --keeplength, when adding the remaining sequences to your initial reference alignment. It works as follows:

If you have any information about these reference sequences (say taxonomy, or sequence length, or kmer composition), use that to your advantage when building a small "reference" alignment. That is, take a subset of your reference sequences not based on what sample they came from, but their evolutionary relationship to each other. You want something in the range of 100 - 1000 sequences here, with as few gaps as possible. I'm going to assume you are able to subset your sequences into a file called subset_seqs.fasta.
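In case it helps, here is a minimal sketch of one way to do that split. It assumes two hypothetical inputs that aren't part of my original workflow: all_seqs.fasta (every reference sequence) and subset_ids.txt (one sequence ID per line, picked however you like, e.g. by taxonomy). The function name split_fasta is just for illustration. It writes subset_seqs.fasta plus a second file, remaining_seqs.fasta, holding everything else.

```shell
# Split a FASTA file into subset_seqs.fasta (IDs listed in the ID file)
# and remaining_seqs.fasta (everything else).
# Assumptions: the ID is the first whitespace-delimited word after ">",
# and the ID file is not empty.
split_fasta() {
  all_fasta=$1   # hypothetical input, e.g. all_seqs.fasta
  id_list=$2     # hypothetical input, e.g. subset_ids.txt
  awk -v keep="subset_seqs.fasta" -v rest="remaining_seqs.fasta" '
    NR == FNR { ids[$1] = 1; next }     # first file: load the chosen IDs
    /^>/ {                              # header line: choose the output file
      id = substr($1, 2)
      out = (id in ids) ? keep : rest
    }
    { print > out }                     # sequence lines follow their header
  ' "$id_list" "$all_fasta"
}
```

You would call it as `split_fasta all_seqs.fasta subset_ids.txt` from the directory where you want the two output files written.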

To start, create an alignment of just that small subset of sequences:

mafft --auto --thread -1 subset_seqs.fasta > reference_MSA

The other thing you need to do is create a file of the remaining sequences that aren't in that initial subset file. Let's call those remaining_seqs.fasta. Once you've created the reference alignment file (reference_MSA), you'll align these remaining sequences to it. The key memory saving kicks in with the --keeplength parameter:

export MAFFT_TMPDIR=$(pwd)   ## change this path as needed to a directory with enough free disk space to complete the job
mafft --auto --addfull remaining_seqs.fasta --keeplength --thread -1 reference_MSA > complete_MSA
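Because --keeplength leaves the reference alignment's length unchanged (insertions in the added sequences are dropped), a quick sanity check is to confirm that both files have the same number of alignment columns. Here is a rough sketch of such a check; the helper name aln_lengths is my own invention for illustration, not something that ships with MAFFT.

```shell
# Print each distinct aligned-sequence length found in a FASTA alignment.
# A well-formed alignment should produce exactly one number.
aln_lengths() {
  awk '
    /^>/ { if (seen) print len; len = 0; seen = 1; next }  # new record: emit the previous length
    { len += length($0) }                                  # add up wrapped sequence lines
    END { if (seen) print len }                            # emit the last record
  ' "$1" | sort -n | uniq
}
```

Running `aln_lengths reference_MSA` and `aln_lengths complete_MSA` should each print a single number, and the two numbers should match. If they don't, I'd first check that --keeplength was actually passed.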

Hopefully this approach will save you some memory. I'm certainly not a mafft expert, but I've had great success in writing to the developer and getting very helpful feedback.

Good luck!