phylogenetic analysis

Dear all,
Since there is not enough memory available on my system to perform phylogenetic analysis for all samples (n = 100) at once, I ran the following commands for each sample separately:

qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza

#mask (or filter) the alignment to remove positions that are highly variable. These positions are generally considered to add noise to a resulting phylogenetic tree.
qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza

#create the tree using the Fasttree program
qiime phylogeny fasttree \
  --i-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza

#root the tree at the midpoint of the longest tip-to-tip distance
qiime phylogeny midpoint-root \
  --i-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

I generated a rooted-tree.qza file for each sample. How can I merge those files?

Thanks

Hi @M_F,
Unfortunately, you cannot merge the trees. So running these steps for each individual sample and then merging the trees is not an option. You could try iteratively adding new sequences to the alignment (see qiime alignment mafft-add) but many of these will be redundant between samples, so splitting your rep-seqs into chunks would be more effective than splitting by sample and making separate alignments.

Filtering out low-abundance sequences may be a better way to reduce memory requirements. You could also perform taxonomy classification first and then filter out the unclassified sequences (which are often junk/non-target). Reducing the number of rare variants and junk (many of which are probably not biologically important/interesting, depending on your experimental question) will vastly reduce the number of sequences, and with it the memory needed for multiple sequence alignment.
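
As a rough sketch of the abundance filter, assuming your feature table is called table.qza: the function name, the output file names, and both thresholds below are placeholders I made up for illustration, so adjust them to your data and question.

```shell
# Sketch only: table.qza, the output names, and the thresholds are
# placeholders, not recommendations.
filter_rare_features() {
  # Keep features observed at least 10 times in total and in at least 2 samples.
  qiime feature-table filter-features \
    --i-table table.qza \
    --p-min-frequency 10 \
    --p-min-samples 2 \
    --o-filtered-table filtered-table.qza &&
  # Keep only the representative sequences still present in the filtered table.
  qiime feature-table filter-seqs \
    --i-data rep-seqs.qza \
    --i-table filtered-table.qza \
    --o-filtered-data filtered-rep-seqs.qza
}
```

Calling filter_rare_features in an activated QIIME 2 environment should write filtered-rep-seqs.qza, which can then be passed to qiime alignment mafft in place of rep-seqs.qza.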

Either that or maybe you could temporarily access a more powerful computer to perform the alignment? 8-16GB RAM (or less!) is usually enough for most amplicon sequence alignments.

Good luck!

@M_F @Nicholas_Bokulich,
You might find some other tricks using MAFFT in my recent post on building a COI database. This is similar to the suggestion here using mafft-add, except the memory-saving portion requires a particular parameter, --keeplength, to be included when adding the additional alignment sequences to your initial reference alignment. It works as follows:

  1. If you have any information about these reference sequences (say taxonomy, or sequence length, or kmer composition), use that to your advantage when building a small “reference” alignment. That is, take a subset of your reference sequences not based on what sample they came from, but their evolutionary relationship to each other. You want something in the range of 100 - 1000 sequences here, with as few gaps as possible. I’m going to assume you are able to subset your sequences into a file called subset_seqs.fasta.

    To start, create an alignment of just that small subset of sequences:

mafft --auto --thread -1 subset_seqs.fasta > reference_MSA
  2. The other thing you need to do is create a file of the remaining sequences that aren’t in that initial subset file. Let’s call those remaining_seqs.fasta. Once you’ve created the reference alignment file (reference_MSA), you’ll align these remaining sequences to the reference MSA. The key memory save kicks in with the --keeplength parameter:
export MAFFT_TMPDIR=$(pwd)   ## change this path as needed to a directory that accepts as much disk space as you need to complete the job
mafft --auto --addfull remaining_seqs.fasta --keeplength --thread -1 reference_MSA > complete_MSA
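
One way to build remaining_seqs.fasta is to drop every record whose ID already appears in the subset file. The helper below is a minimal sketch, not part of MAFFT or QIIME 2: the function name fasta_exclude and the input name all_seqs.fasta (a FASTA of all your representative sequences) are my own assumptions, and it treats the first whitespace-delimited token after ">" as the sequence ID.

```shell
# fasta_exclude SUBSET ALL: print each record of ALL whose ID is not in SUBSET.
# Assumes SUBSET is non-empty and IDs are the first token after ">".
fasta_exclude() {
  awk '
    # First file: remember the ID of every sequence already in the subset.
    NR == FNR { if (/^>/) { id = $1; sub(/^>/, "", id); in_subset[id] = 1 } next }
    # Second file: at each header, decide whether to keep the whole record.
    /^>/ { id = $1; sub(/^>/, "", id); keep = !(id in in_subset) }
    # Print header and sequence lines of records that are being kept.
    keep
  ' "$1" "$2"
}

# Example with the file names used in this thread (all_seqs.fasta is assumed):
# fasta_exclude subset_seqs.fasta all_seqs.fasta > remaining_seqs.fasta
```

Multi-line records are handled because the keep flag stays set until the next header line.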

Hopefully this approach will save you some memory. I’m certainly not a mafft expert, but I’ve had great success in writing to the developer and getting very helpful feedback.

Good luck!
