Phylogenetic tree with noisy reads

Hi,
I am trying to build a phylogenetic tree using Nanopore reads. Currently I am using qiime phylogeny align-to-tree-mafft-fasttree command; however, since I did not perform any clustering, I am aligning all sequences to each other, and this probably causes my job to run out of memory.
In fact, when using only 1000 reads, I am able to obtain the desired phylogenetic tree.
In particular, this is the command I am using:

qiime phylogeny align-to-tree-mafft-fasttree
–i-sequences rep-seqs.qza
–o-alignment aligned-rep-seqs.qza
–o-masked-alignment masked-aligned-rep-seqs.qza
–o-tree unrooted-tree.qza
–o-rooted-tree rooted-tree.qza
–p-n-threads $threads

and this is the error I get when using the whole dataset:

/home/simone/miniconda3/envs/MetONTIIME_env/bin/mafft: line 2440: 21008 Killed “$prefix/disttbfast” -q $npickup -E $cycledisttbfast -V “-”$gopdist -s $unalignlevel $legacygapopt $mergearg -W $tuplesize $termgapopt $outnum $addarg $add2ndhalfarg -C $numthreadstb $memopt $weightopt $treeinopt $treeoutopt $distoutopt $seqtype $model -f “-”$gop -Q $spfactor -h $aof $param_fft $algopt $treealg $scoreoutarg < infile > pre 2>> “$progressfile”
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 24 /tmp/qiime2-archive-akns4pmq/9f1e0e1e-83a5-4908-8020-fe8f8ee29c69/data/dna-sequences.fasta

Traceback (most recent call last):
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2cli/commands.py”, line 327, in call
results = action(**arguments)
File “</home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/decorator.py:decorator-gen-226>”, line 2, in align_to_tree_mafft_fasttree
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py”, line 240, in bound_callable
output_types, provenance)
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py”, line 477, in callable_executor
outputs = self._callable(scope.ctx, **view_args)
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2_phylogeny/_align_to_tree_mafft_fasttree.py”, line 18, in align_to_tree_mafft_fasttree
aligned_seq, = mafft(sequences=sequences, n_threads=n_threads)
File “</home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/decorator.py:decorator-gen-481>”, line 2, in mafft
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py”, line 240, in bound_callable
output_types, provenance)
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py”, line 383, in callable_executor
output_views = self._callable(**view_args)
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2_alignment/_mafft.py”, line 85, in mafft
run_command(cmd, aligned_fp)
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2_alignment/_mafft.py”, line 27, in run_command
subprocess.run(cmd, stdout=output_f, check=True)
File “/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/subprocess.py”, line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command ‘[‘mafft’, ‘–preservecase’, ‘–inputorder’, ‘–thread’, ‘24’, ‘/tmp/qiime2-archive-akns4pmq/9f1e0e1e-83a5-4908-8020-fe8f8ee29c69/data/dna-sequences.fasta’]’ returned non-zero exit status 1.

What is the approach you would suggest? I was thinking about these possibilities:

  • Subsample some reads from rep-seqs.qza and build the tree aligning them
  • Subsample a lower number of reads from each sample (same number for each sample), reimport them and use them for building the tree
  • Use an alternative method (if any exist) that might reduce the memory usage.
  • Retrieve from the database the reference sequences to whom at least x reads have been assigned, and build the tree using those ‘error-free’ representative sequences

Thanks in advance,
Simone

Hey there @MaestSi! How many threads are you specifying in this script? You can trade in some speed for more RAM.

I would suggest looking into this, there are a few tree-building options in q2-phylogeny:

https://docs.qiime2.org/2019.7/plugins/available/phylogeny/

There is a good community tutorial, here:

Hi @thermokarst

I have been trying with 24 out of 32 but it failed after a few days; now I am trying with 1 single thread, let’s see how it goes…

I already had a look at that; the problem is that, as far as I understand, there seem to be many options for building a tree, but I get stuck at the first step of the qiime phylogeny align-to-tree-mafft-fasttree pipeline, which is the sequence alignment. I think this is something quite different to what is usually done with Illumina after OTU/ASV picking, since in this case the number of sequences to be aligned corresponds to the full dataset, so the tutorial might not fit perfectly this specific case.

That isn’t quite correct, this pipeline does alignment, masking, and tree building (that is why it is a Pipeline, instead of 4 separate Methods). Running one of the other workflows I linked to above is exclusive to running this Pipeline.

Yeah, this is going to be the problem here, with more threads comes the need for more memory, I bet stepping that down a bit (say, 8 or less?) will strike a nice balance between speed and memory.

1 Like

I think the step you mention are not run in parallel, but in a consequential way. And since the error I get is:

/home/simone/miniconda3/envs/MetONTIIME_env/bin/mafft: line 2440: 21008 Killed

my guess is that it was the alignment that was failing. However, doing a subsampling of the reads before running the pipeline made it feasible to complete the job. Since in this case the representative sequences are all the sequences, I guess the number of representative sequences doesn’t have per se a biological meaning, and so it makes perfect sense to me so subsample the same number of representative sequences from each sample and to align them. Thank you for your advice about reducing the number of threads too!