Phylogenetic tree with noisy reads

MaestSi · September 9, 2019, 4:01pm

Hi,
I am trying to build a phylogenetic tree using Nanopore reads. Currently I am using qiime phylogeny align-to-tree-mafft-fasttree command; however, since I did not perform any clustering, I am aligning all sequences to each other, and this probably causes my job to run out of memory.
In fact, when using only 1000 reads, I am able to obtain the desired phylogenetic tree.
In particular, this is the command I am using:

qiime phylogeny align-to-tree-mafft-fasttree
--i-sequences rep-seqs.qza
--o-alignment aligned-rep-seqs.qza
--o-masked-alignment masked-aligned-rep-seqs.qza
--o-tree unrooted-tree.qza
--o-rooted-tree rooted-tree.qza
--p-n-threads $threads

and this is the error I get when using the whole dataset:

/home/simone/miniconda3/envs/MetONTIIME_env/bin/mafft: line 2440: 21008 Killed "$prefix/disttbfast" -q $npickup -E $cycledisttbfast -V "-"$gopdist -s $unalignlevel $legacygapopt $mergearg -W $tuplesize $termgapopt $outnum $addarg $add2ndhalfarg -C $numthreadstb $memopt $weightopt $treeinopt $treeoutopt $distoutopt $seqtype $model -f "-"$gop -Q $spfactor -h $aof $param_fft $algopt $treealg $scoreoutarg < infile > pre 2>> "$progressfile"
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: mafft --preservecase --inputorder --thread 24 /tmp/qiime2-archive-akns4pmq/9f1e0e1e-83a5-4908-8020-fe8f8ee29c69/data/dna-sequences.fasta

Traceback (most recent call last):
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2cli/commands.py", line 327, in call
results = action(**arguments)
File "</home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/decorator.py:decorator-gen-226>", line 2, in align_to_tree_mafft_fasttree
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
output_types, provenance)
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py", line 477, in callable_executor
outputs = self._callable(scope.ctx, **view_args)
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2_phylogeny/_align_to_tree_mafft_fasttree.py", line 18, in align_to_tree_mafft_fasttree
aligned_seq, = mafft(sequences=sequences, n_threads=n_threads)
File "</home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/decorator.py:decorator-gen-481>", line 2, in mafft
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
output_types, provenance)
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in callable_executor
output_views = self._callable(**view_args)
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 85, in mafft
run_command(cmd, aligned_fp)
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 27, in run_command
subprocess.run(cmd, stdout=output_f, check=True)
File "/home/simone/miniconda3/envs/MetONTIIME_env/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['mafft', '--preservecase', '--inputorder', '--thread', '24', '/tmp/qiime2-archive-akns4pmq/9f1e0e1e-83a5-4908-8020-fe8f8ee29c69/data/dna-sequences.fasta']' returned non-zero exit status 1.

What is the approach you would suggest? I was thinking about these possibilities:

Subsample some reads from rep-seqs.qza and build the tree aligning them
Subsample a lower number of reads from each sample (same number for each sample), reimport them and use them for building the tree
Use an alternative method (if any exist) that might reduce the memory usage.
Retrieve from the database the reference sequences to whom at least x reads have been assigned, and build the tree using those 'error-free' representative sequences

Thanks in advance,
Simone

thermokarst · September 10, 2019, 9:42pm

Hey there @MaestSi! How many threads are you specifying in this script? You can trade in some speed for more RAM.

I would suggest looking into this, there are a few tree-building options in q2-phylogeny:

There is a good community tutorial, here:

MaestSi · September 11, 2019, 8:07am

Hi @thermokarst

I have been trying with 24 out of 32 but it failed after a few days; now I am trying with 1 single thread, let's see how it goes...

I already had a look at that; the problem is that, as far as I understand, there seem to be many options for building a tree, but I get stuck at the first step of the qiime phylogeny align-to-tree-mafft-fasttree pipeline, which is the sequence alignment. I think this is something quite different to what is usually done with Illumina after OTU/ASV picking, since in this case the number of sequences to be aligned corresponds to the full dataset, so the tutorial might not fit perfectly this specific case.

thermokarst · September 14, 2019, 3:32pm

That isn't quite correct, this pipeline does alignment, masking, and tree building (that is why it is a Pipeline, instead of 4 separate Methods). Running one of the other workflows I linked to above is exclusive to running this Pipeline.

Yeah, this is going to be the problem here, with more threads comes the need for more memory, I bet stepping that down a bit (say, 8 or less?) will strike a nice balance between speed and memory.

MaestSi · September 15, 2019, 8:12am

I think the step you mention are not run in parallel, but in a consequential way. And since the error I get is:

/home/simone/miniconda3/envs/MetONTIIME_env/bin/mafft: line 2440: 21008 Killed

my guess is that it was the alignment that was failing. However, doing a subsampling of the reads before running the pipeline made it feasible to complete the job. Since in this case the representative sequences are all the sequences, I guess the number of representative sequences doesn't have per se a biological meaning, and so it makes perfect sense to me so subsample the same number of representative sequences from each sample and to align them. Thank you for your advice about reducing the number of threads too!

Alxdu · March 5, 2020, 8:59pm

@MaestSi Did this method ever worked for you. I would like to hear your opinion on using core phylogenetic pipelines in qiime with nanopore long-reads.
I have a dataset that I've imported at the OTU feature/taxonomy tables for non-phylogenetic methods, but I wonder if I should work in qiime with sequence data from the ground up.

MaestSi · March 6, 2020, 7:09am

Hi @Alxdu,

while for doing taxonomy assignment I usually align each read one by one to a database, I think that for doing some phylogenetic diversity analyses it might be better to perform some clustering (e.g. at 80% threshold) in order to pick representative sequences. I implemented this in the Evaluate_diversity.sh script, but I have not tested it thoroughly. I just verified that introducing the clustering to pick representative sequences drastically reduces running time, and that the results looked like somehow meaningful at least when comparing very different samples. Let me know if you check it out with your samples!
Simone

Alxdu · March 6, 2020, 2:29pm

Thanks for the feedback. Clustering could work decently with increasing read length. I will have to give it a try.
For non-phylogenetic methods, I was perfectly happy to do taxonomic classifications in centrifuge and then import results the OTU tables into qiime. But then I'd be missing out on unifrac, which would be nice to have.

MaestSi · March 6, 2020, 4:30pm

I never tried it, but it seems interesting too! Let me know if you check out the clustering approach!
Simone