Suggestions to speed up phylogenetic tree construction?

aabrams · March 29, 2024, 1:44pm

Hello,

I have a data set of 110 samples of paired-end 16S amplicon reads sequenced with MiSeq. I am doing this locally on my laptop, so I divided the samples into subgroups and did the processing, alignment, and merging on each (I had a QC report of the complete dataset to make trimming decisions). That worked well and I merged the resulting feature tables. I am trying to do phylogenetic tree construction using the qiime phylogeny align-to-tree-mafft-fasttree pipeline. I know tree construction can be time consuming, especially since I am doing it on a laptop with only 8 CPUs, but it has been running for almost 3 days. I am not sure if it is actually working or just stuck at this point. I am looking for alternatives in case I am unable to run it locally.

There were 77,126 unique features according to the taxonomy table. I forgot to generate the .qzv file for the seq-reps.qza file prior to starting the tree construction so I do not have additional information on that file.

Looking through the forum and qiime2 tutorials, I came across the information on Galaxy and thought that might be a good solution. I attempted to use phylogeny align-to-tree-mafft-fasttree in the Qiime2 tool suite. My seq-rep.qza file uploaded successfully and it started running but after a few minutes I received the following warning:

Unexpected error loading arguments in q2galaxy: /mnt/efs/fs1/cancer-usegalaxy-shared/database/datasets/000/122/dataset_122480.dat was created by 'QIIME 2024.2.0'. The currently installed framework cannot interpret archive version '6'.

I have put in a help request with Galaxy so I will see what I hear back from them (unless anyone on here knows what the issue is!). I have seen different options for tree construction that speed up the process. I am hesitant to use these since they all seem to have limitations or cautions about the quality of the resulting tree. I feel I am still too new to this to be a good judge of where I could reduce the amount of input without compromising the quality of the output. I would rather it take a little longer but have more confidence in my output. Any suggestions on this situation? Is Galaxy an option I should keep trying, or is that likely a dead end? Am I incorrect in thinking I could run a data set of this size on my laptop, even if some of the steps take a long time?

I appreciate information or suggestions!

SoilRotifer · March 29, 2024, 2:21pm

Hi @aabrams

Can you provide more details on how you generated your sequence features? I ask because 77,126 features sounds like an awful lot, even for 110 samples. I assume many are very low count or are singletons?

Also, what region of the 16S rRNA gene are you using?

Can you provide examples / links of those options here? We can let you know if those are anything to be concerned about.

This is typical when trying to generate phylogenies for very large data sets. Back in the old days, it'd take weeks to generate a tree based on a dataset 1/20th the size!

When performing de novo phylogenies using FastTree, RAxML, IQ-TREE, etc., you are often limited in the number of CPUs actually used. The parallelization tends to benefit longer sequence data, not these short amplicons. That is, these phylogenetic inference tools break up long alignments into manageable chunks and send those chunks off to different CPUs.

With short amplicon reads, these tools will often only use 2-4 CPUs, regardless of how many CPUs you tell them to use. That is, using more CPUs is of no benefit and can actually slow things down, as the CPUs spend more time talking to each other than doing the computation they should.

One approach you can try is fragment insertion, which is already part of QIIME 2. It is more parallelizable than the de novo approaches. I'd suggest running it on a machine with more RAM though.

timanix · March 29, 2024, 4:08pm

Hello! Hope it's OK if I join.

Did you use the same parameters across subgroups for all steps, such as primer removal and DADA2 settings? Different parameters will produce different ASVs even from the same sequences.
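To illustrate why this matters when merging subgroups: q2-dada2 by default labels each ASV with an MD5 hash of its sequence, so the same organism denoised with a different truncation length becomes a different feature in the merged table. A minimal sketch, using a made-up toy sequence and made-up truncation lengths (not values from this thread):

```python
import hashlib


def feature_id(seq: str) -> str:
    """MD5 hex digest of a sequence, mimicking hashed ASV feature IDs."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


# Toy sequence (hypothetical), standing in for one organism's amplicon.
amplicon = "ACGTTGCAAGCTTAGGCTAACGTTAGC"

# The same underlying sequence denoised in three batches.
batch1_asv = amplicon[:25]  # truncation length 25
batch2_asv = amplicon[:27]  # different truncation length
batch3_asv = amplicon[:25]  # same settings as batch 1

# Merging batches keeps one row per distinct feature ID.
merged_ids = {feature_id(s) for s in (batch1_asv, batch2_asv, batch3_asv)}
n_features_after_merge = len(merged_ids)
print(n_features_after_merge)
```

Batches 1 and 3 collapse into one feature, while batch 2 contributes a separate one even though it came from the same organism.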

I totally agree with @SoilRotifer that 77K unique sequences is a very high number.
It is a good idea to filter out sequences based on total count (for example, sequences that were counted fewer than 10 times) and/or prevalence (for example, sequences that were found in fewer than 3 samples). These two steps may significantly speed up tree construction, since ASVs that were found only 1-3 times in 1-2 samples most likely do not carry any meaningful information for analyses but do slow down alignments. Also, it is possible to assign taxonomy first and remove sequences from organelles before tree construction.
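The count and prevalence filters described above can be sketched in plain Python. This is only an illustration of the logic on a hypothetical toy table; the thresholds (10 reads, 3 samples) are the example values from this post, not recommendations.

```python
from typing import Dict

# Toy feature table (hypothetical): feature ID -> {sample ID: read count}.
table: Dict[str, Dict[str, int]] = {
    "asv_common": {"s1": 120, "s2": 85, "s3": 40, "s4": 10},
    "asv_singleton": {"s1": 1},
    "asv_rare": {"s1": 2, "s2": 1},
    "asv_one_bloom": {"s3": 500},  # abundant, but present in a single sample
}


def filter_features(
    table: Dict[str, Dict[str, int]],
    min_frequency: int = 10,
    min_samples: int = 3,
) -> Dict[str, Dict[str, int]]:
    """Keep features with total count >= min_frequency seen in >= min_samples samples."""
    kept = {}
    for fid, counts in table.items():
        total = sum(counts.values())
        prevalence = sum(1 for c in counts.values() if c > 0)
        if total >= min_frequency and prevalence >= min_samples:
            kept[fid] = counts
    return kept


filtered = filter_features(table)
print(sorted(filtered))
```

In QIIME 2 itself, the corresponding step should be qiime feature-table filter-features (its --p-min-frequency and --p-min-samples parameters), followed by qiime feature-table filter-seqs so the representative sequences match the filtered table. Check --help for your installed version.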

Best,

aabrams · March 29, 2024, 5:23pm

Thank you for your response and this information!

We amplified the V1-V3 region (27F and 519R), with 2x300 paired-end reads on Illumina MiSeq. The reads were demultiplexed; then, using DADA2, I trimmed 20 bp to remove the primers and selected truncation parameters based on quality information (obtained from the run as a whole) and by trying a few variations to see what happened with various truncation parameters. Since my laptop could not handle that many files at once, I divided the files into groups of 30 samples (anecdotally, I had heard that ~40 samples of paired-end reads was about the max most laptops can handle).

qiime dada2 denoise-paired --i-demultiplexed-seqs 1INDV_BRDC_paired-end-demux.qza --p-trim-left-f 20 --p-trim-left-r 20 --p-trunc-len-f 290 --p-trunc-len-r 270 --o-table 1INDV_table.qza --o-representative-sequences 1INDV_rep-seqs.qza --o-denoising-stats 1INDV_denoising-stats.qza

I then merged all of the feature tables and rep-seqs.qza files using qiime feature-table merge and qiime feature-table merge-seqs, respectively.

I was following the "Atacama soil microbiome" and "Moving Pictures" tutorials. Most QIIME 2 tutorials use de novo tree construction with FastTree, and I am still new enough that the variations on some of these methods are a little intimidating. But I actually just watched the QIIME 2 phylogenetic reconstruction video you posted on YouTube, and that has really helped my understanding of de novo vs fragment insertion. Do you think it would be a better approach to use fragment insertion, as described here: https://library.qiime2.org/plugins/q2-fragment-insertion/16/ , or to stay with FastTree since I can't change my RAM at the moment? As a side note, because I am trying to understand all of this better: from your video and some of the QIIME 2 resources I read, it sounds like fragment insertion is a more accurate tree construction method. If I have that correct, why is it not recommended over the FastTree methods used in most of the tutorials? Is it because it needs more RAM than FastTree?

I need to go back and find the specific post/tutorials citing the drawbacks to various methods....I have done a lot of googling and not saved them all.

I have also realized that I made a large error in not filtering prior to tree construction. Previously, when I was working with 16S data, I was provided a BIOM file and tree file and worked with the data from that stage on. One of my first steps was to filter out low reads, so I mistakenly thought I should wait to filter out low reads until after this step.

It hurts a little to abort the current job since it has been running for 3 days, but it sounds like I need to look at and process my data better first, since something may be a little fishy with 77,126 features for a data set of this size.

aabrams · March 29, 2024, 5:32pm

Thank you for your input as well! Yes, I am realizing I made a big error in thinking I should filter after tree construction and not before. I appreciate you providing specific filter values as well. I previously used a cutoff of 5 or fewer when working with OTUs a little while back, but I was going to use 10 on this data set since I have seen that used more commonly lately.

I did use the same parameters on all the subsets of data to avoid introducing variation from that. But I made sure to make truncation choices based on a QC report for the entire data set. I am going to filter the data and try again!

timanix · March 29, 2024, 5:45pm

Glad that the comment was useful to you. Please note that values I provided are just for example. I use them for most datasets but sometimes adjust based on the data.
Another thing that I would like to bring up is that you used DADA2 alone, without cutadapt. I always prefer to remove primers with cutadapt before DADA2 rather than trimming primers within DADA2.
There are a couple of reasons for that:

Cutadapt is more flexible and has a lot of settings to play with.
It has a feature I find very useful: it can discard sequences that don't contain primers. I guess this step may reduce the number of unique sequences by discarding reads with no primers in them.

In my experience, removing primers with cutadapt before DADA2 results in a lower number of unique ASVs without a significant reduction in overall counts. Reducing the number of unique ASVs will also speed up tree construction.

SoilRotifer · March 29, 2024, 5:52pm

To echo @timanix's comments about using cutadapt.

I'd avoid simply using the DADA2 trim options for removing primers. The reason is that there can be sequencing indels (insertions / deletions) within the first 20 or so bases of the read. When you trim based on a fixed length, you might arbitrarily keep or lose a base or two... this has the effect of drastically and erroneously inflating the number of unique ASVs.

This might be the reason why you have so many ASVs. I'd recommend running cutadapt with the --p-discard-untrimmed and --p-match-adapter-wildcards options. This will produce a much cleaner output with far fewer ASVs.
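A toy illustration of the difference between fixed-length trimming and primer-anchored trimming, in plain Python. The primer below is one commonly published 27F variant and the target is a made-up sequence; both are assumptions for illustration, so substitute your actual primers. The sketch uses exact IUPAC-aware matching only, whereas cutadapt performs error-tolerant alignment, so it shows the idea and is not a stand-in for cutadapt.

```python
import re

# IUPAC degenerate codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}


def primer_regex(primer: str) -> re.Pattern:
    """Compile a primer with degenerate bases into a regex (wildcard matching)."""
    return re.compile("".join(IUPAC[base] for base in primer))


def trim_fixed(read: str, n: int) -> str:
    """Drop the first n bases regardless of what they are."""
    return read[n:]


def trim_primer(read: str, pattern: re.Pattern):
    """Drop everything through the primer match; return None if no primer is found."""
    match = pattern.search(read)
    return read[match.end():] if match else None


PRIMER = "AGAGTTTGATCMTGGCTCAG"  # assumed 27F variant; M matches A or C
TARGET = "ATTGAACGCTGGCGGCAGGC"  # toy biological sequence (hypothetical)

reads = [
    "AGAGTTTGATCATGGCTCAG" + TARGET,        # clean read, M = A
    "AGAGTTTGATCCTGGCTCAG" + TARGET,        # clean read, M = C
    "G" + "AGAGTTTGATCATGGCTCAG" + TARGET,  # one extra base at the read start
    "TTTTTTTTTTTTTTTTTTTT" + TARGET,        # no primer at all
]

# Fixed-length trimming: the shifted read keeps a leftover primer base.
fixed_unique = {trim_fixed(r, len(PRIMER)) for r in reads}

# Primer-anchored trimming: shifted read is rescued, primer-less read is dropped.
pattern = primer_regex(PRIMER)
anchored = [trim_primer(r, pattern) for r in reads]
anchored_unique = {s for s in anchored if s is not None}
n_discarded = sum(1 for s in anchored if s is None)

print(len(fixed_unique), len(anchored_unique), n_discarded)
```

With fixed-length trimming the single shifted read produces an extra unique sequence, and the primer-less read slips through. With primer-anchored trimming all primer-containing reads collapse to the same sequence and the primer-less read is discarded.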

aabrams · March 29, 2024, 5:58pm

Thank you for the advice. I did look at using Cutadapt prior to Dada2, but most tutorials/forums were saying that it was unnecessary if you were using Dada2 since it performs this function and you only needed to use Cutadapt if barcodes were present....but maybe I should rethink this also and do more research as well.

SoilRotifer · March 29, 2024, 6:04pm

This is not entirely true, for the reasons I explained. Also, cutadapt is often used to simply remove the primer sequences from the reads. That is, enter in the PCR primer sequence in 5' - 3' direction for the forward and reverse primers. There are many examples in the forum on this specific use case.

See the following:

timanix · March 29, 2024, 6:07pm

Maybe somewhere in the ideal world...

After this comment from @SoilRotifer I would say that cutadapt before DADA2 is not only desirable but a "must-have" step to avoid artificial biases in alpha/beta diversity metrics.

aabrams · March 29, 2024, 6:26pm

Thank you @SoilRotifer and @timanix, I will use Cutadapt first from here on out. This is useful. @SoilRotifer, I had previously read some of those forum posts, but also noted that current tutorials were using DADA2 for primer trimming. Between that and a conversation with a few people currently using DADA2, I got the impression that over time DADA2's functionality had improved and that using Cutadapt first was no longer necessary or was just a matter of preference. It is helpful to know that I misunderstood and Cutadapt is still the best practice.

SoilRotifer · March 29, 2024, 6:30pm

No worries @aabrams. Actually, I'd highly suggest you try both approaches and compare the outputs for yourself. If you do compare them please let us know how similar or different they end up being. Every data set is different.