I’m working with about 110,000 representative 16S sequences (~436 bp average length) in QIIME 2 (2025.7). I’m planning to build a phylogenetic tree but am unsure whether to use the traditional de novo approach (MAFFT + FastTree) or to do fragment insertion using SEPP.
I tried running the MAFFT de novo pipeline on a node with 32 GB RAM, but it failed due to insufficient memory.
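For reference, the command looked roughly like this (output filenames simplified here):

```bash
# de novo pipeline: MAFFT alignment -> masking -> FastTree -> midpoint rooting
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --p-n-threads 8 \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza
```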
Could anyone advise on:

1. Which method generally works better or is more accurate for large datasets like this?
2. Which one tends to be more memory efficient?
3. Whether upgrading to 64 GB or more would be enough to run the de novo approach?
That is an awful lot of sequences! Generally, a de novo approach will use a lot of memory for both the alignment and the tree search, but you should be okay with 64 GB of RAM.
Some questions:

- Are these ASVs or OTUs?
- Have the primer sequences been removed from the reads? This can affect phylogenetic reconstruction / topology (a trimming sketch follows below).
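If the primers are still present, trimming is done on the raw reads prior to denoising, and the representative sequences are then re-derived. A minimal q2-cutadapt sketch, assuming paired-end data; the 515F/806R sequences shown are just examples, so substitute your own primers:

```bash
# Trim amplicon primers from demultiplexed paired-end reads.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-demux.qza
```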
When running the `align-to-tree-mafft-fasttree` pipeline, did you set `--p-parttree` for the alignment step? This will help reduce memory usage and run time; see the sketch below.
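If not, a re-run along these lines should fit in considerably less memory (filenames assumed to match your setup):

```bash
# PartTree is MAFFT's heuristic guide-tree method for very large inputs;
# it trades a little alignment accuracy for much lower memory use.
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --p-n-threads 8 \
  --p-parttree \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza
```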
The fragment insertion approach can also take a while for this many sequences, but it can be parallelized more easily than the de novo approach. That is, parallelization of de novo tree construction typically only pays off for longer sequences, and may not scale well beyond ~4 processors when dealing with ~436 bp. A sketch of a SEPP run is below.
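For the insertion route, a sketch assuming the Greengenes 13_8 SEPP reference database (a SILVA-based reference is also available; use whichever matches your data):

```bash
# Insert the representative sequences into the fixed SEPP reference tree.
qiime fragment-insertion sepp \
  --i-representative-sequences rep-seqs.qza \
  --i-reference-database sepp-refs-gg-13-8.qza \
  --p-threads 8 \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza
```

Keep in mind that any fragments SEPP cannot place are left out of the tree, so you'd filter those features from your table (e.g. with `qiime fragment-insertion filter-features`) before downstream diversity analyses.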
As for which is “better”, I’d recommend reading the fragment insertion paper. Differences between the two approaches vary by data set: for some data I’ve noticed no differences, for others I have.