Phylogenetic tree: Reconstruct vs. Prune - When to?

Hi,

This is a relatively simple question. When does one choose to reconstruct a tree vs. prune a phylogenetic tree? Are constructed trees comparable?

  1. Say I perform a meta-analysis, and I have a large collection of studies. I now want to split up the studies into subgroups based on a factor (say, country of study), and analyse each one independently. Do I have to reconstruct the tree?
  2. Say I pool multiple unrelated projects in one sequencing run. Can I analyse them as one batch until I obtain a phyloseq object, and then filter samples & prune taxa (& tree) into their individual phyloseq objects? Or should I split them up right after demultiplexing?

We use align-to-tree-mafft-fasttree, if it matters. How does it compare to raxml and iqtree?

Thanks,
Arval

Hi @AviTil,

I want to preface this by saying I am not a phylogenetic tree expert.

In QIIME 2, your phylogenetic tree can be a superset of your data, but it cannot be a subset. For example, if I had a meta-analysis where I built one large tree and then filtered out the different studies, I could reuse the large tree for each study. However, I couldn't use that large tree on a completely unrelated study, because the large tree is unlikely to contain all the ASVs from the unrelated study.
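
The superset rule can be checked before running any phylogenetic metric. Here is a minimal sketch in plain Python, assuming you have already exported the tree's tip names and the table's ASV IDs as collections of strings. The function name and the example IDs are hypothetical and are not part of QIIME 2:

```python
def missing_from_tree(table_asv_ids, tree_tip_ids):
    """Return the ASV IDs present in the table but absent from the tree.

    An empty result means the tree is a superset of the table, so the
    tree can be reused for that table.
    """
    return sorted(set(table_asv_ids) - set(tree_tip_ids))


# Hypothetical IDs for illustration only.
big_tree_tips = {"asv1", "asv2", "asv3", "asv4"}
study_a_table = {"asv1", "asv3"}      # filtered from the meta-analysis
unrelated_table = {"asv3", "asv9"}    # a study the tree never saw

print(missing_from_tree(study_a_table, big_tree_tips))    # []
print(missing_from_tree(unrelated_table, big_tree_tips))  # ['asv9']
```

Any ID reported as missing would have to be added to the tree (by rebuilding it) before the tree could be used with that table.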

I am not quite sure what you mean by this. Can you clarify? The diversity metrics resulting from the phylogenetic tree/trees will be most comparable with one large tree. Distances on a tree in QIIME 2 are typically relative to the study, so the distances on the tree will be more consistent if you are using the same superset tree for each study in your meta-analysis.

I think I answered this above, but you could create a superset tree (a tree that has all the ASVs from your meta-analysis) and use that for each of your subgroups.

I need a little more information to advise you here. Are these the same variable regions? Does it make sense to apply the same quality control and truncation to all your sequencing data? Are they coming off the same sequencing run? (DADA2 does sequencing-run-based corrections.)

Hope this helps!

Hi @cherman2,
Thanks for your reply. Let me clarify the question -

Question 1: Say, I constructed a phylogenetic tree with sequencing data from a single study (Tree A), and I also constructed a phylogenetic tree by pruning the super-tree from my meta-analysis just to include this study alone (Tree B). Would Tree A and B be comparable? Would the branch lengths and nodes be comparable? Note in this case, the two studies are from similar environments.

Question 2: Say I had a sequencing run that contained samples from two unrelated, completely different projects, say gut microbiome and soil. They sequenced the same V region, were multiplexed, and were pooled to be sequenced on the same lane & flowcell of the sequencing instrument. Can I analyse the data from the entire sequencing run as a single batch until I create a super-phyloseq object, and then split it into project-specific phyloseq objects later? Does the presence of ASVs unrelated to the environment influence the tree construction? Will the tree in the final pruned phyloseq object be the same as the tree I would obtain if I had split the data upstream and constructed individual trees for the 2 projects?

What are good data practices for these scenarios?

Thanks for the clarification,
Arval

Hi @AviTil

I do not believe that Tree A and Tree B would be comparable. As I said above, the distances on a tree in QIIME 2 are typically relative to the ASVs that you built the tree with, so Tree A and Tree B are most likely not comparable. Their resulting phylogenetic diversity metrics are likely not comparable either. This is why I was pushing towards a superset tree (made with all studies) that you never prune. This is really the only way to guarantee that the phylogenetic distances on the tree are comparable across groups.

Great question! The tree would almost certainly be affected by the two extremely different environments you are building it from.

My next question here is, if these are two completely different and unrelated environments, what's the goal of comparing the studies' trees?

As for context regarding question 2: I am the lab technician in my group, tasked with creating amplicon libraries, sequencing them, and getting the data back to the project students as a phyloseq object. So I was wondering at what point I should subset the data into the respective projects. Again, to clarify, I'm not comparing between studies. I am deciding whether to subset the samples right after DADA2 and then proceed with tree building and classification independently for each project, or whether I can process the data en masse for the full sequencing run (generate the tree, classify, import into phyloseq) and then subset samples and prune taxa. I hope that clears up my use case.

Hi @AviTil,
If I were doing this, and I didn't care about relating the samples from the different studies, I would split based on study just after running DADA2, and then run the downstream steps on a per-study basis. Either way should work, but I think generating the tree with fewer sequences is likely to give better trees (and it will definitely be quicker).
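
To make the "split just after DADA2" step concrete, here is a minimal sketch in plain Python. It assumes the feature table has been loaded as nested dictionaries and that a sample-to-project mapping is available. The function names, sample IDs, and counts are all hypothetical, and this is not how QIIME 2 or phyloseq store tables internally:

```python
from collections import defaultdict


def split_table_by_project(table, sample_to_project):
    """Split a sample-by-ASV count table into one table per project.

    table: {sample_id: {asv_id: count}}
    sample_to_project: {sample_id: project_name}
    Zero counts are dropped, so an ASV absent from every sample of a
    project no longer appears in that project's table.
    """
    per_project = defaultdict(dict)
    for sample_id, counts in table.items():
        project = sample_to_project[sample_id]
        per_project[project][sample_id] = {
            asv: n for asv, n in counts.items() if n > 0
        }
    return dict(per_project)


def asvs_in(project_table):
    """Return the sorted ASV IDs observed in one project's table."""
    return sorted({asv for counts in project_table.values() for asv in counts})


# Hypothetical run containing gut and soil samples.
run_table = {
    "gut1": {"asvA": 10, "asvB": 5, "asvC": 0},
    "gut2": {"asvA": 3, "asvB": 0, "asvC": 0},
    "soil1": {"asvA": 0, "asvB": 0, "asvC": 7},
}
projects = {"gut1": "gut", "gut2": "gut", "soil1": "soil"}

split = split_table_by_project(run_table, projects)
print(asvs_in(split["gut"]))   # ['asvA', 'asvB']
print(asvs_in(split["soil"]))  # ['asvC']
```

Each project's ASV list is then the set of sequences that would go into that project's own tree and classification steps.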

All of that said, these de novo trees are known to only provide rough representations of the evolutionary relationships between the ASVs - none of them are great from the perspective of generating a reliable phylogenetic representation of the sequences being analyzed.

An alternative that would get you a better tree (but is specific to 16S) would be to use q2-fragment-insertion, as illustrated here. Alternatively, you could use a kmer-based approach, as illustrated here - this achieves very similar diversity calculation results without actually building a phylogenetic tree. In both of these cases, I would still probably split my studies just after DADA2 and run downstream steps on a per-study basis, but I don't think that decision is likely to have a big impact on the final results.
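
As a rough illustration of the general idea behind a kmer-based approach, here is a toy sketch in plain Python. It is an assumption about the concept only and is not the implementation used by any QIIME 2 plugin; the function names, sequences, and the choice of k = 3 are hypothetical:

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sample_kmer_profile(asv_counts, asv_seqs, k):
    """Sum k-mer counts over a sample's ASVs, weighted by ASV abundance."""
    profile = Counter()
    for asv, n in asv_counts.items():
        for kmer in kmers(asv_seqs[asv], k):
            profile[kmer] += n
    return profile


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two count profiles."""
    shared = sum(min(p[key], q[key]) for key in p.keys() & q.keys())
    total = sum(p.values()) + sum(q.values())
    return 1 - 2 * shared / total


# Toy sequences: asv1 and asv2 differ by one base, asv3 is unrelated.
asv_seqs = {"asv1": "ACGTAC", "asv2": "ACGTAT", "asv3": "TTTTGG"}
x = sample_kmer_profile({"asv1": 1}, asv_seqs, k=3)
y = sample_kmer_profile({"asv2": 1}, asv_seqs, k=3)
z = sample_kmer_profile({"asv3": 1}, asv_seqs, k=3)

print(bray_curtis(x, y))  # 0.25
print(bray_curtis(x, z))  # 1.0
```

Samples x and y share no ASVs, so an ASV-level Bray-Curtis would score them as completely different, yet their k-mer profiles are close because the underlying sequences are similar. That is how sequence relatedness can enter a diversity calculation without a tree.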

Hope this helps!

Hi @gregcaporaso

Thanks for the reply! This helps!