Creating a phylogenetic tree


I have a couple of questions about creating a phylogenetic tree. I have samples from different projects in one (16S rRNA) MiSeq run, and I generated the feature table with DADA2. Since I plan to run phylogeny-based diversity metrics, I will need a tree.

  • Why even create a new tree? Can we directly use the gg_13.8_99% phylogenetic tree, or would that only make sense if I had used closed-reference OTU picking instead of DADA2?

  • Assuming I have to create a new tree, do I need to create a phylogenetic tree per project or can I create one tree for the MiSeq run and keep using that for diversity analysis per project?

  • As mentioned here: “… the q2-fragment-insertion plugin, which is currently available as a Community Plugin… At some point we’ll likely transition to using this approach for phylogenetic reconstruction in cases where there is a reference tree available (as is the case for 16S).” So would you recommend using the gg reference tree as the starting point when creating a new tree for our own data, or would you suggest a de novo tree as shown in the Moving Pictures tutorial?




Hi Rich,

@why: UniFrac and Faith PD identify the correct tips in the phylogenetic tree by feature name. In the case of DADA2 the feature names are nucleotide sequences, but in the gg_13.8_99% tree they are OTU numbers. The “mapping” (and a bit more) between those two naming schemes can be realised with the q2-fragment-insertion plugin.
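To make the naming mismatch concrete, here is a minimal Python sketch with toy data. The sequences, the OTU numbers, and the helper `features_missing_from_tree` are all made up for illustration; it only mimics the name lookup a phylogenetic metric has to do and is not how QIIME 2 implements it.

```python
def features_missing_from_tree(feature_ids, tree_tips):
    """Return the feature IDs that have no tip of the same name in the tree."""
    tips = set(tree_tips)
    return sorted(f for f in feature_ids if f not in tips)


# Toy DADA2-style features: the ID is the sequence itself (assumption for illustration).
dada2_features = ["ACGTACGTAA", "ACGTTCGTAA", "TTGTACGTCC"]

# Toy reference tree whose tips are OTU numbers (made-up IDs).
reference_tips = ["1001", "1002", "1003"]

# Toy insertion tree: the reference tips plus the inserted fragments as new tips.
insertion_tips = reference_tips + dada2_features

# Against the plain reference tree, none of the features can be found by name.
print(features_missing_from_tree(dada2_features, reference_tips))

# Against the insertion tree, every feature has a matching tip.
print(features_missing_from_tree(dada2_features, insertion_tips))
```

The first call lists every feature as missing, which is why the reference tree alone cannot be used with a DADA2 table; the second call returns an empty list.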

@one tree per project: DADA2 sequences are independently placed into the very same reference phylogeny. Thus, you would get the same results either way. I suggest building one tree for your whole run, i.e. for the complete feature table. You can separate the projects afterwards. This will save you work and compute time.
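As a rough illustration of why one run-wide tree is enough, here is a small Python sketch. The tree, the counts, and the sample names are invented, and `faith_pd` is a hypothetical helper that assumes the common rooted definition of Faith PD (total branch length on the paths from a sample's observed tips up to the root). It is a sketch, not the QIIME 2 implementation.

```python
# Toy rooted tree as child -> (parent, branch_length); "root" has no entry.
TREE = {
    "n1": ("root", 1.0),
    "n2": ("root", 2.0),
    "A": ("n1", 0.5),
    "B": ("n1", 0.5),
    "C": ("n2", 1.0),
    "D": ("n2", 3.0),
}


def faith_pd(observed_tips, tree=TREE):
    """Sum the branch lengths on the paths from the observed tips to the root."""
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk upwards until we hit the root or a branch that was already counted.
        while node in tree and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


# One feature table for the whole run: sample -> {feature: count} (made-up counts).
run_table = {
    "projA_s1": {"A": 10, "B": 3},
    "projA_s2": {"A": 5, "C": 1},
    "projB_s1": {"C": 7, "D": 2},
}

# Split by project only after the tree exists; the tree itself is never rebuilt.
project_of = {"projA_s1": "A", "projA_s2": "A", "projB_s1": "B"}
for sample, counts in run_table.items():
    tips = [feature for feature, n in counts.items() if n > 0]
    print(project_of[sample], sample, faith_pd(tips))
```

Each sample's value depends only on where its own features sit in the shared tree, so filtering the table down to one project later does not change it.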

@denovo: NEVER use a de novo tree built from short sequences! Amplicon sequences don’t harbour enough phylogenetic signal to reconstruct a proper phylogenetic tree. Use q2-fragment-insertion instead! See Figure 2 in for detailed explanations.



I was able to install q2-fragment-insertion via conda, but I was wondering whether an install via Docker is in the works.


Hi @Stefan and @Richard_Rodrigues1, I figured I’d QIIME-in on this.

I agree with Stefan’s very good and succinct responses. I would only like to emphasize the following:

If you have a reference tree (and the associated sequences) for your marker gene, then you are likely better off using q2-fragment-insertion. This is especially true if your reference phylogeny (and its associated representative sequences) encompasses close relatives next to which your sequences can be reliably inserted. Obviously, this is not always possible; constructing a de novo tree may be your only option, and it can be sufficient. In many cases, though, making use of a curated reference set in which to perform fragment insertion may be ideal.

Many factors can affect the reliability of a given phylogeny. But I’d like to highlight a few here:

  1. The alignment method (e.g. de novo or reference based) and alignment algorithm / parameters used.
  2. The length of the alignment and whether or not the alignment is masked. As Stefan pointed out, short sequences often carry less phylogenetic information from which to build a robust phylogeny, though this depends on the number of informative sites and the observed alignment patterns.
  3. The phylogenetic model under which the phylogeny was constructed. Depending on the marker gene (region) used, this can also have a large impact on the resulting phylogeny. Your ability to use, or even test for, appropriate substitution models may be limited by the tools used (e.g. some offer only one or a few models). In the worst case, even an appropriate model may not help, as there is not enough sequence information to begin with!

For points 2 & 3, IQ-TREE has some recommendations for constructing “more reliable” phylogenies when you have to build a de novo tree from very short reads. I’ve tried to outline some of that here. That being said, I generally agree with Stefan’s assessment, that is, to use q2-fragment-insertion. But it does not hurt to also compare with the de novo approach. I hope this adds a little insight.

-Best wishes on your 🌲 building! 🙂


@docker: I hope to find some time soon to migrate the q2-fragment-insertion plugin into the qiime2 core distribution. Once that is accomplished, it should be available by default, i.e. also in the qiime2 docker container.


A couple more related questions:

  • Is Greengenes 13_8 at 99% the default for “--i-reference-alignment” and “--i-reference-phylogeny”? If so, do we not need to provide the paths to those artifacts?

  • Is “classify-otus-experimental” still in the experimental phase?

  • How is “classify-otus-experimental” related to (or different from) the “feature-classifier”? Is the latter preferred for assigning taxonomy to the feature table?

  • Do we perform “fragment-insertion filter-features” on the feature table before or after normalization?

  • I understand the rationale of “rejecting insertion of fragments that are too remotely related to everything in the reference alignment/phylogeny.” However, ignoring these “unknown or dark matter” microbes may affect our interpretation of diversity and even our ability to investigate whether they are involved in a particular disease. That said, I was wondering if we should use both (filtered + UniFrac) and (unfiltered + non-phylogenetic) methods to get a better idea of diversity. Alternatively, would it be useful to NOT discard these remotely related fragments when making the insertion tree?



Hi Rich,

I’ll let Stefan address your first set of questions, but I can speak to the last point. Given your interest in potentially “unknown or dark matter” microbes, it might be worth simply comparing the fragment insertion output to that of the non-phylogenetic approaches and (if you’re curious) the de novo alignment and tree building approach.

This is generally a good idea, as it helps with sanity-checking your data and analysis steps (e.g. how sensitive is your data to these different approaches?). In some cases the insertion and de novo approaches give similar results; in others they differ. If anything, comparing these outputs will help you get a handle on which features may be driving the patterns in your data. Hopefully, these different outputs help you decide which approach(es) you can use to best interpret your data.
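One simple way to put a number on “how sensitive” is to correlate the pairwise sample distances you get from each approach. The sketch below uses made-up distances for three sample pairs and a hand-written `pearson` helper; the variable names are hypothetical, and a real comparison would use your own distance matrices and a permutation-based test (e.g. a Mantel test) rather than a bare correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# The same sample pairs, in the same order, for every approach (toy values).
pairs = [("s1", "s2"), ("s1", "s3"), ("s2", "s3")]
insertion_unifrac = [0.20, 0.55, 0.60]  # distances using the insertion tree
denovo_unifrac = [0.25, 0.50, 0.70]     # distances using a de novo tree
bray_curtis = [0.30, 0.80, 0.75]        # non-phylogenetic distances

print("insertion vs de novo:", pearson(insertion_unifrac, denovo_unifrac))
print("insertion vs Bray-Curtis:", pearson(insertion_unifrac, bray_curtis))
```

A correlation close to 1 suggests the approaches tell the same story for these samples; a low value is a cue to look at which features differ between them.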

Some Suggestions:
You can take the features that are not inserted into the tree and classify, BLAST, etc., these sequences to see if they are “real”. If so you can follow up with the non-phylogenetic or de novo phylogenetic approaches. Who knows, if these sequences can be convincingly shown to be “real”, but are not represented in the reference tree you are using, then you potentially have a strong lead for investigating these features further.
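Pulling out the non-inserted features can be as simple as a set difference followed by a FASTA dump. Here is a minimal Python sketch; the feature IDs, the sequences, and the helper `rejected_to_fasta` are hypothetical, and how you obtain the list of inserted IDs depends on your own workflow.

```python
import io


def rejected_to_fasta(rep_seqs, inserted_ids, handle):
    """Write the sequences whose IDs are absent from the tree in FASTA format."""
    inserted = set(inserted_ids)
    rejected = [fid for fid in rep_seqs if fid not in inserted]
    for fid in rejected:
        handle.write(f">{fid}\n{rep_seqs[fid]}\n")
    return rejected


# Toy representative sequences: feature ID -> sequence (made-up data).
rep_seqs = {"feat1": "ACGTACGTAA", "feat2": "ACGTTCGTAA", "feat3": "TTGTACGTCC"}

# Toy list of features that ended up as tips in the insertion tree.
inserted_ids = ["feat1", "feat3"]

# An in-memory buffer stands in for a real file opened with open(path, "w").
buffer = io.StringIO()
rejected = rejected_to_fasta(rep_seqs, inserted_ids, buffer)
print(rejected)
print(buffer.getvalue(), end="")
```

The resulting FASTA can then be classified or BLASTed to check whether those sequences look “real”.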

Secondly, q2-fragment-insertion allows you to use your own or other curated reference data sets. I think you should be able to download a reference tree and sequences from SILVA. There may be extra work involved to format and import these into QIIME.

Finally, and generally speaking, by forcing the insertion of poorly matched features into a tree, you end up introducing the same, or at least similar, phylogenetic artifacts that come with the de novo approach and that you are trying to avoid. That is why there is a threshold: features that do not meet the given criteria are not inserted. In this case, I’d recommend comparing with the non-phylogenetic approaches. It is quite common to iterate through a set of approaches. In fact, I typically make an effort to analyze my data using both phylogenetic approaches (i.e. de novo and fragment insertion) alongside the non-phylogenetic ones. There is often no single best way to explore a given data set. Thanks for your insightful questions! 🙂

I am sure Stefan will have more to add.


