Weird taxonomic names

Hi All,

Not sure if this belongs here or in general discussion, but I've got a problem squaring taxonomy and phylogeny. I say this with the caveat that I'd expect some polyphyletic clades... except mine appears at the phylum level.

I have a 2 x 300 V3-V4 dataset where I merged the paired ends in vsearch, loaded them into QIIME, and then ran deblur. I built my tree using SEPP fragment insertion, and did taxonomic classification with a classifier I trained on my model. For taxonomic analysis, I filtered down to a set of ~300 ASVs that I considered "high abundance/high prevalence".

I wanted to make a heatmap where I ordered the two axes by the phylogenetic tree and by hierarchical clustering of my distance matrix. So, I imported my QIIME artefacts into Python and turned them into a pandas dataframe and two scikit-bio linkage matrices. As a sanity check/reference point, I used row labels colored by phylum and class.

All the clustering looked good and logical to me... except for the Tenericutes (orange) sitting in the middle of my Firmicutes (green). I think Tenericutes used to be considered part of Firmicutes, and I'd definitely expect to see some polyphyletic clades among Firmicutes at lower levels, but I was surprised to see it here.

So, I guess after a long explanation, I'm concerned there's something wrong in my pipeline. I think the point of failure could be (a) classification, (b) tree building, or (c) the linkage matrix. I was really careful when building the linkage matrix to order the tree with my dataframe. Did I just do something really stupid and the distances are large, but the frame ended up funky?
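For point (c), one quick check is to confirm that the rows of the table are in the same order as the IDs behind the distance matrix before the linkage is built. The sketch below is only an illustration: the ASV names, distances, and counts are made up, and average linkage is an arbitrary choice rather than what was used in the original analysis.

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

# Hypothetical tip-to-tip distances for four ASVs (symmetric, zero diagonal).
ids = ["asv1", "asv2", "asv3", "asv4"]
dist = pd.DataFrame(
    [[0.0, 0.1, 0.8, 0.9],
     [0.1, 0.0, 0.7, 0.8],
     [0.8, 0.7, 0.0, 0.2],
     [0.9, 0.8, 0.2, 0.0]],
    index=ids, columns=ids,
)

# Hypothetical abundance table whose rows arrive in a different order.
table = pd.DataFrame(
    {"sample_a": [5, 40, 3, 22], "sample_b": [7, 35, 1, 30]},
    index=["asv3", "asv1", "asv4", "asv2"],
)

# Reindex the table to the distance matrix order, so that leaf i of the
# linkage refers to row i of the table.
table = table.loc[dist.index]
assert list(table.index) == list(dist.index)

# Build the linkage from the condensed distances and read back the leaf order.
row_linkage = linkage(squareform(dist.values), method="average")
leaf_order = [dist.index[i] for i in leaves_list(row_linkage)]
```

If the assertion passes and `leaf_order` keeps close relatives next to each other, the row ordering is probably not the source of the problem.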

Im re-assured by the fact that the rest of the phyla and clades cluster pretty well. But, advice or suggestions for debugging would be welcome!

Thanks,
Justine

Hi @jwdebelius,
Very interesting! It is reassuring to see that the classification and phylogeny correspond in all other ways. Let’s assume it’s not the linkage matrix, since everything else looks good and since you have good control over that step.

I recommend starting with classification. It looks like you have two features classified as tenericutes? I suggest grabbing those sequences and using NCBI BLAST to confirm phylum and class-level affiliations.
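A minimal sketch of pulling those sequences out for BLAST, assuming the representative sequences have already been exported to a FASTA file. The feature IDs and sequences here are placeholders, and `read_fasta` and `subset_fasta` are hypothetical helper names, not part of QIIME 2.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def subset_fasta(handle, wanted_ids):
    """Return FASTA text containing only the requested identifiers."""
    wanted = set(wanted_ids)
    return "".join(
        f">{name}\n{seq}\n" for name, seq in read_fasta(handle) if name in wanted
    )


# Placeholder sequences standing in for an exported FASTA file.
example = StringIO(">asv1\nACGT\nACGT\n>asv2\nTTGA\n>asv3\nGGCC\n")
query = subset_fasta(example, ["asv1", "asv3"])
print(query)
```

The resulting text can be pasted into the NCBI BLAST web form. With a real file, `open(path)` would replace the `StringIO` object.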

You should also grab some of the Firmicutes that are part of the same branch… it is a bit difficult to discern but it looks like Firmicutes are actually split into two separate branches, with the branch containing the putative tenericutes (whatever the light green class is) clustering with the purple phylum.

Next you could manually inspect the tree: maybe look at the reference tree to see where Tenericutes sit relative to other Firmicutes, and whatever that light green class is.

@Nicholas_Bokulich,

Thanks so much for your suggestions. The first Tenericutes (the light orange) mapped to a full sequence in NCBI. The second came up with a mixed identity, the majority being Bacilli (the light green). (The in-between is confirmed as a genus Peptococcus.) And the confidence on all three assignments is more than 0.9999, which is higher than like 50% of my assignments. I tried using the SEPP classifier (I know it's experimental, but I figured it might be easier to figure out based on placement), and I get the same Tenericutes classification for that sequence.

So, I guess going forward, does it make sense to just find a different way to display my data so the linkage doesn't look quite so weird? (There is something about the way clustermap renders the linkage from the distance matrix vs. the way it gets rendered if I pass the distance matrix alone, see below.) But that doesn't explain the discrepancy between NCBI and Greengenes.
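One possible contributor to the rendering difference, offered as an assumption rather than a diagnosis: when a square distance matrix is passed as the data with no precomputed linkage, each row is treated as a vector of observations and new (by default Euclidean) distances are computed between rows. That is not the same as clustering on the distances themselves. The toy matrix below is made up, and average linkage is an arbitrary choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Made-up symmetric distance matrix for four tips.
square = np.array(
    [[0.0, 0.1, 0.8, 0.9],
     [0.1, 0.0, 0.7, 0.8],
     [0.8, 0.7, 0.0, 0.2],
     [0.9, 0.8, 0.2, 0.0]]
)

# Clustering on the distances themselves: convert to condensed form first.
from_condensed = linkage(squareform(square), method="average")

# What effectively happens when the square matrix is treated as plain data:
# Euclidean distances are computed between its rows before clustering.
rows_as_vectors = linkage(pdist(square, metric="euclidean"), method="average")

# Column 2 of a linkage matrix holds the merge heights.
print(from_condensed[:, 2])
print(rows_as_vectors[:, 2])
```

The two sets of merge heights differ, so the dendrograms can differ in branch lengths and sometimes in topology.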

So, yeah, I would love additional suggestions.

Best,
Justine

Hi Justine,

Nice picture. I’m interested in how you made this heatmap. Was this made using only QIIME 2?

Thanks,
FS

Hi @fstudart,

This is a heatmap of my phylogenetic tree, but similar principles apply. To generate this from a filtered tree and table qza, I did the following:

  1. Import into Python via the Python API and view as an skbio.TreeNode
  2. Load the taxonomy via the same API and view as a pd.Series, followed by a clean-up step I tend to use that manages inheritance for missing strings and separates them into individual columns
  3. I think the rows use the linkage matrix I built from the tree object, so the tree got converted to a distance matrix and then a linkage: scipy.cluster.hierarchy.linkage(tree.tip_to_tip_distances().condensed_form())
  4. Converted taxonomic names into colors
  5. Passed to seaborn clustermap.

Best,
Justine

Hi Justine,

Thanks very much for the detailed information.

FS
