several issues with ghost-tree (running and building)

mica.tosi · May 14, 2021, 3:53pm

Hello!

My colleagues and I are trying to use ghost-tree to get some phylogenetic information from fungal ITS data, but we came across several setbacks that we can’t solve. Since we have questions about a few different aspects and the other posts were quite inactive, we decided to create a new post.

We have questions on three topics (sorry!): one related to using pre-built ghost trees and two related to building our own tree, which is probably the best option because it lets us use up-to-date databases.

Using pre-built trees, we encountered an issue when running q2-diversity-core, as happened to other users (e.g., Ghost tree filtering? - #26 by Jennifer_Fouquier, Error when running Ghost tree): OTUs were not fully represented by the tree, even though the clustering was carried out with the exact same database. After re-reading posts and the ghost-tree paper, we realized we needed to filter the OTU table to keep only those IDs present in the tree. We did so, and it worked. Yet, we still have some questions/concerns about this table filtering step. Firstly, why is this still necessary if we clustered with the same database used for the ghost tree? Also, will it be necessary even if we build our own tree? And finally, considering it was discarding at least 25% of our OTUs (even higher with 90% or 100% ghost trees), are we not losing a significant amount of information when doing this? Maybe there’s a theoretical aspect we’re not quite grasping here, and you could help us ease our concerns.
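The filtering step described above can be sketched in plain Python. This is only an illustration on toy data, not the ghost-tree or QIIME 2 implementation: `newick_tip_names` and `filter_table_to_tree` are hypothetical helper names, the tree is a made-up Newick string, and the table is a simple dict of feature ID to per-sample counts. On real artifacts, `qiime phylogeny filter-table` is the usual way to do this.

```python
import re

# Matches a tip label: quoted or unquoted text right after "(" or ",".
# Labels that follow ")" are internal nodes and are ignored.
_TIP_LABEL = re.compile(r"[(,]\s*(?:'([^']*)'|([^\s'(),:;]+))")


def newick_tip_names(newick):
    """Return the set of tip labels found in a Newick string."""
    tips = set()
    for quoted, bare in _TIP_LABEL.findall(newick):
        label = quoted or bare
        if label:
            tips.add(label)
    return tips


def filter_table_to_tree(table, tips):
    """Keep only features whose ID is a tip in the tree; report the rest."""
    kept = {fid: counts for fid, counts in table.items() if fid in tips}
    dropped = sorted(fid for fid in table if fid not in tips)
    return kept, dropped


# Toy inputs (assumptions for illustration only).
tree = "(('otu_1':0.1,otu2:0.2)node_a:0.3,otu3:0.4);"
table = {
    "otu_1": [5, 0],
    "otu2": [3, 2],
    "otu3": [1, 1],
    "otu4": [0, 7],  # not in the tree, so it will be dropped
}

tips = newick_tip_names(tree)
kept, dropped = filter_table_to_tree(table, tips)
fraction_dropped = len(dropped) / len(table)
print(f"kept {len(kept)} features, dropped {dropped} ({fraction_dropped:.0%})")
```

Checking `fraction_dropped` before running diversity metrics makes the size of the data loss explicit instead of leaving it hidden inside the filtering step.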

Building our own trees, we came across two issues:

We were able to build our tree using q2-ghost-tree scaffold-hybrid-tree-foundation-alignment but, once again, we have the IDs-not-matching problem in the diversity core. We suspect it is caused by underscores being replaced by spaces (Ghost tree filtering? - #26 by Jennifer_Fouquier), which would be solved by adding single quotes around the IDs, but we’re not sure how to do this. Firstly: should the single quotes be added to the ghost tree node IDs or to one or more of the files used in q2-ghost-tree scaffold-hybrid-tree-foundation-alignment? And, if they should be added to the ghost tree, how would we do that?
We were not able to run q2-ghost-tree scaffold-hybrid-tree-foundation-tree because it wouldn’t take the --i-foundation-taxonomy file we were using. We tried the SilvaTaxonomy file we imported according to the tutorial (Q2-ghost-tree Plugin: Community Tutorial for Creating Hybrid-Gene Phylogenetic Trees):

qiime tools import \
  --input-path tax_slv_ssu_132.txt \
  --type SilvaTaxonomy \
  --output-path tax_slv_ssu_132.qza \
  --input-format SilvaTaxonomyFormat

Yet, it didn’t work because it needed to be a FeatureData[Taxonomy] file, which means this artifact is only used to run q2-ghost-tree extract-fungi. The problem is that the tutorial loads the foundation taxonomy from an example file called “minitaxonomy_foundation.txt”, so we’re not sure which Silva file we need for this step. Which file would this be? And also, is there any difference between obtaining the ghost-tree from q2-ghost-tree scaffold-hybrid-tree-foundation-alignment or from q2-ghost-tree scaffold-hybrid-tree-foundation-tree?

Wow, that was a lot of questions, but I think that’s all. We’re sorry for sending such a long post, but we’re hoping you can help us understand, at least little by little, what’s going on with these issues and how we can solve them.

Thanks in advance!

mica

Jennifer_Fouquier · May 14, 2021, 4:42pm

Hi there! Thanks for your questions and the details. Sorry you are having trouble. I am actively working on very different projects right now so I will have to get to this early-mid next week but I will give it extra attention. If anything changes, please respond or update this post. Thank you!

mica.tosi · May 14, 2021, 7:45pm

Thanks for the quick response, Jennifer! I'll make sure to update the post if anything changes, but I think we've run out of ideas.

Jennifer_Fouquier · June 1, 2021, 10:20pm

Hi Mica,

I sent you a repaired file that contains single quotes and the Python script I used. Thanks for your patience!
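For anyone reading along, the quoting fix can be sketched as follows. This is not the script that was sent, only a minimal illustration of the idea: `quote_tip_labels` is a hypothetical helper name, the tree string is made up, and the sketch assumes there is no whitespace between a "(" or "," and the label, and that any already-quoted labels contain no commas or parentheses.

```python
import re

# An unquoted label that directly follows "(" or "," is a tip label.
# Labels after ")" (internal nodes) and branch lengths after ":" are skipped.
_UNQUOTED_TIP = re.compile(r"(?<=[(,])([^\s'(),:;]+)")


def quote_tip_labels(newick):
    """Wrap every unquoted tip label in single quotes.

    Many Newick readers treat an unquoted underscore as a space, so an ID
    such as SH1_FU can come back as "SH1 FU" and stop matching the feature
    table. Quoting the label keeps the underscore as written.
    """
    return _UNQUOTED_TIP.sub(lambda m: "'" + m.group(1) + "'", newick)


# Toy tree (an assumption for illustration only).
tree = "((SH1_FU:0.1,SH2_FU:0.2):0.3,SH3_FU:0.4);"
quoted_tree = quote_tip_labels(tree)
print(quoted_tree)
```

Labels that are already quoted are left untouched, so running the function twice gives the same result as running it once.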

Regarding the need to filter the OTU table: one issue with creating a ghost tree is that it uses taxonomy to join the two databases. There isn't a good direct link between them, so the only way so far is to use as much taxonomic information as possible to get to a decent phylogeny. Many of the taxonomic names are "unclassified", so where do they get placed? ghost-tree tries to group them with other identified organisms (Phoma, Candida, etc.), but if they don't have a spot, they get discarded. So a lot of those unclassified ones get dropped from the tree, meaning you can have an OTU that is in your table but has no match in your tree. So there is a decent amount of data lost.

So you should ask yourself: are you interested in understanding the differences in your samples using "some" genetic information and information about relatedness? If so, then ghost-tree is all we have for some marker genes, and I'd say 75% of the data gives you a lot of information. If you're fine with just count-based data, then you could also try Bray-Curtis or binary Jaccard, but of course there's no genetic information used in those metrics. You could also repeat the analysis using different metrics, keeping in mind that each one answers a different question about beta diversity: What are the differences between my communities if I use genetic information? What are the differences if I don't?

Let me know if this helped. I know it's a bit strange to see the data loss. If you group your OTUs at 85 before making a ghost-tree, that could help more because the unclassified OTUs will get grouped a bit better.
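To make the two non-phylogenetic metrics concrete, here is a small sketch computing them on toy counts for two samples. The function names and the numbers are assumptions for illustration; in practice you would get these distances from the QIIME 2 diversity plugin rather than from hand-written code.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def binary_jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |shared| / |present in either|."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # no features observed in either sample
    return 1 - len(present_a & present_b) / len(union)


# Toy counts for the same four OTUs in two samples (illustration only).
sample_1 = [10, 0, 5, 5]
sample_2 = [1, 0, 1, 18]

print("Bray-Curtis:", bray_curtis(sample_1, sample_2))
print("Binary Jaccard:", binary_jaccard(sample_1, sample_2))
```

The two samples contain exactly the same OTUs, so binary Jaccard sees no difference, while Bray-Curtis picks up the shift in abundances. Neither uses the tree, which is why neither requires dropping OTUs that are missing from it.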

Let me know if you're still stuck on some of the other qs and I can try to help now that I've revisited this a bit.

mica.tosi · June 2, 2021, 2:44pm

Thanks a lot, Jennifer, for the help and the answers.
I'll give it a try today, and I'll keep in mind the compromise you mention here. I think that, depending on the study, it might still be very useful even though part of the data has to be discarded.
Stay safe!