Ghost tree filtering?

thermokarst · April 17, 2019, 11:33pm

Oh, I left one thing out of that — even with the underscore issue aside, the table’s features are created from OTU clustering using sh_refs_qiime_ver7_dynamic_s_01.12.2017.fasta, which uses a slightly different ID scheme than the sh_refs_qiime_ver7_dynamic_01.12.2017.fasta file in the other variant. So even if the underscores were quoted, the IDs just aren’t quite the same.

import skbio

sh_refs_qiime_ver7_dynamic = skbio.io.registry.read('sh_refs_qiime_ver7_dynamic_01.12.2017.fasta', format='fasta')
sh_refs_qiime_ver7_dynamic_s = skbio.io.registry.read('sh_refs_qiime_ver7_dynamic_s_01.12.2017.fasta', format='fasta')

sh_refs_qiime_ver7_dynamic_ids = {s.metadata['id'] for s in sh_refs_qiime_ver7_dynamic}
sh_refs_qiime_ver7_dynamic_s_ids = {s.metadata['id'] for s in sh_refs_qiime_ver7_dynamic_s}

print(len(sh_refs_qiime_ver7_dynamic_ids))
print(len(sh_refs_qiime_ver7_dynamic_s_ids))
print(len(sh_refs_qiime_ver7_dynamic_ids.intersection(sh_refs_qiime_ver7_dynamic_s_ids)))

The last three lines return:

30696
58049
29818

So there is some overlap between the two variants of that DB, but not full overlap. This will cause problems in a variety of places if you mix and match between the two.

Jennifer_Fouquier · April 18, 2019, 2:20pm

Thanks for checking the provenance @thermokarst. Even if she used two different DBs (which I do not recommend), there was overlap with the IDs in her table and the tree, so when I escaped the underscores in the tree, qiime diversity should have worked for her files after filtering the table to contain only overlapping IDs (this is mentioned in the tutorial as well).

Not sure if you saw when I tagged you on this comment here, but a few weeks ago I tried to escape them: Ghost tree filtering? - #8 by Jennifer_Fouquier

If using a single quote before each underscore is the best way, I will give it a try again. Thanks!

thermokarst · April 18, 2019, 2:28pm

Totally, but there is not enough overlap. For the phylogenetic diversity methods in q2-diversity to work (or really any tool/method that I know of), the tree has to have a tip for every feature in the table. That is not the case here.

Yep! That should do it!

I can't stress this enough though: I don't think that the underscore/quoting situation will fully address the problem, because not all of the feature IDs in the table are present in the phylogenetic tree (even when quoted correctly). Either the table needs to be filtered to drop the features that are missing in the phylogeny (probably not a good move), or the tree and the table need to be constructed using the same reference database. Does that make sense? It is fine if the tree has more features than the table, so long as it is a superset of the table (every table ID is represented in the tree).
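To make the superset rule concrete, here is a minimal sketch using plain Python sets. The function names and the toy IDs are made up for illustration; they are not part of q2-diversity or ghost-tree, and real IDs would come from the table and tree themselves.

```python
def tree_covers_table(table_ids, tip_ids):
    """Return True if every table feature ID has a matching tip in the tree."""
    return set(table_ids) <= set(tip_ids)


def missing_from_tree(table_ids, tip_ids):
    """Return the table feature IDs that have no matching tip in the tree."""
    return set(table_ids) - set(tip_ids)


# Toy IDs (hypothetical, not taken from the real files)
table_ids = ["SH1", "SH2", "SH3"]
tip_ids = ["SH1", "SH2", "SH4", "SH5"]

# Extra tips in the tree (SH4, SH5) are fine; a table ID without a tip (SH3) is not.
print(tree_covers_table(table_ids, tip_ids))
print(sorted(missing_from_tree(table_ids, tip_ids)))
```

If the second call returns anything, the table has features that the tree cannot place, which is the situation described above.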

Yep, sorry, it was not clear to me that you were specifically looking for help with the underscore quoting, I was under the impression you were just generally looking for help on the topic. Sorry for the misunderstanding!

thermokarst · April 18, 2019, 2:55pm

I added underscores to the IDs in the ghost tree to demonstrate that not all the table’s feature IDs are present in the phylogeny:

import qiime2
import skbio
import biom

table_artifact = qiime2.Artifact.load('table-cr-973.qza')
tree_artifact = qiime2.Artifact.load('ghost-tree-midpoint-root2.qza')

table = table_artifact.view(biom.Table)
tree = tree_artifact.view(skbio.TreeNode)

# in place ID update
for n in tree.tips():
    n.name = n.name.replace(' ', '_')

table_ids = set(table.ids(axis='observation'))
tip_ids = {tip.name for tip in tree.tips()}

print(len(table_ids))
print(len(tip_ids))
print(len(table_ids.intersection(tip_ids)))

The last three print statements return

3157
23574
1810

So, of the 3157 features in the table, only 1810 are present in the phylogeny, even after making sure the underscores are in the IDs. Put another way, the phylogeny is missing 1347 of the features found in the table. I hope that helps!

Jennifer_Fouquier · April 23, 2019, 6:43pm

Hi @ihoxie, so I had an epiphany recently when @thermokarst mentioned that the Newick format by design converts underscores to spaces if the ID is not placed into single quotes. This is mentioned inconsistently in Newick format documentation and when ghost-tree was developed I was unaware of this. The original UNITE IDs I was working with did not have underscores, so it just never came up. So your issues were 100% not your fault. Thanks to you and Matthew for working with me on this! I know it was time consuming.
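As a rough illustration of that Newick behaviour, the sketch below mimics the two cases in plain Python. The helper names and the example ID are hypothetical, and this is not the code ghost-tree uses; it only shows the convention that an unquoted underscore is read as a space while a single-quoted label is kept verbatim.

```python
def read_unquoted_label(label):
    """Mimic how a Newick reader interprets an unquoted label:
    underscores stand for spaces."""
    return label.replace("_", " ")


def quote_newick_label(label):
    """Wrap a label in single quotes so its underscores are kept as-is.
    A literal single quote inside a quoted label is written as two quotes."""
    return "'" + label.replace("'", "''") + "'"


tip_id = "SH000001.07FU_AB000001_refs"  # made-up ID for illustration

print(read_unquoted_label(tip_id))  # what a reader sees if the ID is unquoted
print(quote_newick_label(tip_id))   # the form that preserves the underscores
```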

You can find newly built and correctly formatted trees for the s_02.02.2019 and 02.02.2019 UNITE versions here.

I like the 80% clustered ones so that you don't discard a lot of unclassified organisms, but that's up to you depending on your project and desired accuracy vs lost data.

I wanted to make sure I could get your data through to Emperor so I checked it here using the original files you sent me and the feature table was filtered to remove IDs that are not found in the tree you gave me. Please note though, that as @thermokarst mentioned, you were using inconsistent databases accidentally, so I would make sure to use the same UNITE database as well as the corresponding ghost tree.

Let me know if you have any questions!

ihoxie · May 1, 2019, 5:25am

Thank you so much for all your help and for updating everything!
Sorry, I must not have caught the “s” in the different file names.

*Also as a side note, in case anyone else does this, for a while I kept trying “feature-table filter-samples” instead of “feature-table filter-features” against the accessions file, so obviously it was just filtering out every sample…
Once I tried the new 2019 versions and replaced the accessions file spaces with underscores, it all worked!
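For anyone doing the same space-to-underscore fix, here is a minimal sketch of it in Python. It assumes the accessions file holds one ID per line; the function name, the `skip_header` option, and the example IDs are made up for illustration, and whether your file has (or needs) a header line is something to check separately.

```python
def spaces_to_underscores(lines, skip_header=False):
    """Replace spaces with underscores in each non-empty ID line.

    If skip_header is True, the first non-empty line is kept unchanged.
    """
    fixed = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # drop blank lines
        if skip_header and not fixed:
            fixed.append(text)  # leave the header line untouched
        else:
            fixed.append(text.replace(" ", "_"))
    return fixed


# Made-up IDs standing in for lines read from the accessions file
example = ["SH000001.07FU AB000001 refs", "", "SH000002.07FU AB000002 refs"]
for fixed_id in spaces_to_underscores(example):
    print(fixed_id)
```

The fixed lines can then be written back out to a new file and used as the ID list when filtering features rather than samples.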

Rupesh_Sinha · November 12, 2019, 12:37pm

@Jennifer_Fouquier Hi, I am new to ITS analysis in qiime2 and am following the Fungal ITS analysis tutorial with the UNITE database unite-ver7-99-seqs-01.12.2017. I have got as far as taxonomic classification and would now like to analyse beta diversity using q2-ghost-tree. I have re-rooted the pre-built ghost tree ghost_tree_80_qiime_ver7_99_01.12.2017/ghost_tree.nwk and got ghost-tree-midpoint-root.qza. Next, if I understand the protocol correctly, I have to filter my feature table using ghost_tree_80_qiime_ver7_99_01.12.2017/ghost_tree_extension_accession_ids.txt. I looked at the ‘qiime feature-table filter-features’ arguments in https://docs.qiime2.org/2019.1/tutorials/filtering but I cannot work out what the input parameters should be: we can only give one input (e.g. our feature table), so how does the command take its reference from ghost_tree_extension_accession_ids.txt? I am a bit confused at this point. Please guide me from ghost-tree filtering through to beta diversity in qiime2.