Ghost tree filtering?

Sorry to hear that @Jennifer_Fouquier, I hope all is well :heart_decoration:


@ihoxie -- I think you can start moving forward based on my recommendations in the meantime.

No rush @Jennifer_Fouquier, hope everything goes okay!
@thermokarst Thanks for checking it out!
The feature ID issue was from trying to match the pre-made trees to my data as opposed to using the plugin, but I guess there’s a mismatch in both.
When I tried using sh_refs_qiime_ver7_dynamic_01.12.2017.qza for the extensions_cluster step and then ran ghost-tree scaffold-hybrid-tree-foundation-alignment, I got an error:
Traceback (most recent call last):
  File "/Users/ihoxie/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py", line 274, in call
    results = action(**arguments)
  File "", line 2, in scaffold_hybrid_tree_foundation_alignment
  File "/Users/ihoxie/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
    output_types, provenance)
  File "/Users/ihoxie/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 362, in callable_executor
    output_views = self._callable(**view_args)
  File "/Users/ihoxie/q2-ghost-tree/q2_ghost_tree/_scaffold_hybrid_tree_foundation_alignment.py", line 44, in scaffold_hybrid_tree_foundation_alignment
    gt_path, graft_level, None)[0]
  File "/Users/ihoxie/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/ghosttree/scaffold/hybridtree.py", line 104, in extensions_onto_foundation
    graft_level)
  File "/Users/ihoxie/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/ghosttree/scaffold/hybridtree.py", line 205, in _extension_genus_accession_dict
    taxonomy = accession_taxonomy_dic[i]
KeyError: 'SH124384.07FU_AY997045_refs_singleton'

Plugin error from ghost-tree:

'SH124384.07FU_AY997045_refs_singleton'

See above for debug info.

I looked through the files and can’t find that ID in all the files, but the format appears identical, so not sure how you tell if the files match. Sorry for the confusion!

That error is because of this:

This is an issue with the source data — the IDs should be the same between the taxonomy and the sequences, otherwise how can you figure out which one belongs to which? Maybe you should double check that you used the right input files (for example, didn’t mix-and-match database versions).
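If it helps, here is a minimal sketch of how one might check this before running ghost-tree. It assumes the sequences are plain FASTA and the taxonomy is tab-separated with the ID in the first column; the helper names and the two tiny records are made up for illustration and are not part of ghost-tree or QIIME 2.

```python
import io


def fasta_ids(handle):
    """Return the set of sequence IDs (first token after '>') in a FASTA file."""
    ids = set()
    for line in handle:
        if line.startswith('>'):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def taxonomy_ids(handle):
    """Return the set of IDs found in the first tab-separated column."""
    ids = set()
    for line in handle:
        line = line.rstrip('\n')
        if line.strip():
            ids.add(line.split('\t')[0])
    return ids


# Made-up records standing in for the real files; with real data, pass
# open('your-seqs.fasta') and open('your-taxonomy.txt') instead.
seqs = io.StringIO('>SH1_refs\nACGT\n>SH2_refs_singleton\nGGCC\n')
taxonomy = io.StringIO('SH1_refs\tk__Fungi;p__Ascomycota\n')

# Any ID listed here has a sequence but no taxonomy entry, which is the
# kind of gap that surfaces as a KeyError during the taxonomy lookup.
missing = sorted(fasta_ids(seqs) - taxonomy_ids(taxonomy))
print(missing)
```

An empty list would mean every sequence ID has a taxonomy entry.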

@thermokarst thanks for looking into this for me. I agree with your comment about the feature IDs not matching, but that was later when she was trying to build a custom tree after the pre-built tree use failed for her.

Earlier in the thread (on her first file upload, not the second one) there is still an issue that she identified that is possibly caused by underscores in the IDs. I’m having the same issue she is where I see the IDs matching between her feature table and the pre-built ghost tree .nwk but it’s still failing when we run qiime diversity. I tried escaping the underscores but it still failed. This part should be super simple for users because they’re just using the qiime diversity plugin with a pre-built ghost tree .nwk file, not q2-ghost-tree. Any ghost tree .nwk is just a phylogenetic tree that they can use in qiime diversity analysis. Do you or any of the team have any insight about this underscore issue? Thank you!

Hi @Jennifer_Fouquier! I don't think there is an underscore-related issue. I dug into the provenance of @ihoxie's input files and it looks like two different versions of the database were accidentally used (taxonomy from one DB variant, seqs from another).

Screengrab from the UNITE DB:

| QZA | UUID | Source File | DB variant |
|---|---|---|---|
| m2taxonomy.qza | 5fc4a97c-d600-420f-8e5d-c921c055747b | sh_taxonomy_qiime_ver7_dynamic_s_01.12.2017.txt | "Includes singletons set as RefS (in dynamic files)." |
| sh_refs_qiime_ver7_dynamic_01.12.2017.qza | 37fdc128-7b5d-4ab0-b49c-ee30021f02e8 | sh_refs_qiime_ver7_dynamic_01.12.2017.fasta | "Includes global and 97% singletons." |

So, the taxonomy was imported from the first row above ('includes singletons'), while the seqs were imported from the second row ('include global & 97% singletons'). These two versions of the database use different ID schemes that don't fully overlap. @ihoxie, my suspicion is that this was done by accident. If that is the case, go ahead and choose two files from the same database variant, then try again.
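For anyone who wants to match a .qza file to a UUID like the ones in the table, here is a rough sketch. It relies on the usual artifact layout, where a .qza is a zip archive with a single top-level directory named after the artifact's UUID; `artifact_uuid` is a made-up helper name, and the archive below is a stand-in built in memory rather than a real artifact.

```python
import io
import zipfile


def artifact_uuid(qza):
    """Return the name of the single top-level directory in a .qza archive."""
    with zipfile.ZipFile(qza) as archive:
        roots = {name.split('/')[0] for name in archive.namelist()}
    if len(roots) != 1:
        raise ValueError(
            'expected exactly one top-level directory, found %d' % len(roots))
    return roots.pop()


# Stand-in archive with placeholder contents; with a real file, pass its
# path instead, e.g. artifact_uuid('m2taxonomy.qza').
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr(
        '37fdc128-7b5d-4ab0-b49c-ee30021f02e8/metadata.yaml', 'placeholder\n')
    archive.writestr(
        '37fdc128-7b5d-4ab0-b49c-ee30021f02e8/data/seqs.fasta', '>x\nACGT\n')

print(artifact_uuid(buffer))
```

The full provenance (including the source file names) is easier to browse by dragging the artifact onto view.qiime2.org; this only recovers the UUID.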

Thanks!

@thermokarst, we're looking at different files. :slight_smile: I'm not talking about when she tried to build her own ghost tree. She should be able to use the pre-built trees, and that still wasn't working for her. That's why she later tried to build a custom ghost tree (and yes, those IDs are mismatched). These are the files reposted from her earlier post on this thread. I think if you read this and then re-read my previous comment it should make sense. She should just be able to use these files and run qiime diversity. Let me know if I need to clarify more.

Thanks!

table-cr-973.qza (263.0 KB)
ghost-tree-midpoint-root2.qza (493.8 KB)
Metadatajustsamples.tsv (2.7 KB)

Yep, all makes sense. I think that is actually the same problem, a case of mismatched IDs. Check out this demonstration:

import qiime2
import skbio
import biom

table_artifact = qiime2.Artifact.load('table-cr-973.qza')
tree_artifact = qiime2.Artifact.load('ghost-tree-midpoint-root2.qza')

table = table_artifact.view(biom.Table)
tree = tree_artifact.view(skbio.TreeNode)

table_ids = set(table.ids(axis='observation'))
tip_ids = {tip.name for tip in tree.tips()}

print(len(table_ids))
print(len(tip_ids))
print(len(table_ids.intersection(tip_ids)))

Those last three print:

3157
23574
0

As to why, I think your underscore comment leads us to the answer here:

So if ghost-tree is producing trees with underscores in the IDs, those IDs need to be wrapped in single quotes in the Newick file, otherwise the underscores will turn into spaces when the tree is read!
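To make the quoting rule concrete, here is a small pure-Python sketch. The two helper names are made up, the ID is invented in the style of the UNITE IDs, and `read_unquoted_label` only mimics the underscore-to-space conversion described above rather than being a real Newick parser.

```python
def quote_newick_label(name):
    """Wrap a tip label in single quotes so a Newick reader keeps it verbatim."""
    # Inside a quoted label, a literal single quote is written as two quotes.
    return "'" + name.replace("'", "''") + "'"


def read_unquoted_label(label):
    """Mimic how an unquoted Newick label is read: underscores become spaces."""
    return label.replace('_', ' ')


tip = 'SH000001.07FU_XX000001_refs'  # made-up ID, not a real UNITE accession

print(read_unquoted_label(tip))   # what the ID turns into when left unquoted
print(quote_newick_label(tip))    # the form to write into the .nwk file
```

Note that the whole label goes inside one pair of quotes; it is not a matter of putting a quote before each underscore.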

Hope that helps! :qiime2: :t_rex:

Oh, I left one thing out of that — even with the underscore issue aside, the table’s features are created from OTU clustering using sh_refs_qiime_ver7_dynamic_s_01.12.2017.fasta, which uses a slightly different ID scheme than the sh_refs_qiime_ver7_dynamic_01.12.2017.fasta file in the other variant. So even if the underscores were quoted, the IDs just aren’t quite the same.

import skbio

sh_refs_qiime_ver7_dynamic = skbio.io.registry.read('sh_refs_qiime_ver7_dynamic_01.12.2017.fasta', format='fasta')
sh_refs_qiime_ver7_dynamic_s = skbio.io.registry.read('sh_refs_qiime_ver7_dynamic_s_01.12.2017.fasta', format='fasta')

sh_refs_qiime_ver7_dynamic_ids = {s.metadata['id'] for s in sh_refs_qiime_ver7_dynamic}
sh_refs_qiime_ver7_dynamic_s_ids = {s.metadata['id'] for s in sh_refs_qiime_ver7_dynamic_s}

print(len(sh_refs_qiime_ver7_dynamic_ids))
print(len(sh_refs_qiime_ver7_dynamic_s_ids))
print(len(sh_refs_qiime_ver7_dynamic_ids.intersection(sh_refs_qiime_ver7_dynamic_s_ids)))

The last three lines return:

30696
58049
29818

So there is some overlap between the two variants of that DB, but not full overlap. This will cause problems in a variety of places if you mix and match between the two.

Thanks for checking the provenance @thermokarst. Even if she used two different DBs (which I do not recommend), there was overlap with the IDs in her table and the tree, so when I escaped the underscores in the tree, qiime diversity should have worked for her files after filtering the table to contain only overlapping IDs (this is mentioned in the tutorial as well).

Not sure if you saw when I tagged you on this comment here, but a few weeks ago I tried to escape them: Ghost tree filtering? - #8 by Jennifer_Fouquier

If using a single quote before each underscore is the best way, I will give it a try again. :laughing: Thanks!

Totally, but there is not enough overlap. For the phylogenetic diversity methods in q2-diversity to work (or really any tool/method that I know of), the tree has to have a tip for every feature in the table. That is not the case here.

Yep! That should do it!

I can't stress this enough though: I don't think that the underscore/quoting situation will fully address the problem, because not all of the feature IDs in the table are present in the phylogenetic tree (even when quoted correctly). Either the table needs to be filtered to drop the features that are missing from the phylogeny (probably not a good move), or the tree and the table need to be constructed using the same reference database. Does that make sense? It is fine if the tree has more features than the table, so long as it is a superset of the table (every table ID is represented in the tree). :t_rex:

Yep, sorry, it was not clear to me that you were specifically looking for help with the underscore quoting, I was under the impression you were just generally looking for help on the topic. Sorry for the misunderstanding!

I put the underscores back into the tip IDs of the ghost tree (replacing the spaces) to demonstrate that, even then, not all of the table’s feature IDs are present in the phylogeny:

import qiime2
import skbio
import biom

table_artifact = qiime2.Artifact.load('table-cr-973.qza')
tree_artifact = qiime2.Artifact.load('ghost-tree-midpoint-root2.qza')

table = table_artifact.view(biom.Table)
tree = tree_artifact.view(skbio.TreeNode)

# in place ID update
for n in tree.tips():
    n.name = n.name.replace(' ', '_')

table_ids = set(table.ids(axis='observation'))
tip_ids = {tip.name for tip in tree.tips()}

print(len(table_ids))
print(len(tip_ids))
print(len(table_ids.intersection(tip_ids)))

The last three print statements return

3157
23574
1810

So, of the 3157 features in the table, only 1810 are present in the phylogeny, even after making sure the underscores are in the IDs. Put another way, 1347 features found in the table are missing from the phylogeny. I hope that helps!

Hi @ihoxie, so I had an epiphany recently when @thermokarst mentioned that the Newick format by design converts underscores to spaces if the ID is not placed into single quotes. This is mentioned inconsistently in Newick format documentation and when ghost-tree was developed I was unaware of this. The original UNITE IDs I was working with did not have underscores, so it just never came up. So your issues were 100% not your fault. Thanks to you and Matthew for working with me on this! I know it was time consuming. :slight_smile: :trophy:

You can find newly built and correctly formatted trees for the s_02.02.2019 and 02.02.2019 UNITE versions here.

I like the 80% clustered ones so that you don't discard a lot of unclassified organisms, but that's up to you depending on your project and desired accuracy vs lost data.

I wanted to make sure I could get your data through to Emperor so I checked it here using the original files you sent me and the feature table was filtered to remove IDs that are not found in the tree you gave me. Please note though, that as @thermokarst mentioned, you were using inconsistent databases accidentally, so I would make sure to use the same UNITE database as well as the corresponding ghost tree.

Let me know if you have any questions! :slight_smile:

Thank you so much for all your help and for updating everything!
Sorry I must not have caught the “s” in the different files

*Also as a side note, in case anyone else does this: for a while I kept trying “feature-table filter-samples” instead of “feature-table filter-features” against the accessions file, so obviously it was just filtering out every sample…
Once I tried the new 2019 versions and replaced the accessions file spaces with underscores, it all worked!
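In case it is useful to others, the space-to-underscore replacement can be sketched in a few lines of Python. This assumes the accessions file has one ID per line with a header row on the first line; the `underscore_ids` name, the header text, and the IDs below are all made up for illustration, so check them against your own file.

```python
import io


def underscore_ids(lines, has_header=True):
    """Yield each line with spaces inside the ID replaced by underscores.

    When has_header is True, the first line is passed through unchanged.
    Blank lines are dropped.
    """
    for number, line in enumerate(lines):
        text = line.rstrip('\n')
        if number == 0 and has_header:
            yield text
        elif text.strip():
            yield text.strip().replace(' ', '_')


# Made-up file contents; with real data, iterate over open('your-ids.txt').
accessions = io.StringIO(
    'feature-id\n'
    'SH000001.07FU XX000001 refs\n'
    'SH000002.07FU XX000002 reps\n'
)

fixed = list(underscore_ids(accessions))
print('\n'.join(fixed))
```

The rewritten file is then what gets passed to qiime feature-table filter-features through --m-metadata-file, alongside --i-table and --o-filtered-table.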

@Jennifer_Fouquier Hi, I am new to QIIME 2 ITS analysis and am following the Fungal ITS analysis tutorial with the UNITE database unite-ver7-99-seqs-01.12.2017. I got as far as taxonomic classification and would now like to analyse beta diversity using q2-ghost-tree. I re-rooted the pre-built ghost tree ghost_tree_80_qiime_ver7_99_01.12.2017/ghost_tree.nwk and got ghost-tree-midpoint-root.qza. Next, I have to filter my feature table using ghost_tree_80_qiime_ver7_99_01.12.2017/ghost_tree_extension_accession_ids.txt (if I understand the protocol correctly). I looked at the ‘qiime feature-table filter-features’ arguments in https://docs.qiime2.org/2019.1/tutorials/filtering but I can't work out what the input parameters should be: we can only give one input, e.g. our feature table, so how does it take reference from ghost_tree_extension_accession_ids.txt? I am a bit confused at this point. Please guide me from ghost-tree filtering to beta diversity in QIIME 2.
