feature_ids must be present as tip names

nick-youngblut · November 1, 2020, 4:22pm

I imported a taxon count table (tsv) by converting it to biom and then importing it into qiime. I imported a newick file of the corresponding taxa (plus extra tips) via qiime import. When I run qiime diversity core-metrics-phylogenetic, I get the following error:

/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/31e4fc74/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:1575: DataConversionWarning: Data was converted to boolean for metric jaccard
  warnings.warn(msg, DataConversionWarning)
Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/31e4fc74/lib/python3.6/site-packages/q2_diversity/_alpha/_method.py", line 54, in alpha_phylogenetic
    tree=phylogeny)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/31e4fc74/lib/python3.6/site-packages/skbio/diversity/_driver.py", line 170, in alpha_diversity
    counts, otu_ids, tree, validate, single_sample=False)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/31e4fc74/lib/python3.6/site-packages/skbio/diversity/alpha/_faith_pd.py", line 136, in _setup_faith_pd
    _validate_otu_ids_and_tree(counts[0], otu_ids, tree)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/31e4fc74/lib/python3.6/site-packages/skbio/diversity/_util.py", line 104, in _validate_otu_ids_and_tree
    " ".join(missing_tip_names)))
skbio.tree._exception.MissingNodeError: All ``otu_ids`` must be present as tip names in ``tree``. ``otu_ids`` not corresponding to tip names (n=1928): s__Agathobacter_sp900317585 s__Collinsella_sp900544725 s__Bacteroides_clarus s__CAG-83_sp001916855 s__Collinsella_sp900543515 s__CAG-485_sp900554845 s__UBA7160_sp002491565 ...

If I export both the feature table .qza & tree .qza files, I see that all of these labels actually overlap, so why am I getting this error?? According to qiime, ~20 of the "features" overlap between the count table and the tree. AFAIK, there's no way to easily get the intersect of features in a count table and tree artifact, correct? That would be helpful in situations like this.

I'm running qiime2-2019.10 installed via conda.
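On the question of getting the intersect of features between a count table and a tree: a minimal sketch in plain Python, assuming the feature IDs and the Newick string have already been exported from the artifacts. The names `tip_names` and `compare_ids` and the toy data are hypothetical, and the regex is a simplified parser that ignores Newick comments and escaped quotes, so a real Newick reader would be more robust.

```python
import re

# Matches a tip label: a quoted or bare name that directly follows "(" or ",".
# Internal node labels follow ")" and branch lengths follow ":", so both are skipped.
_TIP = re.compile(r"[(,]\s*(?:'([^']*)'|([^(),:;'\s]+))")


def tip_names(newick):
    """Return the set of tip labels found in a Newick string (simplified parser)."""
    return {quoted or bare for quoted, bare in _TIP.findall(newick)}


def compare_ids(feature_ids, newick):
    """Split IDs into those shared with the tree, missing from it, and tree-only."""
    features = set(feature_ids)
    tips = tip_names(newick)
    return {
        "shared": features & tips,
        "missing_from_tree": features - tips,
        "tree_only": tips - features,
    }


# Hypothetical toy data; real IDs would come from the exported table and tree.
newick = "((s__A_sp1:0.1,s__B-2_sp2:0.2)n1:0.3,'s__C sp3':0.4)root;"
feature_ids = ["s__A_sp1", "s__B-2_sp2", "s__D_sp4"]

result = compare_ids(feature_ids, newick)
for name, ids in result.items():
    # repr() exposes stray whitespace or quotes that look identical when printed
    print(name, sorted(repr(i) for i in ids))
```

Printing the IDs with `repr()` is deliberate: labels that differ only by trailing whitespace or quoting look the same in a terminal but will not match as set members.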

nick-youngblut · November 1, 2020, 4:31pm

I did some "manual" assessment of feature id overlap, and these are the only "features" to overlap between the count table and the tree:

s__14-2_sp000403845
s__Prevotella_timonensis
s__Intestinimonas_sp900540545
s__Muribaculum_sp003150235
s__Angelakisella_sp004554485
s__RC9_sp900546925
s__Enterocloster_sp900551225
s__Streptococcus_sp000448565
s__Clostridioides_difficile
s__RUG147_sp900315495
s__CAG-110_sp004555705
s__Intestinibacter_sp900540355
s__Slackia_A_sp900555495
s__UBA1777_sp900320465
s__CAG-411_sp000437275

I have no clue why only these would be found to overlap. All ~2000 taxa are present in the tree with the exact same labels.

nick-youngblut · November 1, 2020, 5:03pm

OK, I even pruned the input newick tree down to just the taxa in the taxon count table (using ape::drop.tip), and I still get:

All ``feature_ids`` must be present as tip names in ``phylogeny``. ``feature_ids`` not corresponding to tip names (n=1932): s__Collinsella_sp900551635 s__Parabacteroides_sp900552465 s__CAG-345_sp003497225 s__Phocaeicola_sp900541515 s__CAG-269_sp001916065 ...

I did check that the pruned tree and the feature table overlap in all labels prior to importing into qiime. I really don't get what the problem is.

jwdebelius · November 2, 2020, 8:23am

Hi @nick-youngblut,

I would check and see if there's any transformation of your labels. IIRC, there are certain characters the underlying packages may not play nicely with.
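One way to check for transformed labels, as a rough sketch: normalize both sets of names and report feature IDs that only match a tip after normalization. The helpers `normalize` and `near_matches` and the example labels are hypothetical, and the set of differences being collapsed (surrounding whitespace, quotes, underscores versus spaces versus dashes) is an assumption about what might have changed, not a statement of what any particular package does.

```python
def normalize(label):
    """Collapse differences that may have crept into a label during conversion."""
    cleaned = label.strip().strip("'\"")
    # Treat underscores, spaces, and dashes as equivalent for matching purposes only
    for ch in ("_", " "):
        cleaned = cleaned.replace(ch, "-")
    return cleaned


def near_matches(feature_ids, tip_names):
    """Map each unmatched feature ID to tips that match only after normalization."""
    tips = set(tip_names)
    by_norm = {}
    for tip in tips:
        by_norm.setdefault(normalize(tip), []).append(tip)
    return {
        fid: sorted(by_norm[normalize(fid)])
        for fid in feature_ids
        if fid not in tips and normalize(fid) in by_norm
    }


# Hypothetical labels illustrating a few ways names can drift apart
features = ["s__Bacteroides_clarus", "s__CAG-83_sp001916855", "s__Slackia_A_sp900555495"]
tips = ["s  Bacteroides clarus", "'s__CAG-83_sp001916855'", "s__Slackia_A_sp900555495"]

print(near_matches(features, tips))
```

An empty result here would suggest the mismatch is not a simple character substitution, which narrows the search to something else in the pipeline.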

Can I ask, is this GTDB data? I might have an imported GTDB tree somewhere for an older version of QIIME, but I'd need to do a little bit of digging for either the import code or the file (I have a vague sense of which parent folder, it's just a question of 8 subfolders).

Best,
Justine

nick-youngblut · November 2, 2020, 9:39am

Thanks for the suggestions! The weird thing is that the only "special" character in the feature/tip names is -, which is present in some of the feature/tips that correctly overlap (the list of 15 shown above). So, the problem does not appear to be caused by special names. Also, when I export the tree .qza back to newick, the names are the same as before the newick => qza reformatting.

The data is GTDB.

jwdebelius · November 2, 2020, 10:10am

I've found my code... here is what I appear to have done (no guarantees around anything)

Table:

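# Keep only the rows whose lvl_type is 'S1'
s1_table = kraken_table.loc[kraken_table['lvl_type'] == 'S1'].copy()
# Rename the sample columns to their sample names
s1_table.rename(columns=sample_ids.set_index('sample_col')['sample_name'].to_dict(), inplace=True)
# Use the taxon name as the feature ID and index the table by it
s1_table.rename(columns={'name': 'feature-id'}, inplace=True)
s1_table.set_index('feature-id', inplace=True)
# Strip whitespace from the feature IDs and swap underscores for dashes
s1_table.rename({c: c.strip().replace("_", '-') for c in s1_table.index}, inplace=True)
s1_table.columns.set_names('sample-id', inplace=True)
s1_table.head()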

Tree:

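# Replace every underscore in the Newick file with a dash
# (this touches all labels in the file, not only the tip names)
with open('phylogeny/gte50comp-lt5cont.nwk') as f_:
    a = f_.read()
a_new = a.replace("_", '-')
with open('phylogeny/gte50comp-lt5cont-dashes.nwk', 'w') as f_:
    f_.write(a_new)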

I'd also be careful with the Bracken table. I've spent so much time trying to clean up Bracken tables recently.

Best,
Justine

nick-youngblut · November 2, 2020, 10:32am

Thanks for sharing the code! It's always interesting to see how complex the code is when using pandas versus the tidyverse. I'm wondering why you replaced underscores with dashes. I'd default to the opposite in most situations, given that underscores are generally "safer" characters.

nick-youngblut · November 2, 2020, 11:12am

I tried using qiime 2020.8.0, and I no longer get the same error as when I ran 2019.10. Instead, I get:

/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:1761: DataConversionWarning: Data was converted to boolean for metric jaccard
  warnings.warn(msg, DataConversionWarning)
More threads were requested than stripes. Using -761554400 threads.
Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-421>", line 2, in core_metrics_phylogenetic
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 484, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/q2_diversity/_core_metrics.py", line 66, in core_metrics_phylogenetic
    threads=n_jobs_or_threads)
  File "<decorator-gen-537>", line 2, in unweighted_unifrac
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "<decorator-gen-395>", line 2, in unweighted_unifrac
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/q2_diversity_lib/_util.py", line 49, in _disallow_empty_tables
    return wrapped_function(*args, **kwargs)
  File "<decorator-gen-394>", line 2, in unweighted_unifrac
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/q2_diversity_lib/_util.py", line 92, in _validate_requested_cpus
    return wrapped_function(*bound_arguments.args, **bound_arguments.kwargs)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/q2_diversity_lib/beta.py", line 159, in unweighted_unifrac
    variance_adjusted=False, bypass_tips=bypass_tips)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/unifrac/_methods.py", line 103, in unweighted
    variance_adjusted, 1.0, bypass_tips, threads)
  File "unifrac/_api.pyx", line 102, in unifrac._api.ssu
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/skbio/stats/distance/_base.py", line 106, in __init__
    self._validate(data, ids)
  File "/ebio/abt3_projects/software/dev/ll_pipelines/llmgp/.snakemake/conda/d64706f7/lib/python3.6/site-packages/skbio/stats/distance/_base.py", line 873, in _validate
    "Data must be symmetric and cannot contain NaNs.")
skbio.stats.distance._base.DistanceMatrixError: Data must be symmetric and cannot contain NaNs.

Maybe it still is a tip-feature matching issue, but there's no indication of that in the warning. The error is really not helpful at all.

When I run core-metrics instead of core-metrics-phylogenetic, there's no such error, and the job completes successfully. Therefore, it's likely a problem with the tree, but how do I figure out the issue?

nick-youngblut · November 2, 2020, 1:25pm

I split up core-metrics-phylogenetic into all of the separate commands. All work including qiime diversity alpha-phylogenetic, except for qiime diversity beta-phylogenetic with (un)weighted unifrac. For UniFrac, I still get: Data must be symmetric and cannot contain NaNs.

If it's something wrong with the tree, then it only affects UniFrac and not Faith's PD.

thermokarst · November 2, 2020, 2:37pm

Please double-check the following:

jwdebelius · November 2, 2020, 3:21pm

@nick-youngblut,

I find R and tidyverse baffling, so... IDK. Too many individual functions! I think I replaced them because that was what worked when I spent half a day playing with the data? Usually if it's underdocumented (and this unfortunately was) the answer is "it worked".

Best,
Justine

nick-youngblut · November 2, 2020, 3:34pm

@jwdebelius Yeah, it comes down to personal preference. Your pandas code is very concise, but it can be hard to understand without really looking at it, whereas tidyverse functions are quite self-explanatory (e.g., select() selects columns and filter() filters rows). Regardless, thanks again for your help with this issue!

nick-youngblut · November 4, 2020, 7:31am

For those that run into this same issue: it only occurred when using qiime 2019.10 and not qiime 2020.8. For the 2020.8 version, I got the error Data must be symmetric and cannot contain NaNs.

This was due to my phylogeny containing NAs for a couple of branch lengths (probably introduced while manipulating the tree in R). The resolution to this issue can be found in issue #1733 of scikit-bio/scikit-bio on GitHub: "skbio.stats.distance._base.DistanceMatrixError: Data must be symmetric and cannot contain NaNs."
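For anyone wanting to check a tree for this before running UniFrac, here is a minimal sketch that scans a Newick string for branch lengths that are not finite numbers. The function name `bad_branch_lengths` and the example tree are hypothetical, and the regex assumes labels are unquoted and contain no ":" characters; it only locates the problem values and is not the fix described in the linked issue.

```python
import math
import re

# A branch length is whatever follows ":" up to the next structural character.
_LENGTH = re.compile(r":\s*([^,();\s]+)")


def bad_branch_lengths(newick):
    """Return branch-length tokens that are not finite numbers (e.g. NA, NaN)."""
    bad = []
    for token in _LENGTH.findall(newick):
        try:
            value = float(token)
        except ValueError:
            bad.append(token)  # text such as "NA" cannot be parsed as a number
            continue
        if not math.isfinite(value):
            bad.append(token)  # "nan" and "inf" parse as floats but are unusable
    return bad


# Hypothetical tree with one NA and one NaN branch length
newick = "((a:0.1,b:NA)n1:0.3,(c:NaN,d:2e-3)n2:0.5)root;"
print(bad_branch_lengths(newick))
```

An empty list means every branch length parsed as a finite number under these assumptions.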
