Error in diversity (beta_phylogenetic)

uria · January 14, 2018, 10:58pm

Hey there,

I get a 'MissingNodeError' when trying to calculate beta diversity on a merged feature table.
I have several runs whose samples I want to compare, so I ran the upstream steps (up to dada2) separately for each run, and got a representative sequences qza and a feature table qza for each run.

I merged the rep-seqs qza files using the CLI:

qiime feature-table merge-seqs  \
    $(for rpsq in $(ls ./*/*rep-seqs.qza);do echo --i-data $rpsq ;done) \
    --o-merged-data merged_repseqs.qza

And successfully created merged_repseqs.qza

Afterwards, I created a rooted phylogenetic tree as described in the “Moving Pictures” tutorial. That is:

qiime alignment mafft --i-sequences merged_repseqs.qza --o-alignment aligned_merged_repseqs.qza --p-n-threads 50 --verbose;

qiime alignment mask --i-alignment aligned_merged_repseqs.qza --o-masked-alignment masked_aligned_merged_repseqs.qza --verbose; 

qiime phylogeny fasttree --i-alignment masked_aligned_merged_repseqs.qza --o-tree unrooted-tree.qza --p-n-threads 50 --verbose;

qiime phylogeny midpoint-root   --i-tree unrooted-tree.qza   --o-rooted-tree  rooted-tree.qza --verbose;

(Successfully finished without errors).

As for the feature tables (frequencies), since I want an easier way to query the data for relevant samples (and mostly because I'm way more comfortable working with python than with bash), I joined the feature tables using the Artifact API and pandas:

import pandas as pd
from qiime2 import Artifact
from qiime2.plugins import diversity

# tables_paths: paths to the per-run feature table qza files
tables_artifacts = (Artifact.load(p) for p in tables_paths)    # create generator of artifacts
tables_dataframes = (a.view(pd.DataFrame) for a in tables_artifacts)    # generate dataframe views out of artifacts
# evaluate and concatenate tables;
# fill NaNs with zeros, since skbio tends to throw annoying warnings otherwise
all_samples_dataframe = pd.concat(tables_dataframes).fillna(0)

some_samples_dataframe = all_samples_dataframe.loc[....]    # query whatever I need from the full table

# create a 'FeatureTable[Frequency]' artifact out of the table of interest
merged_table_art = Artifact.import_data('FeatureTable[Frequency]',
                                        some_samples_dataframe, view_type=pd.DataFrame)
bdv = diversity.methods.beta_phylogenetic(table=merged_table_art,
                                          metric="unweighted_unifrac",
                                          phylogeny=rooted_phylogeny)
# where rooted_phylogeny is a 'Phylogeny[Rooted]' artifact loaded from the merged_repseqs.qza file.

Now, to my understanding, every feature in my table should also be present in the rooted tree, but alas I get this error message:

---------------------------------------------------------------------------
MissingNodeError                          Traceback (most recent call last)
~/.conda/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_diversity/_beta/_method.py in beta_phylogenetic(table, phylogeny, metric, n_jobs)
     69             pairwise_func=sklearn.metrics.pairwise_distances,
---> 70             n_jobs=n_jobs
     71         ) 
~/.conda/envs/qiime2-2017.12/lib/python3.5/site-packages/skbio/diversity/_driver.py in beta_diversity(metric, counts, ids, validate, pairwise_func, **kwargs)
    347         metric, counts_by_node = _setup_multiple_unweighted_unifrac(
--> 348                 counts, otu_ids=otu_ids, tree=tree, validate=validate)
    349         counts = counts_by_node

~/.conda/envs/qiime2-2017.12/lib/python3.5/site-packages/skbio/diversity/beta/_unifrac.py in _setup_multiple_unweighted_unifrac(counts, otu_ids, tree, validate)
    484     counts_by_node, _, branch_lengths = \
--> 485         _setup_multiple_unifrac(counts, otu_ids, tree, validate)
    486 

~/.conda/envs/qiime2-2017.12/lib/python3.5/site-packages/skbio/diversity/beta/_unifrac.py in _setup_multiple_unifrac(counts, otu_ids, tree, validate)
    448     if validate:
--> 449         _validate_otu_ids_and_tree(counts[0], otu_ids, tree)
    450 

~/.conda/envs/qiime2-2017.12/lib/python3.5/site-packages/skbio/diversity/_util.py in _validate_otu_ids_and_tree(counts, otu_ids, tree)
    105                                (n_missing_tip_names,
--> 106                                 " ".join(missing_tip_names)))
    107 

MissingNodeError: All ``otu_ids`` must be present as tip names in ``tree``. ``otu_ids`` not corresponding to tip names (n=15002): e2cc357ffe57e5d5d20d4cc929a9803e 40570145f37809857b3fd113bedfe52a e0b19cf3a8136f7a6bb5e569a71030e5 f53a1bf1752fc1438d5f3211c9a269a0  ...)

I also tried to build the rooted phylogenetic tree without the masking step (to make sure all features are included in the tree), and got the same error.
What did help was removing features that didn't sum up to at least 20 across all of the samples (following the bottom line in the linked issue). But since I'm working with ~2000 samples, I lose roughly 20000 out of 40000 features that way, and I don't want to lose that many.

I want to understand whether I'm building the phylogenetic tree incorrectly (losing too much information somewhere down the road), or whether I should actually remove some features prior to beta phylogeny.

Thanks,
Uria

edit

I have found the qiime phylogeny filter-table... is this what I should use?

thermokarst · January 17, 2018, 3:11pm

Hi @uria!

There is a lot going on here, so I will try to unpack one step at a time.

First, while your approach to merging feature tables works, it has some drawbacks: computational overhead and loss of provenance. There is a method for merging feature tables, and I would recommend taking a look at that! Also, you have the ability to filter your feature table using feature-table filter-samples or feature-table filter-features, which, again, will allow you to retain provenance.

As for the actual error you reported - this means that there are features present in your feature table that aren't present in your phylogenetic tree. There appears to be an issue with some of the phrasing in that error (otu_id vs feature), so I have reported that.
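To see which IDs are involved, the check that the error describes can be reproduced as a plain set difference between the table's feature IDs and the tree's tip names. The following is only a minimal sketch on toy data: `find_missing_tips` is a hypothetical helper name, and the IDs are made up.

```python
def find_missing_tips(feature_ids, tip_names):
    """Return the feature IDs that have no matching tip name, sorted."""
    tips = set(tip_names)
    return sorted(fid for fid in set(feature_ids) if fid not in tips)


# Toy data standing in for real IDs (hypothetical values).
feature_ids = ["f1", "f2", "f3", "f4"]
tip_names = ["f1", "f3", "f5"]

missing = find_missing_tips(feature_ids, tip_names)
print(f"{len(missing)} of {len(set(feature_ids))} feature IDs are not tips: {missing}")
```

With real artifacts, the feature IDs would presumably come from the columns of `table.view(pd.DataFrame)` and the tip names from `node.name` for each node in `phylogeny.view(skbio.TreeNode).tips()`; treat those calls as assumptions to check against your installed release.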

As far as your approach of summing up features across samples, something doesn't seem quite right there. I would suggest taking a step back and doing the following:

1. Merge all feature tables using QIIME 2
2. Merge all sequences using QIIME 2
3. Create a phylogenetic tree using the merged sequences
4. Skip any filtering for now, just for the sake of rooting out any issues
5. Run beta-phylogenetic using your merged feature tables and your new phylogenetic tree.
If you still see issues, can you share the following summary visualizations?

feature-table summarize for your merged feature table.
feature-table tabulate-seqs for your merged sequences.

Thanks!

uria · January 17, 2018, 4:06pm

Hey @thermokarst, thanks for the kind reply.

What summing approach are you referring to in:

thermokarst:

As far as your approach of summing up features across samples, something doesn’t seem quite right there

Regarding the loss of provenance, I really love the idea of being able to monitor the process, but I really can't use it if it isn't available in the python API. If I had up to around 10 sequencing runs to work with, then the CLI/Artifact API would do. But I'm working with a far greater number of runs, and the nature of the samples varies a lot (also within a run), so if I want to do it within Qiime I'll have to create hundreds of metadata and feature table files, which is simply not feasible (I tried). So provided that feature-table merge is out of the question (unless there's a way to tell it exactly which samples to take from each run), how would you suggest I query and merge data while avoiding the loss of that (really) precious provenance?

Thanks,
Uria

thermokarst · January 17, 2018, 4:19pm

Hi @uria!

This one:

I might be misunderstanding, but that approach seems suspect to me. Please see my questions at the end of this reply.

I'm not sure I follow - every plugin method and visualization is available across all interfaces (with a few exceptions in QIIME 2 Studio, because that is a prototype). In fact, the links I sent earlier have tabs that show the Artifact API usage!

Is there something else preventing you from being able to use this method?

I am also a bit lost here - why do you need to create hundreds of metadata files? Sorry if I am just overlooking something!

You can control the overlap method when using feature-table merge, check out the docs for the options for handling overlapping IDs. Once you merge the tables, you could filter out any samples you don't want using feature-table filter-samples. Alternatively, you could filter each table prior to merging, which would be pretty easy to do programmatically using the Artifact API.
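To make the "filter each table, then merge" idea concrete, here is a rough sketch in plain Python, with each table modelled as a dict of sample ID to feature counts. It only illustrates the logic and is not how the plugin is implemented: `filter_samples`, `merge_tables`, and the `overlap` values are hypothetical names, and the run data is made up. In QIIME 2 itself the corresponding actions are feature-table filter-samples and feature-table merge.

```python
def filter_samples(table, wanted):
    """Keep only the samples whose IDs are in `wanted`."""
    return {sid: dict(counts) for sid, counts in table.items() if sid in wanted}


def merge_tables(tables, overlap="error"):
    """Merge tables of the form {sample_id: {feature_id: count}}.

    overlap="error": raise if a sample ID appears in more than one table.
    overlap="sum":   add up the counts of samples that share an ID.
    """
    if overlap not in ("error", "sum"):
        raise ValueError(f"unknown overlap mode: {overlap!r}")
    merged = {}
    for table in tables:
        for sid, counts in table.items():
            if sid in merged and overlap == "error":
                raise ValueError(f"sample {sid!r} appears in more than one table")
            target = merged.setdefault(sid, {})
            for fid, n in counts.items():
                target[fid] = target.get(fid, 0) + n
    return merged


# Two toy "runs" (hypothetical data) and the samples we want to keep.
run1 = {"s1": {"f1": 5, "f2": 1}, "s2": {"f1": 2}}
run2 = {"s3": {"f2": 7, "f3": 4}, "s4": {"f3": 9}}
wanted = {"s1", "s3"}

merged = merge_tables([filter_samples(t, wanted) for t in (run1, run2)])
features = sorted({fid for counts in merged.values() for fid in counts})
print(sorted(merged), features)
```

Filtering before merging keeps each intermediate table small, and the set of features left in the merged table is exactly what the tree needs to cover.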

Ultimately, what exactly are you trying to do? Do you have multiple studies pooled across multiple runs? What are the samples you are trying to remove - are they only found in single runs, or are there replicates across multiple runs? Once we have a better idea of what you are working with, we can provide more concrete steps on how to move forward.

Thanks!
