Mismatched feature IDs

giant_virus · November 21, 2018, 2:53am

Dear qiime2 community!

I still have not figured out how to solve the problem of tree tips that do not match the feature IDs in my sequence table. This was also discussed in an earlier thread.

The user with the same problem did not post the solution.

The error I get is the same as in that post:

/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.2/lib/python3.5/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to bool by check_pairwise_arrays.
warnings.warn(msg, DataConversionWarning)
Traceback (most recent call last):
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.2/lib/python3.5/site-packages/q2_diversity/_alpha/_method.py", line 46, in alpha_phylogenetic
tree=phylogeny)
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/diversity/_driver.py", line 170, in alpha_diversity
counts, otu_ids, tree, validate, single_sample=False)
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/diversity/alpha/_faith_pd.py", line 136, in _setup_faith_pd
_validate_otu_ids_and_tree(counts[0], otu_ids, tree)
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.2/lib/python3.5/site-packages/skbio/diversity/_util.py", line 106, in _validate_otu_ids_and_tree
" ".join(missing_tip_names)))
skbio.tree._exception.MissingNodeError: All otu_ids must be present as tip names in tree. otu_ids not corresponding to tip names (n=23324): 4b0f96635e87ecb3e7903c0b4ab0bfb2abe2856a 5a299a483a1212f879ee358b2ec00495c256ef1f [omitting feature_ids]
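For readers unfamiliar with this error: it amounts to a name comparison between the feature IDs in the table and the tip names in the tree. The following is a minimal Python sketch of that idea, not scikit-bio's actual implementation. The helper `missing_tip_names` and the example tip names are assumptions made for illustration; only the two feature IDs come from the error message above.

```python
def missing_tip_names(otu_ids, tip_names):
    """Return the feature IDs that have no identically named tip in the tree."""
    tips = set(tip_names)
    return [otu_id for otu_id in otu_ids if otu_id not in tips]


# Feature IDs as reported in the error message above.
otu_ids = [
    "4b0f96635e87ecb3e7903c0b4ab0bfb2abe2856a",
    "5a299a483a1212f879ee358b2ec00495c256ef1f",
]

# Hypothetical tip names: the first carries extra text, the second is clean.
tip_names = [
    "4b0f96635e87ecb3e7903c0b4ab0bfb2abe2856a some-extra-text",
    "5a299a483a1212f879ee358b2ec00495c256ef1f",
]

# Any difference in the name, even trailing text, counts as a missing tip.
missing = missing_tip_names(otu_ids, tip_names)
print(missing)
```

The comparison is exact, so a tip whose name merely starts with the feature ID still counts as missing.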

After this command, the feature IDs in the table and in the sequences still have the same names:

qiime vsearch dereplicate-sequences \
  --i-sequences OUT_DIR/out4-qual-filter.qza \
  --o-dereplicated-table OUT_DIR/out5-derep-table.qza \
  --o-dereplicated-sequences OUT_DIR/out5-derep-sequences.qza

[I checked by exporting the files and manually checking some of the feature_ids.]

Hence, somewhere during these commands the feature IDs get changed:

qiime alignment mafft \
  --i-sequences out5-derep-sequences.qza \
  --o-alignment out5-derep-sequences_alignment_mafft.qza \
  --p-n-threads 8
qiime alignment mask \
  --i-alignment out5-derep-sequences_alignment_mafft.qza \
  --o-masked-alignment out5-derep-sequences_alignment_masked.qza
qiime phylogeny fasttree \
  --i-alignment out5-derep-sequences_alignment_masked.qza \
  --o-tree out5-derep-sequences_tree.qza
qiime phylogeny midpoint-root \
  --i-tree out5-derep-sequences_tree.qza \
  --o-rooted-tree out5-derep-sequences_rooted_tree.qza
qiime tools export \
  out5-derep-sequences_tree.qza \
  --output-dir ./

I played around with my data and used the tools feature-table summarize (the third tab, ‘Feature Detail’) and feature-table tabulate-seqs to visualize the names of my feature_ids. I then exported my tree and viewed it in MEGA. There I found why the tips do not match.

The tips of the tree are all changed the same way:
e0330553235196aa25c184cd0b1a1f8284706dd3
becomes
e0330553235196aa25c184cd0b1a1f8284706dd3 UU3micro-18S-12_S14_L001_132201

UU3micro-18S-12_S14_L001 is the name of one of the FASTA files I used, and _132201 is appended as well. It is not the number of reads.
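To make the pattern concrete, here is a small Python sketch that splits such a tip label into the original feature ID and the appended text. The helper `split_tip_label` is hypothetical, and the sketch assumes the extra text is separated from the hash by whitespace, as in the example above.

```python
def split_tip_label(label):
    """Split a tip label into (feature_id, extra_text).

    Assumes the feature ID is the first whitespace-delimited word and
    everything after it is appended text.
    """
    parts = label.strip().split(None, 1)
    if not parts:
        return "", ""
    feature_id = parts[0]
    extra = parts[1] if len(parts) > 1 else ""
    return feature_id, extra


label = "e0330553235196aa25c184cd0b1a1f8284706dd3 UU3micro-18S-12_S14_L001_132201"
feature_id, extra = split_tip_label(label)

# Separate the file-name part from the trailing number.
sample_name, _, trailing_number = extra.rpartition("_")

print(feature_id)       # the 40-character hash alone
print(sample_name)      # the FASTA file name
print(trailing_number)  # the appended number
```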

The problem is that I do not know where the FASTA filename gets added, or why. When I used mafft and fasttree outside of the pipeline, they never added names to the sequences. I think this is the only reason that I cannot get the qiime diversity core-metrics-phylogenetic command to run.

I also tried the command recommended in the moving pictures tutorial:

qiime phylogeny align-to-tree-mafft-fasttree

but I get this error (I should update my qiime version...):

Error: QIIME 2 plugin 'phylogeny' has no action 'align-to-tree-mafft-fasttree'.

So thanks for your help so far! I am struggling to export mafft data so until I figure this out I cannot know where it is changed and why. There is also the possibility that the exporting changes the tip names and the error lies somewhere else.

I will also post the solution if I can find it and update my QIIME 2 version (it's from February)!

Flo

thermokarst · November 21, 2018, 4:56am

Hey there @giant_virus!

This is bizarre. There is nothing in any of the QIIME 2 plugins that I know of that would do this to your Feature IDs.

That makes 2 of us!

You need QIIME 2 2018.8 or later for the align-to-tree-mafft-fasttree pipeline.

We can't really provide support for old versions of QIIME 2. Any chance you can upgrade and try again? If not, can you please send me a DM with the files you are working with? I would like to take a closer look.

Thanks!

giant_virus · November 21, 2018, 2:02pm

First of all, sorry for posting in the old thread, I was not sure if I should create a new one.

I contacted the technical staff of our computer system; they will update qiime2, and I will let you know if that fixes the problem!

Again thanks for the quick reply!

thermokarst · November 21, 2018, 2:21pm

No need to apologize at all; it is super simple for us to split topics up. The reason we do that is to try and make things easier to search for on the forum.

giant_virus · November 22, 2018, 9:33am

Good evening!
I just finished updating my pipeline to the November release of qiime2! To be sure I repeated every step of the analysis to completely flush out all the files that were generated using the February release.

My pipeline looks like this:

qiime tools import
qiime cutadapt trim-paired
qiime vsearch join-pairs
qiime quality-filter q-score-joined
qiime vsearch dereplicate-sequences
importing silva as reference...
qiime vsearch uchime-ref
qiime feature-classifier classify-consensus-vsearch

Then I used:

qiime feature-table filter-features and qiime taxa barplot
I checked whether the annotation worked, and it did very well! However, I do not understand the filtering well. I am still trying to understand it using the docs and this answer.
qiime feature-table summarize and qiime feature-table tabulate-seqs
I have nice results; this worked like a charm.
qiime phylogeny align-to-tree-mafft-fasttree
I get my rooted and unrooted trees, and also the alignment and masked alignment.

However when I call:

qiime diversity core-metrics-phylogenetic \
  --i-table out5-derep-filtered_table.qza \
  --i-phylogeny out5-derep-seq_rooted-tree.qza \
  --p-sampling-depth 50000 \
  --m-metadata-file metadata_sheet.tsv \
  --p-n-jobs 16 \
  --output-dir core-metrics-phylogenetic-output

(I checked the sampling depth of 50000 against the exported table.)

Now I get the same error as earlier. I exported the rooted tree (qiime tools export) and again the feature IDs had the filenames appended. So at some point I must have made a mistake.

I will try to figure it out on my own, but I am thankful for every hint I can get! The update did not fix the problem.

Thank you!

P.S.: Here is the error log:

/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.11/lib/python3.5/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype float64 was converted to bool by check_pairwise_arrays.
warnings.warn(msg, DataConversionWarning)
Traceback (most recent call last):
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_diversity/_alpha/_method.py", line 46, in alpha_phylogenetic
tree=phylogeny)
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.11/lib/python3.5/site-packages/skbio/diversity/_driver.py", line 170, in alpha_diversity
counts, otu_ids, tree, validate, single_sample=False)
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.11/lib/python3.5/site-packages/skbio/diversity/alpha/_faith_pd.py", line 136, in _setup_faith_pd
_validate_otu_ids_and_tree(counts[0], otu_ids, tree)
File "/usr/appli/freeware/miniconda/3.6/envs/qiime2-2018.11/lib/python3.5/site-packages/skbio/diversity/_util.py", line 104, in _validate_otu_ids_and_tree
" ".join(missing_tip_names)))
skbio.tree._exception.MissingNodeError: All otu_ids must be present as tip names in tree. otu_ids not corresponding to tip names (n=20624): c4446b4a7ecb9ecca5747fe65dd9c941f21c4a37 [and another 20,000 features]

thermokarst · November 27, 2018, 3:48am

Can you please share out5-derep-filtered_table.qza and out5-derep-seq_rooted-tree.qza?

This would really help lock down exactly what is going on here by letting us just look at the decentralized provenance. If you don't want to share here feel free to send links to me in a direct message. Thanks!

thermokarst · November 28, 2018, 4:26am

Thanks for sending along these data, that certainly helps! Provenance looks fine from this end. I am suspicious of vsearch though. Can you run the feature-table tabulate-seqs command on the FeatureData[Sequence] artifact returned from vsearch dereplicate-sequences? What do the Feature IDs look like there?

giant_virus · December 4, 2018, 3:07am

Hello and sorry for the long wait,
I think the feature IDs look okay.
I used this command to generate the visualization below.

qiime feature-table tabulate-seqs --i-data out5-derep-sequences.qza --o-visualization out5-derep-sequences-vis.qzv

Here is a screenshot of the visualization:

out5-derep-sequences.qza was generated using vsearch's dereplicate-sequences.
I hope this helps, thank you for your help!

thermokarst · December 4, 2018, 5:43pm

No worries, thanks for following up!

I agree, they look perfectly normal.

Hmmm, I am out of ideas. I have been unable to reproduce the weird feature ID issue you are showing. For example, just look at the Moving Pictures tree, it is totally fine!

Is it possible that you have a different version of mafft or fasttree in your PATH somewhere? One way to check is to start a new terminal with no conda env active, then run:

env

Copy-and-paste the "before" results of that command.

Then, activate your QIIME 2 environment, and rerun:

env

Copy-and-paste the "after" results, then we can compare them.
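If comparing the two dumps by eye is tedious, a short Python sketch like the one below could highlight what activating the environment changed. The helper names and the example values are hypothetical, and the sketch assumes each variable sits on a single KEY=VALUE line with a colon-separated PATH.

```python
def parse_env(text):
    """Parse the output of `env` into a dict, one KEY=VALUE pair per line."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that do not contain '='
            env[key] = value
    return env


def path_entries(env):
    """Return the non-empty PATH entries in order."""
    return [p for p in env.get("PATH", "").split(":") if p]


def added_path_entries(before, after):
    """Return PATH entries present after activation but not before."""
    old = set(path_entries(before))
    return [p for p in path_entries(after) if p not in old]


# Hypothetical example dumps; paste the real "before" and "after" output here.
before_text = "HOME=/home/flo\nPATH=/usr/local/bin:/usr/bin"
after_text = (
    "HOME=/home/flo\n"
    "PATH=/opt/conda/envs/qiime2/bin:/usr/local/bin:/usr/bin\n"
    "CONDA_DEFAULT_ENV=qiime2"
)

before_env = parse_env(before_text)
after_env = parse_env(after_text)

added_paths = added_path_entries(before_env, after_env)
new_variables = sorted(set(after_env) - set(before_env))
print(added_paths)
print(new_variables)
```

A mafft or fasttree binary living in a directory that precedes the environment's own bin directory on PATH would be the thing to look for.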

Thanks!

thermokarst · December 4, 2018, 6:05pm

Oh hey @giant_virus, can you also send the QZV for the screenshot above? out5-derep-sequences-vis.qzv

thermokarst · December 5, 2018, 4:35am

Thanks for sending the viz in a DM! Okay, guess what? Even though the viz looks okay, if I pull out the FASTA file from within:

>3cafeb270ab9f7183bbc2e7c24b7cc1ffb2f196c UU3micro-18S-12_S14_L001_404900
CCTGAAAGCCGGTAATGACTTTCTCGCGTCAAACCGCGAAAAGCCAGGCGTGACCGAACTCCTCAGCGGACTCCAGTACGAAGTGATCCACATGGGTGACGGAGCCAAACCCTGGCCCACCAGCAAAGTGACCTGCCATTACCAT
>bbfdbd5a45738494ef2a3fc5f95a878bfb9f8475 UU3micro-18S-12_S14_L001_404902
ATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGACTGGTTCATCGAGCTTACAACCTTGACCGAGAGGTCTGGGTAATCTTTTTAAAGCCAGTCGTGATGGGGATAGATTATTGCAATTTTTAATCTTCAACGAGGAATTCCTAGTAGACGCAGGTCATCAACCTGCATCGATTACGTCCCTGCCCTTTGTACACGCCGCCCGTCGCTACTACCGATTGAATGGCTTAGTGAGCCCTCTGGACTGGTGCACGGCGTTGGAAACTTCGCCGCGCGTTCAGGAAGGAGGTCAAACTTGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCC
>bc69f76bfcb0deac1cb364f08edd54725d7d47d1 UU3micro-18S-12_S14_L001_404918
ATAACAGGTCTGTGATGCCCTTAGATGTTTTGGGCTGCACGCGTGCTACACTGGTTTAATTAACGAGCTGCTGGTCTTGTTTGAAAGCGTGGGGTAAACTTTAATGTAAATCGTGATTGGGGTGGATTGTTGCAATTATTGATCTTGAACGAGGAATTCCTAGTACGCCGAAGTCATCAGCTTGGGCTGACTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGAGTAATCCGGTGAAATGCTTGGCTTGGCACAGTGGTCATAAATGAGTGTTGTGCAACAAGTGCTTTGAACCTTGTTACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCT

Ah hah! vsearch, you scoundrel! Okay, so the main issue appears to be that dereplicate-sequences is modifying the feature IDs. Secondly, the tabulate-seqs viz should show the entire feature ID, not just the first word found in it.

Okay - workarounds...

You could cluster de novo at 100%. This will keep your Features more or less the same. For example, I ran this as a check on the dereplicated outputs for the Moving Pictures tutorial dataset: I started with 229143 features and had 229137 after clustering.

If that is not acceptable for you, your other options are to choose a different pipeline (e.g. DADA2) or to clean up your feature IDs in some kind of external script or tool.
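As a rough idea of what such an external clean-up script might look like, here is a minimal Python sketch that keeps only the first word of each FASTA header. The function `strip_fasta_descriptions` is hypothetical, the sequences in the example are shortened placeholders, and the sketch assumes the FASTA has been exported from the artifact and will be re-imported into QIIME 2 afterwards.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with each header reduced to its first word."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            # Keep only the ID; an empty header stays a bare ">".
            yield ">" + (words[0] if words else "")
        else:
            yield line


# Headers taken from the FASTA excerpt above; sequences shortened here.
raw = [
    ">3cafeb270ab9f7183bbc2e7c24b7cc1ffb2f196c UU3micro-18S-12_S14_L001_404900",
    "CCTGAAAGCCGG",
    ">bbfdbd5a45738494ef2a3fc5f95a878bfb9f8475 UU3micro-18S-12_S14_L001_404902",
    "ATAACAGGTCTG",
]

cleaned = list(strip_fasta_descriptions(raw))
for line in cleaned:
    print(line)
```

For a real file, the same generator can be fed an open file handle and its output written line by line to a new FASTA file. The sequence lines pass through unchanged.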

Thanks for working with us on this one!

giant_virus · December 5, 2018, 6:31am

I just went through the moving pictures tutorial and the tree does not show appended feature id names.
So I actually started doing exactly what you recommended! Thanks a lot man, I will just not use vsearch and continue!

Thanks for all the time you put into helping me! I hope the Qiime2 team can also benefit from this Odyssey!

edit: I could have pulled the fasta file too! I was not paying enough attention, next time I will try to not rely on the implemented visualization (especially when a mod tells me that the behavior of the software is weird!). Thank you again!
