Hi
I have been using QIIME 1 for a while now and I’m finally switching to QIIME 2 (v2019.1). I was trying to compare Q1 and Q2 OTUs with ASVs, but I’m having some problems when running SEPP to build phylogenies from OTUs. I used QIIME 2 from scratch to create the rep_set qza with the open-reference pipeline. The thing is, SEPP carries out a checking step in which it decides I have repeated tips and aborts the phylogeny construction.
I’ve looked through the forums and it seemed to me the cause could be that some of my rep seqs have the same name (a number, actually) as sequences in the reference I used (Greengenes 2013_08), so I altered the name of each of them (added a set of characters as a prefix) and SEPP ran without errors (sorry, I did not save the original error it prints).
Now, when working with the core-metrics pipeline, I tried importing a tree from Q1 and had a similar problem. It prints the following error traceback:
  File "/home/rodrigo/.conda/envs/qiime2-2019.1/lib/python3.6/site-packages/skbio/diversity/_util.py", line 93, in _validate_otu_ids_and_tree
    raise DuplicateNodeError("All tip names must be unique.")
skbio.tree._exception.DuplicateNodeError: All tip names must be unique.
I checked my files and the names are in fact unique, both in the table and in the trees. I then changed the names in my OTU tables to match those in my SEPP Newick tree and it worked again.
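For anyone who wants to repeat that uniqueness check outside QIIME 2, here is a minimal sketch in plain Python. It assumes a simple Newick string with unquoted labels, and `newick_tip_names` and `duplicated` are hypothetical helpers written for illustration, not part of QIIME 2 or scikit-bio. The tree and its IDs are made up.

```python
import re
from collections import Counter


def newick_tip_names(newick):
    """Return tip labels from a simple Newick string.

    Assumption: labels are unquoted and contain no parentheses, commas,
    colons, semicolons or whitespace. Under that assumption a tip label
    is a token that directly follows "(" or ",".
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def duplicated(names):
    """Return the names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up tree in which the tip "101" appears twice.
tree = "((101:0.1,102:0.2)n1:0.3,(101:0.1,xxxxx7:0.2)n2:0.1);"
tips = newick_tip_names(tree)
print(duplicated(tips))
```

The same `duplicated` helper can be pointed at the list of feature IDs from the table, so both inputs are checked the same way.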
There seems to be some odd behaviour in the _validate_otu_ids_and_tree check that does not allow the analysis to continue. I’ve already figured out how to get around this, but could you look at it so it can get fixed in later versions, please?
Thanks so much for the report. A name match between the rep-seqs and the reference seems like a plausible cause to me, as that would certainly be a rare thing to see. The errors that followed all make sense considering the changes you made. (To clarify, you were able to get things working eventually, after a few cycles of renames, right? If not, we can help!)
Would you by chance be able to provide a little bit of data which replicates this? Only a few reads should suffice I think (i.e. just the ones you had to rename). The md5sum of your reference data would also be good, that way I can be sure I’m testing against the same one. Which OTUs did you use from Greengenes? 97% or 99%?
Sorry for the delay, I had to ask our collaborators first about sharing the data. Just to recap and explain a little more: when trying to compare OTUs and ASVs, I found a problem with the former. For these, I carried out open-reference clustering against the Greengenes clusters at 97% identity. The resulting rep-set failed at the phylogenetic reconstruction with SEPP due to names being the same as in the reference. I think the reference-based cluster names were at fault, as changing these (I added an "xxxxx" prefix to each header) solved the problem. Now I'm sending three files that are subsets of our actual data:
1000 representative sequences without the naming issue:
b6b4398233ec68a9ea584c0eb1716d5b accepted_headers.qza
Link missing: the forum didn't allow me to post more than 2 links because this account is new
1000 sequences that trigger the error:
8e0dfa7d4705cb7ae56ea80869c44ff0 problematic_headers.qza (77.2 KB)
The same bad-name sequences after fixing the headers (no error now):
438c6de1fc83e188fb8ac94e817bcb48 fixed_problematic_headers.qza (77.3 KB)
I managed to build the tree after fixing the problematic headers but then I had to change our contingency table to reflect those changes in the tree in order to use the tree during beta div analyses. It all worked as expected afterwards.
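The renaming described above can be sketched roughly as follows. This is only an illustration, assuming the representative sequences and the table have already been exported to plain FASTA and tab-separated text. `prefix_fasta_headers` and `prefix_table_ids` are hypothetical helper names, and the IDs and counts are made up.

```python
def prefix_fasta_headers(lines, prefix):
    """Yield FASTA lines with `prefix` prepended to every record ID."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


def prefix_table_ids(lines, prefix):
    """Yield table lines with `prefix` prepended to the first column.

    Assumption: the feature ID is the first tab-separated field, and
    header or comment lines start with "#" and are left unchanged.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
        else:
            yield prefix + line


# Made-up example data; the same prefix is applied to both files so the
# table IDs keep matching the tips of the tree built from the sequences.
fasta = [">101\n", "ACGT\n", ">102 some description\n", "GGCC\n"]
table = ["#OTU ID\tS1\tS2\n", "101\t5\t0\n", "102\t1\t7\n"]

new_fasta = list(prefix_fasta_headers(fasta, "xxxxx"))
new_table = list(prefix_table_ids(table, "xxxxx"))
print("".join(new_fasta))
print("".join(new_table))
```

Applying one function pair with one prefix keeps the two files in sync, which is the point of the manual fix: the tree is built from the renamed sequences, so the table has to carry the same renamed IDs.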
From what I've seen, the problem seems to be that the reference-based clustering step of the open-reference algorithm assigns the exact names used in the database (in this case, the Greengenes 97% clusters). I think this triggers the error because the reference tree for SEPP has tips with matching labels.
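One way to test this hypothesis before running SEPP is to look for IDs shared between the representative sequences and the reference tree. The sketch below is plain Python under stated assumptions: the rep-seqs are available as FASTA lines, the reference tip names are already in a Python collection, and `fasta_ids` and `colliding_ids` are hypothetical helpers. All IDs shown are invented.

```python
def fasta_ids(lines):
    """Return record IDs: the text after ">" up to the first whitespace."""
    ids = []
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def colliding_ids(rep_seq_ids, reference_tip_names):
    """Return IDs present in both collections, sorted."""
    return sorted(set(rep_seq_ids) & set(reference_tip_names))


# Made-up data: "101" was named after a reference cluster, so it also
# exists as a tip in the reference tree; "denovo5" does not.
rep_seqs = [">101\n", "ACGT\n", ">denovo5\n", "GGCC\n"]
reference_tips = {"101", "202", "303"}

shared = colliding_ids(fasta_ids(rep_seqs), reference_tips)
print(shared)
```

If the list is non-empty, inserting those sequences into the reference tree would produce two tips with the same label, which is consistent with the duplicate-tip complaint.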
Anyway, I hope the files are useful for a patch in later qiime2 versions.
Thanks!
Rodrigo
This is the missing file from my last post (It won't allow me to put more than 2 links per post):
1000 representative sequences without the naming issue:
b6b4398233ec68a9ea584c0eb1716d5b accepted_headers.qza (108.0 KB)
Thanks so much @rodrigogarlop! I’ll add this to my todo list, and hopefully fix it for the upcoming release (if there’s time). I’ll leave this assigned to myself for now.