I have some sputum samples that we sent for sequencing (all patients with COPD). So, I was able to import my data and proceeded with sequence quality control and feature table construction. I used DADA2 with no problems. As I’m still learning how to use Qiime2, I also tried with Deblur. For the first step, I did not have any problems. For the second step, using Deblur, I received an error message saying that I had duplicate Sample IDs. Again, I had no problem using DADA2.
I do not know if this matters, but all my samples start with COPD… Additionally, I checked all my sample IDs, and obviously there are no duplicated IDs.
Thank you for the report. What I believe is going on is that Deblur is treating _ as a special denotation character, as was done in QIIME1 for demultiplexed output, which Deblur was originally designed to process. In brief, the demultiplexed output in QIIME1 required that sequence identifiers conform to <SAMPLEID>_<INTEGER>, and the sample ID associated with a sequence was determined by splitting the sequence identifier into its sample ID and unique integer components. My guess right now is that Deblur is not handling the per-sample files provided by QIIME2 correctly in this case, and is still assuming _ is a reserved character.
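To make the failure mode concrete, here is a minimal sketch of how splitting on _ can collapse distinct samples into one. This is an illustration only, not Deblur's or q2-deblur's actual code: the helper names and the example identifiers (COPD_01_0 and so on) are hypothetical, and I am not claiming which splitting rule Deblur applies.

```python
from collections import Counter


def sample_id_first_split(seq_id: str) -> str:
    """Treat everything before the FIRST underscore as the sample ID."""
    return seq_id.split("_", 1)[0]


def sample_id_last_split(seq_id: str) -> str:
    """Treat everything before the LAST underscore as the sample ID."""
    return seq_id.rsplit("_", 1)[0]


def duplicates(ids):
    """Return the IDs that occur more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical sequence identifiers of the form <SAMPLEID>_<INTEGER>,
# where the sample IDs themselves contain an underscore.
seq_ids = ["COPD_01_0", "COPD_02_0", "COPD_03_0"]

first = [sample_id_first_split(s) for s in seq_ids]
last = [sample_id_last_split(s) for s in seq_ids]

print(first, duplicates(first))  # every sample collapses to the same prefix
print(last, duplicates(last))    # sample IDs survive intact
```

If the sample ID is taken to be everything before the first underscore, three different samples all become "COPD", which is the kind of collision that would surface as a duplicate sample ID error.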
This raises two issues. First, Deblur or q2-deblur should possibly be insensitive to this, and I’ve opened an issue to explore it. I say "possibly" because _ was previously not allowed as a character in a sample identifier in QIIME, and I believe this restriction is still enforced in Qiita. Second, I’m wondering whether this reserved character should persist in QIIME2, given that at least one, and probably many, tools in the QIIME ecosystem make this assumption.
Traceback (most recent call last):
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
    results = action(**arguments)
  File "", line 2, in denoise_16S
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
    output_types, provenance)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 334, in callable_executor
    output_views = callable(**view_args)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 96, in denoise_16S
    hashed_feature_ids=hashed_feature_ids)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 177, in _denoise_helper
    table = _load_table(tmp)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 59, in _load_table
    table.update_ids(sid_map, axis='sample', inplace=True)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/biom_format-2.1.5-py3.5-linux-x86_64.egg/biom/table.py", line 1069, in update_ids
    errcheck(result)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/biom_format-2.1.5-py3.5-linux-x86_64.egg/biom/err.py", line 472, in errcheck
    raise ret
biom.exception.TableException: Duplicate sample IDs!
Sure, I'm happy to discuss on the dev call this week!
Just a quick note: @thermokarst created an issue on q2-deblur to track progress on this (e.g. in case the fix doesn't make it into deblur but instead happens in q2-deblur).
The QIIME 1 demux format spec doesn't mention underscore being a reserved character, or if sample IDs can/cannot contain underscores. I realize in QIIME 1 the sample IDs weren't generally supposed to contain underscores, but in QIIME 2 we already support sample IDs with underscores (in Metadata and a variety of file formats). Unfortunately I don't think we can generally make underscore a reserved character because there are already .qza files with these kinds of IDs in the wild (we see users on the forum regularly using sample IDs with underscores).
This issue you raised is actually a more general consideration when writing QIIME 2 plugins. Each underlying tool (e.g. mafft, FastTree, deblur, dada2, ...) has different requirements and assumptions about the data being passed to it. The way to go (for now at least) is to have the plugin munge the data into an acceptable format for the underlying tool, and then munge the output (if necessary) to hand back to QIIME 2.
A concrete example of this is qiime alignment mafft. The underlying MAFFT program only works with FASTA header IDs that are shorter than ~250 characters, otherwise the IDs are truncated in the output alignment. To deal with this, q2-alignment re-maps the output IDs back to the originals after MAFFT executes. We have to support long header IDs for the case when qiime dada2 denoise-single/paired is run with no-hashed-feature-ids, in which case the header ID is the sequence itself.
Another example is the PHYLIP suite of programs, which have even stricter requirements about ID length. When porting the Alignment object from PyCogent into scikit-bio, we retained a method that re-mapped IDs to ascending integers so that data could be passed to PHYLIP. I'm not sure if QIIME 1 used this approach but there seems to be some sort of historical precedent to re-map IDs when necessary.
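The re-mapping idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the q2-alignment or scikit-bio implementation: `remap_ids`, `run_with_remapped_ids`, and `truncating_tool` are hypothetical names, and `truncating_tool` is only a stand-in for an external program that silently cuts IDs to 10 characters.

```python
def remap_ids(ids):
    """Build maps between the original IDs and short ascending-integer IDs."""
    if len(set(ids)) != len(ids):
        raise ValueError("original IDs must be unique")
    forward = {orig: str(i) for i, orig in enumerate(ids, start=1)}
    reverse = {short: orig for orig, short in forward.items()}
    return forward, reverse


def run_with_remapped_ids(records, tool):
    """Run `tool` on records keyed by safe IDs, then restore the original IDs.

    `records` maps ID -> sequence; `tool` takes and returns such a dict.
    """
    forward, reverse = remap_ids(list(records))
    safe_in = {forward[k]: v for k, v in records.items()}
    safe_out = tool(safe_in)
    return {reverse[k]: v for k, v in safe_out.items()}


def truncating_tool(records):
    """Stand-in for an external tool that truncates IDs to 10 characters."""
    return {k[:10]: v.upper() for k, v in records.items()}


records = {
    "a_very_long_sample_identifier_1": "acgt",
    "a_very_long_sample_identifier_2": "ttga",
}

print(truncating_tool(records))                         # IDs collide after truncation
print(run_with_remapped_ids(records, truncating_tool))  # original IDs restored
```

Called directly, the stand-in tool merges both records under one truncated ID. Wrapped in the re-mapping step, it only ever sees "1" and "2", and the caller gets the original IDs back.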
MIENS-compliant sample IDs aren't guaranteed to work with all tools, so the general problem I described above still stands. See my response above for two examples of this -- both MAFFT and PHYLIP don't work with longer IDs, even if they are MIENS-compliant. The issue is especially apparent with PHYLIP: IDs can only be up to 10 characters in length, so clearly a MIENS-compliant sample ID could still be incompatible with a plugin running PHYLIP. Thus, plugins need to be responsible for re-mapping IDs -- there isn't an ID naming standard that will satisfy the requirements of all bioinformatics tools that are wrapped by QIIME 2.
@fstudart, apologies that the conversation got a bit sidetracked. To bring it back to the issue you’re having:
Since q2-deblur doesn’t support sample IDs with underscores, you’ll need to rename your sample IDs so that they don’t contain underscores. Unfortunately this means renaming the sample IDs in your metadata file and also the filenames of your raw fastq sequence data (or your fastq manifest file, depending on how you imported your data). Basically this means starting your analysis over. If you don’t wish to do this, you could proceed with q2-dada2 or use an external program to create a feature table, which can then be imported into QIIME 2.
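If you do decide to rename, a small script can make the change consistently and guard against accidentally creating real duplicates. The sketch below is only an illustration: it assumes a comma-separated manifest with a `sample-id` column, the sample IDs and file paths are made up, and replacing _ with a period is my assumption (please check that the replacement character is acceptable to the tools you plan to run). `rename_ids` and `rewrite_id_column` are hypothetical helpers, not part of QIIME 2.

```python
import csv
import io


def rename_ids(ids, replacement="."):
    """Map each ID to a copy with underscores replaced; fail on collisions."""
    mapping = {i: i.replace("_", replacement) for i in ids}
    new_ids = list(mapping.values())
    if len(set(new_ids)) != len(new_ids):
        raise ValueError("renaming would create duplicate sample IDs")
    return mapping


def rewrite_id_column(csv_text, column, mapping):
    """Return the CSV text with the values in `column` replaced via `mapping`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = mapping[row[column]]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical manifest contents; only the sample-id column is changed here.
manifest = (
    "sample-id,absolute-filepath,direction\n"
    "COPD_01,/data/COPD_01.fastq.gz,forward\n"
    "COPD_02,/data/COPD_02.fastq.gz,forward\n"
)

mapping = rename_ids(["COPD_01", "COPD_02"])
print(rewrite_id_column(manifest, "sample-id", mapping))
```

The same `mapping` can be applied to the sample ID column of your metadata file so the two stay in sync. If your sample IDs were derived from the fastq filenames at import time, the files themselves would need renaming as well.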