Deblur problem - Duplicate Sample IDs error

Hi everyone,

I have some sputum samples that we sent for sequencing (all patients with COPD). So, I was able to import my data and proceeded with sequence quality control and feature table construction. I used DADA2 with no problems. As I’m still learning how to use Qiime2, I also tried with Deblur. For the first step, I did not have any problems. For the second step, using Deblur, I received an error message saying that I had duplicate Sample IDs. Again, I had no problem using DADA2.

I do not know if this matters, but all my samples start with COPD… Additionally, I checked all my sample IDs, and obviously there are no duplicated IDs.

COPD_092A
COPD_111C
COPD_114A
COPD_120A
COPD_127A
COPD_134A
COPD_15_2D

Any thoughts about this?

Thanks very much,
FS


Hi @fstudart,

Thank you for the report. What I believe is going on is that Deblur is treating _ as a special delimiter character, as was done in QIIME1 for demultiplexed output, which Deblur was originally designed to process. In brief, the demultiplexed output in QIIME1 required that sequence identifiers conform to <SAMPLEID>_<INTEGER>, and the sample ID associated with a sequence was determined by splitting the sequence identifier into its sample ID and unique integer components. My guess right now is that Deblur is not handling the per-sample files provided by QIIME2 correctly in this case, and is still assuming _ is a reserved character.
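To illustrate the guess above, here is a minimal sketch of how this kind of parsing could collapse the sample IDs from the original post. It is not taken from the Deblur or q2-deblur source; the function names are hypothetical, and the "keep everything before the first underscore" rule is an assumption about the failure mode.

```python
from collections import Counter


def sample_id_from_seq_id(seq_id):
    """QIIME 1 style '<SAMPLEID>_<INTEGER>': drop the trailing integer."""
    return seq_id.rsplit("_", 1)[0]


def naive_sample_id(name):
    """Hypothetical parser that treats '_' as reserved and keeps only
    the text before the first underscore (assumed failure mode)."""
    return name.split("_")[0]


sample_ids = [
    "COPD_092A", "COPD_111C", "COPD_114A", "COPD_120A",
    "COPD_127A", "COPD_134A", "COPD_15_2D",
]

# A sequence ID with a trailing integer still resolves to the full sample ID.
print(sample_id_from_seq_id("COPD_092A_17"))

# Under the naive rule, every sample collapses to the same prefix.
collapsed = [naive_sample_id(s) for s in sample_ids]
duplicates = [sid for sid, n in Counter(collapsed).items() if n > 1]
print(collapsed)
print(duplicates)
```

If the parsing works this way, all seven distinct IDs map to the same prefix, which would explain a duplicate-ID complaint even though the original IDs are unique.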

This raises two issues. First, Deblur or q2-deblur should possibly be insensitive to this, and I’ve opened an issue to explore it. I say "possibly" because previously in QIIME, _ was not allowed as a character in a sample identifier, and I believe this restriction is still enforced in Qiita. Second, I’m wondering whether this reserved character should persist in QIIME2, given that at least one tool in the QIIME ecosystem, and probably many, makes this assumption.

@jairideout @antgonza, perhaps we should discuss on this week's call?

Best,
Daniel

@fstudart, if you have a moment, would you be able to copy and paste the exception you’re receiving?

Hi Daniel.

Sure. First, I'd like to thank you for all your support.

I ran the deblur command again with the verbose option and I got this:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 240 \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --o-stats deblur-stats.qza \
  --verbose

Traceback (most recent call last):
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
    results = action(**arguments)
  File "", line 2, in denoise_16S
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
    output_types, provenance)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 334, in callable_executor
    output_views = callable(**view_args)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 96, in denoise_16S
    hashed_feature_ids=hashed_feature_ids)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 177, in _denoise_helper
    table = _load_table(tmp)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_deblur/_denoise.py", line 59, in _load_table
    table.update_ids(sid_map, axis='sample', inplace=True)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/biom_format-2.1.5-py3.5-linux-x86_64.egg/biom/table.py", line 1069, in update_ids
    errcheck(result)
  File "/home/fstudart/anaconda3/envs/qiime2-2017.8/lib/python3.5/site-packages/biom_format-2.1.5-py3.5-linux-x86_64.egg/biom/err.py", line 472, in errcheck
    raise ret
biom.exception.TableException: Duplicate sample IDs!

Plugin error from deblur:

Duplicate sample IDs!

See above for debug info.

Thanks,
FS


An off-topic reply has been split into a new topic: Duplicate sample IDs (with underscores) in Casava format

Please keep replies on-topic in the future.

Sure, I'm happy to discuss on the dev call this week!

Just a quick note: @thermokarst created an issue on q2-deblur to track progress on this (e.g. in case the fix doesn't make it into deblur but instead happens in q2-deblur).

The QIIME 1 demux format spec doesn't mention underscore being a reserved character, or if sample IDs can/cannot contain underscores. I realize in QIIME 1 the sample IDs weren't generally supposed to contain underscores, but in QIIME 2 we already support sample IDs with underscores (in Metadata and a variety of file formats). Unfortunately I don't think we can generally make underscore a reserved character because there are already .qza files with these kinds of IDs in the wild (we see users on the forum regularly using sample IDs with underscores).

This issue you raised is actually a more general consideration when writing QIIME 2 plugins. Each underlying tool (e.g. mafft, FastTree, deblur, dada2, ...) has different requirements and assumptions about the data being passed to it. The way to go (for now at least) is to have the plugin munge the data into an acceptable format for the underlying tool, and then munge the output (if necessary) to hand back to QIIME 2.

A concrete example of this is qiime alignment mafft. The underlying MAFFT program only works with FASTA header IDs that are shorter than ~250 characters; longer IDs are truncated in the output alignment. To deal with this, q2-alignment re-maps the output IDs back to the originals after MAFFT executes. We have to support long header IDs for the case when qiime dada2 denoise-single/paired is run with --p-no-hashed-feature-ids, in which case the header ID is the sequence itself.
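The re-mapping pattern described above can be sketched roughly as follows. This is not the q2-alignment implementation; the helper names are hypothetical and fake_aligner is only a stand-in for an external tool that cannot cope with long IDs.

```python
def shorten_ids(records):
    """Swap each ID for a short placeholder; return the new records
    and a map from placeholder back to the original ID."""
    reverse = {}
    shortened = []
    for i, (seq_id, seq) in enumerate(records):
        short = "seq{}".format(i)
        reverse[short] = seq_id
        shortened.append((short, seq))
    return shortened, reverse


def restore_ids(records, reverse):
    """Put the original IDs back on the tool's output."""
    return [(reverse[short], seq) for short, seq in records]


def fake_aligner(records):
    """Stand-in for the external tool: pads sequences with gaps
    so they all have the same length."""
    width = max(len(seq) for _, seq in records)
    return [(seq_id, seq.ljust(width, "-")) for seq_id, seq in records]


# With unhashed feature IDs, the ID is the sequence itself and can be long.
long_seq = "ACGTACGT" * 40
records = [(long_seq, long_seq), ("TTGACA", "TTGACA")]

short_records, reverse = shorten_ids(records)
aligned = restore_ids(fake_aligner(short_records), reverse)
```

The external tool only ever sees the short placeholders, so its ID length limit no longer matters, and the caller gets its original IDs back unchanged.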

Another example is the PHYLIP suite of programs, which have even stricter requirements about ID length. When porting the Alignment object from PyCogent into scikit-bio, we retained a method that re-mapped IDs to ascending integers so that data could be passed to PHYLIP. I'm not sure if QIIME 1 used this approach but there seems to be some sort of historical precedent to re-map IDs when necessary.


Thanks! This is perfect. The duplicate ID issue stems from here, and it is q2-deblur which assumes a QIIME1 sample ID structure.

@jairideout, it’s a MIENS compliance issue, as noted in validate_mapping_file.py, also in check_id_map.py. Deviating from this standard will create bugs and pain.

MIENS-compliant sample IDs aren't guaranteed to work with all tools, so the general problem I described above still stands. See my response above for two examples of this -- both MAFFT and PHYLIP don't work with longer IDs, even if they are MIENS-compliant. The issue is especially apparent with PHYLIP: IDs can only be up to 10 characters in length, so clearly a MIENS-compliant sample ID could still be incompatible with a plugin running PHYLIP. Thus, plugins need to be responsible for re-mapping IDs -- there isn't an ID naming standard that will satisfy the requirements of all bioinformatics tools that are wrapped by QIIME 2.

I'm happy to discuss this more on the dev call!


@fstudart, apologies that the conversation got a bit sidetracked. To bring it back to the issue you’re having:

Since q2-deblur doesn’t support sample IDs with underscores, you’ll need to rename your sample IDs so that they don’t contain underscores. Unfortunately this means renaming the sample IDs in your metadata file and also the filenames of your raw fastq sequence data (or your fastq manifest file, depending on how you imported your data). Basically this means starting your analysis over. If you don’t wish to do this, you could proceed with q2-dada2 or use an external program to create a feature table, which can then be imported into QIIME 2.
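If you do go the renaming route, a small script can build the old-to-new ID mapping and confirm that the new IDs are still unique before you touch the metadata file or filenames. This is only a sketch: the function name is hypothetical, and replacing underscores with periods is an assumption, so use whatever character your downstream tools accept.

```python
def strip_underscores(sample_ids, replacement="."):
    """Map each sample ID to a version without underscores and fail
    loudly if two IDs would become identical after renaming."""
    mapping = {sid: sid.replace("_", replacement) for sid in sample_ids}
    new_ids = list(mapping.values())
    if len(set(new_ids)) != len(new_ids):
        raise ValueError("Renaming would create duplicate sample IDs")
    return mapping


old_ids = [
    "COPD_092A", "COPD_111C", "COPD_114A", "COPD_120A",
    "COPD_127A", "COPD_134A", "COPD_15_2D",
]

mapping = strip_underscores(old_ids)
for old, new in mapping.items():
    print(old, "->", new)
```

The same mapping can then be applied consistently to the metadata file and to the fastq filenames or manifest, so the IDs stay in sync across all three.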

