Add readId-fastqId map file to dada2 output

kevinmcc21 · August 9, 2018, 7:18pm

I'm familiar with DADA2 and how it denoises reads, and I understand why the output from DADA2 uses hashed sequence identifiers. However, I often want to compare results between DADA2 (QIIME 2) and other analysis pipelines, which usually means tracking individual reads through each pipeline. It would be incredibly useful if the DADA2 step could optionally produce an output table that links FASTQ read IDs to the hashed sequence IDs used in downstream analysis. E.g.:
[hashed sequence id 1],[FASTQ id 1]
[hashed sequence id 1],[FASTQ id 2]
...
[hashed sequence id 2],[FASTQ id 101]
Having such a file would allow me to query specific read IDs to see what taxonomy they were ultimately classified as in the QIIME2 pipeline.
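
To illustrate the kind of lookup I have in mind, here is a minimal sketch in Python. It assumes the map file uses the two-column CSV layout shown above, and that the taxonomy has been exported from QIIME 2 as a tab-separated file with `Feature ID` and `Taxon` columns. The function names and the sample IDs (`feat_a`, `read_1`, and so on) are hypothetical; DADA2 does not currently produce this map file.

```python
import csv
import io


def load_read_map(handle):
    """Return {fastq_id: hashed_sequence_id} from 'hashed_id,fastq_id' rows."""
    read_to_feature = {}
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        feature_id, read_id = row
        read_to_feature[read_id.strip()] = feature_id.strip()
    return read_to_feature


def load_taxonomy(handle):
    """Return {feature_id: taxon} from a TSV with 'Feature ID' and 'Taxon' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Feature ID"]: row["Taxon"] for row in reader}


def taxonomy_for_read(read_id, read_to_feature, taxonomy):
    """Look up the taxon assigned to the feature a read was folded into.

    Returns None if the read is not in the map (for example, because it was
    filtered out) or if its feature has no taxonomy entry.
    """
    feature_id = read_to_feature.get(read_id)
    if feature_id is None:
        return None
    return taxonomy.get(feature_id)


# Hypothetical in-memory stand-ins for the two files.
MAP_CSV = "feat_a,read_1\nfeat_a,read_2\nfeat_b,read_101\n"
TAXONOMY_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    "feat_a\tk__Bacteria; p__Firmicutes\t0.98\n"
    "feat_b\tk__Bacteria; p__Proteobacteria\t0.91\n"
)

read_map = load_read_map(io.StringIO(MAP_CSV))
taxonomy = load_taxonomy(io.StringIO(TAXONOMY_TSV))
print(taxonomy_for_read("read_2", read_map, taxonomy))
```

With real data, the two `io.StringIO` objects would be replaced by open file handles. The map is keyed by read ID because each read belongs to at most one hashed sequence, while one hashed sequence collects many reads.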

thermokarst · August 10, 2018, 4:54pm

This is an interesting idea, @kevinmcc21 - it looks like it might fit under this open issue: support FeatureData[Sequences] (OTU Map), Issue #92 in qiime2/q2-types.

It sounds like this might specifically be a DADA2 feature request, though. Maybe you should open an issue on the official DADA2 tracker?

kevinmcc21 · August 10, 2018, 5:39pm

Thanks for the input. It looks like this is already an open issue for DADA2:
https://github.com/benjjneb/dada2/issues/354

The developer discussion is much more involved than what I'm looking for, but the general idea seems the same.

Regarding the open QIIME 2 issue you mentioned, I'm not sure the two are that similar. That one seems more focused on creating a particular data structure or file format, and I see no mention of FASTQ IDs. I am not that familiar with the overall QIIME 2 workflow, though, so I could be wrong.