`qiime dada2 denoise-paired` naming bug

nick-youngblut · October 9, 2018, 8:00pm

It appears that qiime dada2 denoise-paired can incorrectly match the paired read files of two samples when their names are identical except that one name has "_[0-9]" appended. For instance, a dada2 job I recently ran generated the following error:

# excerpt 
1) Filtering Error in filterAndTrim(unfiltsF, filtsF, unfiltsR, filtsR, truncLen = c(truncLenF,  :
  These are the errors (up to 5) encountered in individual cores...
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 6069, 295.

...and exporting the demultiplexed seqs .qza file (the input for the dada2 job) generated the following sequence files:

# file_name: number_of_sequences
./V130_166_L001_R1_001.fastq.gz: 295
./V130_2_167_L001_R1_001.fastq.gz: 6069
./V130_2_743_L001_R2_001.fastq.gz: 6069
./V130_742_L001_R2_001.fastq.gz: 295

So, it appears that dada2 matched the samples based on the sort order, which incorrectly combined "V130 R1" with "V130_2 R2" and "V130_2 R1" with "V130 R2". I would have expected this command to match samples based on the "sample-id" value listed in the MANIFEST. I checked the manifest, and the "V130" and "V130_2" samples seem to be labeled correctly for the correct read files. So, does dada2 just match read files based on sort order? If so, that would cause incorrect sample <--> read_file mapping.

Another section of the dada2 error was:

# excerpt 
Error in (function (fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0,  :
  Mismatched forward and reverse sequence files: 2768, 7554.

...and this corresponded with the following samples:

# file_name: number_of_sequences
./V378_1016_L001_R2_001.fastq.gz: 7554
./V378_2_1017_L001_R2_001.fastq.gz: 2768
./V378_2_441_L001_R1_001.fastq.gz: 2768
./V378_3_1018_L001_R2_001.fastq.gz: 12390
./V378_3_442_L001_R1_001.fastq.gz: 12390
./V378_440_L001_R1_001.fastq.gz: 7554

So, it seems to be another case of sample <--> read_file mismatch.
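
Both mismatches can be reproduced with plain lexicographic sorting. The following is a minimal sketch, assuming the exported names follow the `<sample-id>_<barcode-id>_L001_R<read>_001.fastq.gz` pattern seen in the listings above and that the forward and reverse files are paired by position after each list is sorted (that pairing step is a guess at the behavior, not confirmed from the dada2 code). The helper `sample_id` is hypothetical and not part of QIIME 2 or DADA2.

```python
# File names copied from the two listings above.
forward = [
    "V130_166_L001_R1_001.fastq.gz",
    "V130_2_167_L001_R1_001.fastq.gz",
    "V378_2_441_L001_R1_001.fastq.gz",
    "V378_3_442_L001_R1_001.fastq.gz",
    "V378_440_L001_R1_001.fastq.gz",
]
reverse = [
    "V130_2_743_L001_R2_001.fastq.gz",
    "V130_742_L001_R2_001.fastq.gz",
    "V378_1016_L001_R2_001.fastq.gz",
    "V378_2_1017_L001_R2_001.fastq.gz",
    "V378_3_1018_L001_R2_001.fastq.gz",
]


def sample_id(filename):
    """Hypothetical helper: drop the last four "_"-separated fields
    (barcode id, lane, read, set number) to recover the sample id."""
    return filename.rsplit("_", 4)[0]


# Assumed behavior: sort each list and pair the files by position.
pairs_by_sort = list(zip(sorted(forward), sorted(reverse)))
mismatched = [
    (f, r) for f, r in pairs_by_sort if sample_id(f) != sample_id(r)
]

# Pairing on the parsed sample id instead keeps the mates together.
reverse_by_id = {sample_id(r): r for r in reverse}
pairs_by_id = [(f, reverse_by_id[sample_id(f)]) for f in sorted(forward)]

for f, r in mismatched:
    print("sort-based pairing mixes", sample_id(f), "with", sample_id(r))
```

With these names, the character right after "V130_" or "V378_" is either the start of the "_N" suffix or the start of the barcode id, so the forward and reverse lists sort into different sample orders.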

$ qiime info
System versions
Python version: 3.5.5
QIIME 2 release: 2018.6
QIIME 2 version: 2018.6.0
q2cli version: 2018.6.0

Installed plugins
alignment: 2018.6.0
composition: 2018.6.0
cutadapt: 2018.6.0
dada2: 2018.6.0
deblur: 2018.6.0
demux: 2018.6.0
diversity: 2018.6.0
emperor: 2018.6.0
feature-classifier: 2018.6.0
feature-table: 2018.6.0
gneiss: 2018.6.0
longitudinal: 2018.6.0
metadata: 2018.6.0
phylogeny: 2018.6.0
quality-control: 2018.6.1
quality-filter: 2018.6.0
sample-classifier: 2018.6.0
taxa: 2018.6.0
types: 2018.6.0
vsearch: 2018.6.0

Application config directory
/ebio/abt3/nyoungblut/.config/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org

ebolyen · October 9, 2018, 9:24pm

This is some wonderful sleuthing @nick-youngblut!

QIIME 2 does split the forward and reverse reads into two different folders; however, DADA2 must be using alphabetical sorting to pair them together. Since your ID scheme sometimes has an _N and sometimes does not, the sort ends up comparing against the barcode segment, which throws everything out of whack.

I don't have a good idea of how to fix this in our code yet, but in the meantime, changing your scheme to always have an _N (perhaps _0 when there isn't "supposed" to be that segment) should cause the sort order to be correct.
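
As a rough illustration of that workaround, here is a sketch that pads sample IDs lacking a trailing _N with _0 and checks that the forward and reverse names then sort into the same sample order. The helpers `normalize_sample_id` and `casava_name` are hypothetical, and the barcode ids are simply reused from the listings above for illustration; the ids assigned on a fresh import may differ.

```python
import re


def normalize_sample_id(sample_id):
    """Hypothetical helper: append "_0" when the id has no trailing _N."""
    if re.search(r"_\d+$", sample_id):
        return sample_id
    return sample_id + "_0"


def casava_name(sample_id, barcode_id, read):
    """Hypothetical helper: build a file name in the pattern seen above."""
    return "{}_{}_L001_R{}_001.fastq.gz".format(sample_id, barcode_id, read)


# sample id -> (forward barcode id, reverse barcode id), from the listings.
samples = {
    "V130": (166, 742),
    "V130_2": (167, 743),
    "V378": (440, 1016),
    "V378_2": (441, 1017),
    "V378_3": (442, 1018),
}

forward = sorted(
    casava_name(normalize_sample_id(s), ids[0], 1) for s, ids in samples.items()
)
reverse = sorted(
    casava_name(normalize_sample_id(s), ids[1], 2) for s, ids in samples.items()
)

# Recover the (renamed) sample id from each file name, in sorted order.
forward_ids = [name.rsplit("_", 4)[0] for name in forward]
reverse_ids = [name.rsplit("_", 4)[0] for name in reverse]

print(forward_ids == reverse_ids)
```

The renaming itself would be applied to the sample-id values in the manifest before importing.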

Great find, thanks for bringing this to our attention!

ebolyen · October 10, 2018, 6:34pm

Issue created here:
https://github.com/qiime2/q2-dada2/issues/102
