Discrepency in read counts between QIIME2 Import and Dada2 in R

reige012 · September 21, 2021, 1:11pm

I'm working on a small dataset from a Miseq run and have some questions. I used the standard import function in QIIME2 for demultiplexed paired-end reads and it worked fine with no errors. However, ended up with really low read counts. This surprised my colleague who used dada2 in R to join the reads and we discovered a discrepancy. My read counts are almost 10,000 less per sample than his and thats before using the Dada2 plugin in QIIME2 for trimming and chimera removal etc. I can't see any information anywhere about changing the import parameters or if there is a standard quality setting for the import step. It just seems very strange that we would have such large differences between the two methods.

This are the import commands I used in QIIME2:

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path /RawReadsSponge/reads/ --source-format CasavaOneEightSingleLanePerSampleDirFmt --output-path SpongeVacuSIP_demux.qza

qiime demux summarize --i-data SpongeVacuSIP_demux.qza --o-visualization SpongeVacuSIP_demux.qzv

qiime tools export SpongeVacuSIP_demux.qzv --output-dir demultiplexed.sequences.per.sample.view

The demux visualization file shows good quality on the paired reads (>28 until 245 bp) and I would only trim off 5 to 10 bp at the end of the reverse read in the dada plugin step.

My colleague is using the following R commands:

library(dada2); packageVersion("dada2")
#File parsing
pathF <- "FWD" # CHANGE ME to the directory containing your demultiplexed forward-read fastqs
pathR <- "REV" # CHANGE ME ...
filtpathF <- file.path(pathF, "filtered") # Filtered forward files go into the pathF/filtered/ subdirectory
filtpathR <- file.path(pathR, "filtered") # ...
fastqFs <- sort(list.files(pathF, pattern="fastq"))
fastqRs <- sort(list.files(pathR, pattern="fastq"))
if(length(fastqFs) != length(fastqRs)) stop("Forward and reverse files do not match.")
filterAndTrim(fwd=file.path(pathF, fastqFs), filt=file.path(filtpathF, fastqFs),
rev=file.path(pathR, fastqRs), filt.rev=file.path(filtpathR, fastqRs),
truncLen=c(240,150), maxEE=2, truncQ=11, maxN=0, rm.phix=TRUE,
compress=TRUE, verbose=TRUE, multithread=TRUE, matchIDs = TRUE)

#File parsing
filtpathF <- "FWD2/filtered"
filtpathR <- "REV2/filtered"
filtFs <- list.files(filtpathF, pattern="fastq.gz", full.names = TRUE)
filtRs <- list.files(filtpathR, pattern="fastq.gz", full.names = TRUE)
sample.names <- sapply(strsplit(basename(filtFs), ""), [, 1)
sample.namesR <- sapply(strsplit(basename(filtRs), ""), [, 1)
if(!identical(sample.names, sample.namesR)) stop("Forward and reverse files do not match.")
names(filtFs) <- sample.names
names(filtRs) <- sample.names
set.seed(100)

#Learn forward error rates
errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)
#Learn reverse error rates
errR <- learnErrors(filtRs, nbases=1e8, multithread=TRUE)

#Sample inference and merger of paired-end reads
mergers <- vector("list", length(sample.names))
names(mergers) <- sample.names
for(sam in sample.names) {
cat("Processing:", sam, "\n")
derepF <- derepFastq(filtFs[[sam]])
ddF <- dada(derepF, err=errF, multithread=TRUE)
derepR <- derepFastq(filtRs[[sam]])
ddR <- dada(derepR, err=errR, multithread=TRUE)
merger <- mergePairs(ddF, derepF, ddR, derepR)
mergers[[sam]] <- merger
}
rm(derepF); rm(derepR)

#Construct sequence table and remove chimeras
seqtab <- makeSequenceTable(mergers)
saveRDS(seqtab, "seqtab_K1.rds")

Any help you can offer on why we would get such huge read count differences between the two methods would be really helpful.

Thank you!

thermokarst · September 23, 2021, 7:28pm

Hi @reige012!

All importing does is simply copy the files into a QZA file. If you didn't run any additional filtering or other qa/qc steps then the QZA file should be identical to what was imported.

You can confirm this by running the following command in your QIIME 2 environment and comparing it to the per-sample read counts in the viz generated by demux summarize:

cd /RawReadsSponge/reads
for f in *.fastq.gz; do r=$(( $(gunzip -c $f | wc -l | tr -d '[:space:]') / 4 )); echo $r $f; done

If you're still observing a discrepancy with your colleague's observation, my suggestion is to double-check their method for computing the raw reads is comparing the same thing as you, at the same steps.

system · October 25, 2021, 1:28am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.