The dataset in question was generated on a run with very poor throughput, so I'm trying to salvage what was provided. Specifically, all I intend to do with this dataset is assign taxonomy - a list of what taxa are likely present among all samples. No abundance considerations, no per-sample richness... just a simple list.
I'm starting with about 280,000 read pairs split among a few hundred samples, generated on a 2x300 MiSeq run. Yep, that's only about 1,000 reads per sample on average. I began by importing the demultiplexed data into QIIME and removed the primers with the cutadapt plugin. Then I ran DADA2 using the same parameters I've used in the past:
```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs my.demux_seqs.qza \
  --p-trunc-len-f 175 \
  --p-trunc-len-r 175 \
  --o-table my.table.qza \
  --o-representative-sequences my.repSeqs.qza \
  --o-denoising-stats my.denoisingStats.qza
```
yet generated an error:
Plugin error from dada2: No reads passed the filter. trunc_len_f (175) or trunc_len_r (175) may be individually longer...
That was a bit strange given the error profiles (see below).
A few things to note:
- The expected amplicon lengths of the forward and reverse reads differ: R1 should sum to 273 bp, so there is about 27 bp of noise at the tail of the 300 bp read (which you can kind of make out in the above error profile). R2 should sum to 278 bp, but the error profile is clearly tanking around 200 bp.
- The expected read isn't the same as the expected amplicon mentioned above, because what we're actually sequencing contains primer/adapter sequence at the 5' end of each R1 and R2 read, and adapter read-through at the 3' end. When properly trimmed, both R1 and R2 should ultimately be reduced to just 181 bp.
And just to highlight the point here: I'm using the expected read, not the expected amplicon, when running DADA2.
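To sanity-check myself, here's the arithmetic I'm working from, as a rough sketch (the 12 bp minimum overlap is my assumption about DADA2's merging default, and the ~200 bp merged length is an inference from the >=160 bp overlaps vsearch reports further below; neither number comes from the actual run output):

```python
# Quick sanity check of DADA2-style filtering/merging arithmetic
# (a sketch, not DADA2 itself). Assumption: reads shorter than
# trunc_len are discarded, and ~12 bp of overlap is needed to merge.
MIN_OVERLAP = 12

def passes_truncation(read_len, trunc_len):
    """A read survives length filtering only if it is at least trunc_len long."""
    return read_len >= trunc_len

def overlap_after_truncation(trunc_f, trunc_r, merged_len):
    """Bases of R1/R2 overlap remaining after truncation, for a merged_len amplicon."""
    return trunc_f + trunc_r - merged_len

# My trimmed reads are 181 bp, so truncating at 175 should be fine:
print(passes_truncation(181, 175))              # True
# And if the merged amplicon is ~200 bp, 175/175 truncation leaves
# far more than 12 bp of overlap:
print(overlap_after_truncation(175, 175, 200))  # 150
```

By this arithmetic, the 175/175 settings shouldn't be anywhere near the failure point, which is exactly why the error surprised me.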
I thought perhaps my overlap wasn't working as expected, so I reduced the `--p-trunc-len-*` parameters, substituting 175 with 150 (the same value for both `-f` and `-r`). I get the same message. I then wondered if I could just avoid the truncation entirely by substituting the values with 0, and still got the same message.
I then went back to the drawing board and ran cutadapt manually so that I could visually confirm that I do indeed have the exact trimmed amplicons I intend. And indeed, my dataset is trimmed as expected.
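For the record, my "visual confirmation" amounts to a spot-check along these lines (a sketch; `PRIMER_F` is a stand-in for the real forward primer, and the filename is illustrative):

```python
import gzip

# Hypothetical forward primer; substitute the real one.
PRIMER_F = "GTGYCAGCMGCCGCGGTAA"

# IUPAC ambiguity codes so degenerate primer bases still match.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
         "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
         "H": "ACT", "V": "ACG", "N": "ACGT"}

def starts_with_primer(seq, primer):
    """True if seq still begins with the primer (IUPAC-aware)."""
    if len(seq) < len(primer):
        return False
    return all(s in IUPAC[p] for s, p in zip(seq, primer))

def count_untrimmed(fastq_gz, primer):
    """Count reads whose 5' end still matches the primer, and total reads."""
    hits = total = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines in a FASTQ record
                total += 1
                hits += starts_with_primer(line.strip(), primer)
    return hits, total

# e.g. count_untrimmed("sample_L001_R1_001.trimd.fastq.gz", PRIMER_F)
# should return (0, N) on a properly trimmed file.
```

If the hit count is zero after trimming, the primers really are gone and the problem lies elsewhere.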
Side note: it would be great to have an option to pass linked adapters within QIIME, as described in the cutadapt docs. As it stands now, I can trim these reads using a combination of the `--p-front-*` parameters, but as far as I can tell these aren't recognized as linked in the QIIME implementation.
I can thus confirm that neither standalone cutadapt nor the plugin version within QIIME makes a difference - I get the same error no matter what I feed into DADA2.
Now for the weird part...
I can run DADA2 in single end mode and the process doesn't generate an error:
```shell
qiime dada2 denoise-single \
  --i-demultiplexed-seqs my.trimd_seqs.qza \
  --p-trunc-len 180 \
  --o-table my.table.qza \
  --o-representative-sequences my.repSeqs.qza \
  --o-denoising-stats my.denoisingStats.qza
```
but when I look at the stats, I think it's only passing because it isn't actually denoising: for every sample, the number of filtered reads is identical to the number of denoised reads.
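Rather than eyeballing the visualization, I checked this by diffing the two columns in the exported stats table (a sketch; I'm assuming the exported TSV uses QIIME's usual `sample-id` / `filtered` / `denoised` column names, so adjust to whatever your export actually contains):

```python
import csv
import io

def unchanged_samples(stats_tsv_text):
    """Return sample IDs whose filtered and denoised read counts are identical."""
    rows = csv.DictReader(io.StringIO(stats_tsv_text), delimiter="\t")
    return [r["sample-id"] for r in rows
            if r["sample-id"] != "#q2:types"  # skip QIIME's type row if present
            and int(r["filtered"]) == int(r["denoised"])]

# Illustrative data, not my real stats:
example = """sample-id\tinput\tfiltered\tdenoised
S001\t1021\t978\t978
S002\t993\t940\t911
"""
print(unchanged_samples(example))  # -> ['S001']
```

In my case, every single sample came back in that list, which is what makes me think no denoising happened at all.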
And this isn't what I should be doing anyway - I'm performing a denoising step on paired data while pretending it's single-end. I certainly could join the read pairs first and then run this command, which is my current plan...
One last thought: I took the standalone Cutadapt data (now just my reads, with all primers removed) and tried joining the read pairs with the QIIME installed version of Vsearch. And it worked just fine!
```shell
# "$i" represents one of the 288 samples...
vsearch --threads 24 \
  --fastq_mergepairs "$i"_L001_R1_001.trimd.fastq.gz \
  --reverse "$i"_L001_R2_001.trimd.fastq.gz \
  --fastq_minovlen 160 \
  --fastq_maxdiffs 15 \
  --fastqout "$i".merged.fastq
```
One peculiar thing to note: at the end of this merging step, 1 of the 288 samples contains zero reads. The remaining 287 samples all contain > 0 reads, and after BLASTing a random sampling of those reads I can confirm they're the kind of data I'd expect.
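Finding the empty sample is easy enough to script (a sketch; it assumes plain, uncompressed `*.merged.fastq` files as written by the vsearch command above):

```python
from pathlib import Path

def read_count(fastq_path):
    """Number of records in an uncompressed FASTQ (4 lines per record)."""
    with open(fastq_path) as fh:
        return sum(1 for _ in fh) // 4

def empty_samples(directory="."):
    """List merged FASTQ files that contain zero reads."""
    return sorted(p.name for p in Path(directory).glob("*.merged.fastq")
                  if read_count(p) == 0)

# e.g. empty_samples("merged/") flags the one dud sample so it can be
# excluded before handing the data back to QIIME.
```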
I'm unclear why DADA2 doesn't see amplicons long enough to overlap when the same kinds of data in previous runs have successfully overlapped. I'm wondering if it has something to do with the fact that there are so few reads per sample. The temporary .log file produced by the failed runs suggests that no samples are overlapping, but perhaps there's a bug in the script where, if one sample ends up with zero reads, things go haywire and the system interprets every sample as having no data?
It's all dark magic to me.
Thanks for helping me get to a better place with my data...