Hello
Thank you for all the help and support in this forum! I seem to be losing a lot of reads during denoising. I've checked a couple of posts that suggested using only the forward reads, making sure primers are removed, and making sure there's enough overlap for merging. I think I've addressed these potential issues. I have paired-end reads (2 x 251 kit, demultiplexed, 515F and 926R primers for V4-V5). I used cutadapt to remove the primers:
cutadapt \
-g ^GTGYCAGCMGCCGCGGTAA \
-G ^CCGYCAATTYMTTTRAGTTT \
-m 209 -M 255 \
--too-short-output untrimmed/${sample}_R1_too-short.fastq.gz \
--too-long-output untrimmed/${sample}_R1_too-long.fastq.gz \
--untrimmed-output untrimmed/${sample}_R1_untrimmed.fastq.gz \
--too-short-paired-output untrimmed/${sample}_R2_too-short.fastq.gz \
--too-long-paired-output untrimmed/${sample}_R2_too-long.fastq.gz \
--untrimmed-paired-output untrimmed/${sample}_R2_untrimmed.fastq.gz \
-o trimmed_fastq/${sample}_R1_001.fastq.gz -p trimmed_fastq/${sample}_R2_001.fastq.gz \
fastqfiles/${sample}_R1_001.fastq.gz fastqfiles/${sample}_R2_001.fastq.gz \
>> cutadapt_primer_trimming_stats.txt 2>&1
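A rough way to double-check the trimming would be to count how many reads in the trimmed files still begin with the primer. The sketch below is only an illustration: the helper names (`primer_regex`, `count_primer_starts`, `fastq_sequences`) are made up for this post, the toy reads are not real data, and it only catches exact matches at the very start of the read, so primers shifted by an early insertion or carrying mismatches would not be counted.

```python
import gzip
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a regex that matches the degenerate primer at the read start."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer.upper()))


def count_primer_starts(sequences, primer):
    """Return (reads that start with the primer, total reads seen)."""
    pattern = primer_regex(primer)
    hits = total = 0
    for seq in sequences:
        total += 1
        if pattern.match(seq):
            hits += 1
    return hits, total


def fastq_sequences(path):
    """Yield the sequence line of each record in a gzipped 4-line FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


FWD_PRIMER = "GTGYCAGCMGCCGCGGTAA"  # forward primer from the cutadapt command

# Toy reads for illustration only: the first starts with the primer,
# the second does not.
toy_reads = ["GTGCCAGCAGCCGCGGTAATACG", "TACGGAGGGTGCAAGCGTTAATC"]
print(count_primer_starts(toy_reads, FWD_PRIMER))
```

On real data, the toy list would be replaced with `fastq_sequences(...)` pointed at one of the trimmed R1 files (and the reverse primer with the R2 files).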
Judging by the cutadapt output, the expected number of bases was removed. I then used the following parameters for denoising:
qiime dada2 denoise-paired \
--i-demultiplexed-seqs run-1-demux-paired-end.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 226 \
--p-trunc-len-r 200 \
--p-trunc-q 2 \
--p-max-ee-f 2 \
--p-max-ee-r 2 \
--p-chimera-method consensus \
--p-hashed-feature-ids \
--o-representative-sequences run-1-rep-seqs.qza \
--o-table run-1-table.qza \
--o-denoising-stats run-1-stats.qza
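For reference, as far as I understand the `--p-max-ee-f` and `--p-max-ee-r` values are compared against each read's expected errors, usually defined as the sum of the per-base error probabilities implied by the quality scores. A minimal sketch of that calculation, assuming Phred+33 encoding (the function name and the toy quality string are just for illustration):

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string)


# Toy quality string: five bases at Q40 ("I") followed by five at Q20 ("5").
toy_quality = "I" * 5 + "5" * 5
ee = expected_errors(toy_quality)
print(f"expected errors: {ee:.4f}")
print("within a max EE of 2:", ee <= 2)
```

The point of the sketch is that a handful of low-quality bases dominates the sum, which is why truncation length and the max EE setting interact.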
My demux data is here - lower quality than some of my other runs, but hopefully not horrible:
run-1-demux-paired-end.qzv (306.4 KB)
The denoising stat summary is here:
run-1-denoising-stats.qzv (1.2 MB). The percentage of input that is non-chimeric ranges from ~20% to 75%, with most samples in the 35%-55% range. The percentage of input merged ranges from 55% to 85%.
I also seem to be losing a lot of reads in better-quality runs, for example run-3-demux-paired-end.qzv (304.3 KB), which I denoised with:
qiime dada2 denoise-paired \
--i-demultiplexed-seqs run-3-demux-paired-end.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 228 \
--p-trunc-len-r 223 \
--p-trunc-q 2 \
--p-max-ee-f 2 \
--p-max-ee-r 2 \
--p-chimera-method consensus \
--p-hashed-feature-ids \
--o-representative-sequences run-3-rep-seqs.qza \
--o-table run-3-table.qza \
--o-denoising-stats run-3-stats.qza
The better quality didn't seem to have much of an effect: run-3-denoising-stats.qzv (1.2 MB). In this run ~20% - 80% of reads were non-chimeric, with most samples in the 30% - 40% range. Here 50% - 80% merged, so I'm not sure the losses have to do with low quality scores. Merging looks OK as well, so I think the overlap is sufficient.
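The overlap reasoning can be written out as simple arithmetic: the two truncated reads together must be longer than the amplicon, and the excess is the overlap available for merging. The sketch below assumes a primer-trimmed V4-V5 amplicon of about 373 bp; that number is my assumption, not something measured from these runs, and real amplicon lengths vary between taxa. DADA2's `mergePairs` documents a default minimum overlap of 12 bases, which is what the result would be compared against.

```python
def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


# Hypothetical amplicon length after primer removal (assumption, see above).
ASSUMED_AMPLICON_LEN = 373

# Truncation lengths taken from the two denoise-paired commands.
runs = {"run-1": (226, 200), "run-3": (228, 223)}

for run, (trunc_f, trunc_r) in runs.items():
    overlap = expected_overlap(trunc_f, trunc_r, ASSUMED_AMPLICON_LEN)
    print(f"{run}: about {overlap} bases of overlap")
```

Longer-than-assumed amplicons would reduce these numbers one for one, so the margin matters more than the exact value.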
- Is it possible that cutadapt wasn't very effective and I might have a lot of primers hanging around? Maybe I shouldn't have anchored the removal to the 5' end in case there were a few early insertions?
- Are my denoising parameters too strict?
- Am I misreading the report? Each of the summary statistics (percent that pass filtering, percent that pass denoising and merging, and percent that are non-chimeric) is given with respect to the total number of input reads, so I can't tell directly how many reads passed through to the next step; the input for each step doesn't match the output of the previous one. For the first row of run-3-denoising-stats (sorted by total input), am I correct to say that I started with 55871 reads, 49324 of those passed the initial filtering, 47217 of those filtered reads passed denoising, 35438 of those denoised reads were successfully merged, and only 20809 of those merged reads were non-chimeric?
- What part of the denoise-paired command constitutes filtering vs denoising? Is the filtering based on the trunc-len and trunc-q parameters and the denoising based on the allowed expected error (EE) with the ad-hoc error model applied?
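To make the question about the report concrete, here is a small sketch that takes the five counts quoted above for that first row and computes each step both as a percentage of the input and as a percentage of the previous step. The step labels and the `retention` helper are just names I picked for illustration, and the calculation assumes the columns really are sequential in the way I described.

```python
# Counts quoted above for the first row of the denoising stats.
counts = [
    ("input", 55871),
    ("filtered", 49324),
    ("denoised", 47217),
    ("merged", 35438),
    ("non-chimeric", 20809),
]


def retention(counts):
    """Return (step, % of input, % of previous step) for each step after input."""
    total = counts[0][1]
    rows = []
    for (_, prev), (step, n) in zip(counts, counts[1:]):
        rows.append((step, 100 * n / total, 100 * n / prev))
    return rows


for step, pct_input, pct_prev in retention(counts):
    print(f"{step:>12}: {pct_input:5.1f}% of input, {pct_prev:5.1f}% of previous step")
```

Read this way, the largest step-to-step losses in that row are at merging and at chimera removal rather than at the initial filtering.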
Thank you - apologies for the long post and all the questions. Definitely feels like as soon as I feel I'm starting to understand, I have so many more questions!