My setup:
I have 16S rRNA V4 amplicon sequences generated with the 515F–806R primer pair. Samples were extracted from soil. There are about 6 million reads across 74 samples. Reads are paired-end 2×300 bp. I'm running QIIME2 2019.10 in a conda environment.
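For reference, the demux.qzv attached at the bottom is just the standard summary visualization; assuming I haven't mixed anything up, this is the command that produced it:

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv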
Problem:
When I run DADA2, I lose a lot of reads: around 80% are gone by the end of the chimera-removal step.
The biggest drop is at the filtering step, where I lose about 50%.
What I've done already:
I've read through the DADA2 tutorial and three QIIME2 tutorials (Atacama, Moving Pictures, and FMT) and several forum posts.
- trimmed reads need to overlap enough to merge. As I understand it, the 515F–806R V4 amplicon is only ~250–253 bp after primer removal, so my truncation lengths below should leave a healthy overlap (rough math after the command below). I've tried not truncating at all, but I get fewer reads in the end, which makes sense: more errors survive, so more reads fail filtering and merging.
- Shortening the reads should remove more errors and let more reads through. This seems to hold: when I shortened the reverse truncation from 160 to 140, I squeezed in a couple more percentage points of reads.
- Increasing --p-max-ee should let more reads pass the filter. I'm hesitant to relax this parameter, but I could still try it (a possible variant is sketched after the command below).
- In some circumstances, tossing the reverse reads (if they're low quality) and keeping just the forward reads can increase coverage, because fewer reads are lost at the merge step. I haven't tried this yet (see the sketch right after this list).
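If I do go forward-only, I believe it would look roughly like this; a sketch, assuming denoise-single will accept my paired-end demux artifact and use only the forward reads, with placeholder output names:

# forward reads only; the *-fwd.qza names are placeholders
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 290 \
  --p-max-ee 2 \
  --p-n-threads 6 \
  --o-table table-fwd.qza \
  --o-representative-sequences rep-seqs-fwd.qza \
  --o-denoising-stats denoising-stats-fwd.qza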
Command:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 6 \
  --p-trunc-len-f 290 \
  --p-trunc-len-r 140 \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --p-n-threads 6 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
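My back-of-the-envelope overlap math for these settings (the ~253 bp amplicon length is my assumption for 515F–806R after primer removal, and I believe DADA2's default minimum overlap is 12 nt):

# forward bases kept: 290 - 0 (trim-left-f) = 290
# reverse bases kept: 140 - 6 (trim-left-r) = 134
# total bases:        290 + 134 = 424
# expected overlap:   424 - 253 = ~171 nt, far above the 12 nt minimum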
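And if I do end up relaxing max-ee, my understanding is that only those two flags would change; something like this (the 3/5 values are just my guess at a gentler setting for the noisier reverse reads, not numbers from any tutorial):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 6 \
  --p-trunc-len-f 290 \
  --p-trunc-len-r 140 \
  --p-max-ee-f 3 \
  --p-max-ee-r 5 \
  --p-n-threads 6 \
  --o-table table-maxee.qza \
  --o-representative-sequences rep-seqs-maxee.qza \
  --o-denoising-stats denoising-stats-maxee.qza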
On the specific forum posts I've read:
- With this one, I'm using the default --p-trunc-q value, so I don't think it applies. They do suggest just using the forward reads, though.
- This one says "9000 sequences is plenty," so maybe my 10,000-20,000 reads per sample is "fine." However, I don't feel good about that metric: in these highly diverse soil communities I'd like to represent the diversity accurately, not wantonly toss reads.
Questions:
- Is losing 80% of reads high or is that typical? What about 50% during filtering?
- Are my trimming parameters leaving enough overlap and could I trim them more to reduce errors?
- Should I try adjusting max-ee or is that frowned upon? Will that hurt my merge step?
- Would it be worth running just the forward reads instead of trying to merge forward and reverse?
- Anything else I haven't considered?
My data:
demux.qzv (297.3 KB)
denoising-stats.qzv (1.2 MB)
Thank you!