I have 16S V4 amplicon sequences generated with the 515F–806R primers. Samples were extracted from soil. There are about 6 million reads across 74 samples, paired-end 2x300 bp. I'm running QIIME2 2019.10 in a conda environment.
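In case it's useful, this is how I've been generating the quality plots I use to pick truncation positions (demux.qza is the same artifact I pass to the denoise command below; demux.qzv is just my output name):

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv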
When I run DADA2 I lose a lot of reads: around 80% by the end of the chimera-removal step.
The biggest drop is during filtering, where I lose about 50%.
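I'm reading those percentages off the per-step counts (input/filtered/denoised/merged/non-chimeric) in the denoising stats, tabulated from the stats artifact produced by my command below:

qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv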
What I’ve done already:
I’ve read through the DADA2 tutorial and three QIIME2 tutorials (Atacama, Moving Pictures, and FMT) and several forum posts.
- Trimmed reads need to overlap enough to merge. As I understand it, the V4 region is generally less than 400 bp, so my truncation lengths below (290 + 140 = 430 nt of total read length) are pushing the limits and I probably shouldn't shorten them any more. I've tried not truncating at all, but I get fewer reads in the end (which makes sense, since the longer tails carry more errors).
- Shortening the reads should remove more errors and increase the number of reads that pass. This seems true: when I shortened the reverse truncation from 160 to 140, I gained a couple of percentage points of reads.
- Increasing --p-max-ee should increase the number of reads that pass filtering. I'm hesitant to relax this parameter, but I could still try it (see the variant command after my full command below).
- In some circumstances, tossing the reverse reads (if they're low quality) and denoising just the forward reads can increase coverage, because you lose fewer reads at the merge step. I haven't tried this yet.
Here is the full command I'm running:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 6 \
  --p-trunc-len-f 290 \
  --p-trunc-len-r 140 \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --p-n-threads 6 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
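If I do end up relaxing --p-max-ee, my plan would be to change only the reverse threshold, since the reverse reads are the lower-quality ones. The value 4 here is just a guess for testing, not a recommendation, and the output names are placeholders so I don't overwrite my originals:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 6 \
  --p-trunc-len-f 290 \
  --p-trunc-len-r 140 \
  --p-max-ee-f 2 \
  --p-max-ee-r 4 \
  --p-n-threads 6 \
  --o-table table-maxee4.qza \
  --o-representative-sequences rep-seqs-maxee4.qza \
  --o-denoising-stats denoising-stats-maxee4.qza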
Specific forum posts I've already read through:
With this one, I'm using the default --p-trunc-q value, so I don't think it applies. They do suggest just using the forward reads, though.
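If I do go forward-only, my understanding is that denoise-single will accept my existing paired-end artifact and simply ignore the reverse reads, so something like this should work (the trim/trunc/max-ee values mirror my forward-read settings above, and the output names are placeholders):

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 290 \
  --p-max-ee 2 \
  --p-n-threads 6 \
  --o-table table-fwd.qza \
  --o-representative-sequences rep-seqs-fwd.qza \
  --o-denoising-stats denoising-stats-fwd.qza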
This one says "9000 sequences is plenty," so maybe my 10,000–20,000 reads per sample are "fine." However, I don't feel good about that metric: in highly diverse soil communities I'd like to represent the diversity accurately without wantonly tossing reads.
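One sanity check I'm considering is whether alpha-rarefaction curves plateau at my per-sample depths; if they do, the absolute counts matter less. The max-depth of 10000 here is just my lower per-sample bound, not a recommended value:

qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --p-max-depth 10000 \
  --o-visualization alpha-rarefaction.qzv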
My questions:
- Is losing 80% of reads high, or is that typical? What about 50% during filtering?
- Are my truncation lengths leaving enough overlap for merging, and could I trim them more to reduce errors?
- Should I try adjusting max-ee or is that frowned upon? Will that hurt my merge step?
- Would it be worth running just the forward reads on my data, rather than trying to merge forward and reverse?
- Anything else I haven’t considered?