I'm currently trying to process my paired-end demultiplexed MiSeq libraries (primers 341F & 785R, targeting the V3-V4 16S region) with DADA2 and I've been getting low frequency counts: I started with 11,077,765 sequences but was left with only 740,243 after DADA2 (~7%). My quality plots are shown below:
Thanks for the details about your issue!
You're right that the output does seem rather low, but this may just be the nature of your data, with DADA2 doing its job properly.
With the 341F/785R primer pair we expect an amplicon size of ~444 bp (785 - 341), so with a 2x300 run you should have an overlap of roughly 600 - 444 = 156 bp. We therefore want to make sure the total length we truncate (forward + reverse combined) doesn't exceed 156 - (20 bp minimum overlap required + 20 bp buffer for natural length variation) = 116 bp. Based on that calculation, both of your scenarios have sensible truncating parameters. What I suspect is happening, however, is that the quality of your reverse reads is dropping low enough for DADA2 to discard them during the initial quality filter. This makes sense considering your second attempt kept more of the 3' tail of the reverse reads, which let in more poor-quality bases and made each read more likely to be dropped.
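Laying that arithmetic out in one place:

```bash
# amplicon length   ~ 785 - 341 = 444 bp
# combined reads    = 2 x 300   = 600 bp
# expected overlap  ~ 600 - 444 = 156 bp
# truncation budget = 156 - (20 bp min overlap + 20 bp length-variation buffer)
#                   = 116 bp total across forward + reverse
```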
Can you share the result of your denoising-stats.qza? This should tell us a bit more about what is happening.
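In case it's useful, you can turn the stats artifact into a viewable visualization with metadata tabulate (assuming the file is named denoising-stats.qza):

```bash
# Render the denoising stats artifact as a viewable .qzv
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
```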
An easy solution would be to discard the reverse reads and denoise just the forward reads, since they are in pretty good shape. This should retain far more reads, though at the cost of a shorter final amplicon.
Thank you so much @Mehrbod_Estaki for your help and detailed explanations!
Please find attached my denoising stats for the first run (--p-trim-left-f 23, --p-trim-left-r 40, --p-trunc-len-f 300, --p-trunc-len-r 193): denoising-stats.qzv (1.2 MB)
And for the second run (--p-trim-left-f 22, --p-trim-left-r 39, --p-trunc-len-f 300, --p-trunc-len-r 276): denoising-stats2.qzv (1.2 MB)
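For reference, the full command for the first run looked roughly like this (the artifact names here are placeholders):

```bash
# First run: truncate reverse reads at 193 to drop the low-quality tail
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 23 \
  --p-trim-left-r 40 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 193 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```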
I actually went ahead and ran DADA2 a third time, but ended up with only 17,801 sequences! The parameters for that run were set to:
If the quality of my reverse reads is too low, I will follow your suggestion of denoising only the forward reads and continue my analysis as single-end data.
Thanks for sharing those stats! These support our suspicion that the reverse reads are causing lots of reads to be filtered out initially. In the first scenario you merge many more reads because you trimmed away much of the poor-quality tail of the reverse reads. In scenario 2, far more reads are discarded up front, since quality starts to dip quite a bit by position 276, so your initial pool of reads to denoise/merge is small. Scenario 3 brings the highest number of reads through the initial filter since the truncating parameters are quite stringent, but truncating that much of course leaves insufficient overlap, so most reads can't merge properly.
If you absolutely must keep paired ends, then one last attempt would be to relax the maxEE parameter to, let's say, 5 (as suggested in the DADA2 tutorial). This should increase the number of reads that make it through the initial filter. You can probably also improve the error model a bit by increasing --p-n-reads-learn, though I believe the benefits would be limited. See the sketch below.
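A minimal sketch of that run (the exact flag spelling depends on your QIIME 2 version, as newer releases split --p-max-ee into --p-max-ee-f and --p-max-ee-r; the reads-learn value is just an illustrative bump over the default):

```bash
# Relax maxEE to 5 and give the error model more reads to learn from
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 23 \
  --p-trim-left-r 40 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 193 \
  --p-max-ee 5 \
  --p-n-reads-learn 2000000 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats-maxee5.qza
```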
Otherwise, I'm willing to bet that using the forward reads only, trimming the first 40 bp and truncating at 280, would give you a much higher output than all the other scenarios.
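Something like this (again, demux.qza is a placeholder name; when given paired-end data, denoise-single uses only the forward reads):

```bash
# Forward-reads-only denoising: trim the first 40 bp, truncate at 280
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 40 \
  --p-trunc-len 280 \
  --o-table table-single.qza \
  --o-representative-sequences rep-seqs-single.qza \
  --o-denoising-stats denoising-stats-single.qza
```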
I'm happy things seem to have worked out! I do have a concern I want to clarify with you, though. In the final result you posted, you show over 100,000 unique features in your 6.6 million reads! This seems rather high to me, unless you really are looking at a very diverse range of samples?
The other possibilities we want to check: a) have all non-biological sequences, like your primers, adapters, and barcodes, been removed from your reads prior to running DADA2? We often see inflated feature counts when these are not removed. And b) if you are running the forward reads only, you should revert --p-max-ee back to its default, since we are no longer worried about how many reads make it through in this scenario and can afford to be very strict about read quality.
Perhaps this is a non-issue but we wanted to make sure before moving forward!