Just to be certain we are talking about the same thing, where do you get the number 5,000 reads/sample from? (I assume the second tab of the feature-table summary?) And yes, poor quality can often result in losing data. Would you be able to provide the qiime demux summarize visualization for this data?
How many more reads did you get with a different max_ee value? And what value did you set? It may be the case that your data is just very noisy. In that case, it isn’t necessarily a bad thing that you have fewer reads, it just means the relative abundances of different ASVs are a little more uncertain. So whether that is a problem is going to depend on what questions you are trying to answer with the dataset.
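For context, DADA2’s max-ee filter is based on expected errors: it sums the per-base error probabilities implied by the Phred quality scores and drops any read whose total exceeds the threshold. Here’s a small sketch (the quality profiles are made up for illustration) of why raising max-ee lets noisier reads through:

```python
# Sketch of an expected-errors (max-ee) filter, assuming Phred-scaled
# quality scores. A read passes if EE = sum(10^(-Q/10)) <= max_ee,
# so raising max_ee admits reads with noisier tails.

def expected_errors(phred_scores):
    """Sum of per-base error probabilities: EE = sum(10^(-Q/10))."""
    return sum(10 ** (-q / 10) for q in phred_scores)

# Hypothetical reads: one uniformly good, one with a quality crash.
good_read = [38] * 200                 # high quality throughout
noisy_read = [38] * 150 + [12] * 50    # quality collapses in the tail

for max_ee in (2, 10):
    for name, read in (("good", good_read), ("noisy", noisy_read)):
        print(f"max_ee={max_ee}: {name} passes={expected_errors(read) <= max_ee}")
```

With these numbers the noisy read fails at max-ee=2 but squeaks through at max-ee=10, which is exactly the trade-off being discussed here.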
Also, is your data paired-end? An easy way to lose lots of sequences is if your reads don’t overlap enough.
Another possibility is that your real reads are being seen as chimeric (and then removed), which usually happens because your reads still have primers/adapters/non-biological sequence data attached. These need to be removed before processing (we don’t have any tooling to help with this yet in QIIME 2, but cutadapt is nice).
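To see why leftover primers cause trouble, here’s a toy sketch (not cutadapt itself; the 515F primer is just a common example) of matching and stripping a primer with ambiguous IUPAC bases from the 5′ end of a read. Because of those ambiguous positions, the same primer appears as several different literal sequences across reads, which is what trips up exact-sequence methods:

```python
import re

# Map IUPAC ambiguity codes to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def strip_primer(read, primer):
    """Remove the primer from the 5' end of a read, if present."""
    pattern = "^" + "".join(IUPAC[b] for b in primer)
    return re.sub(pattern, "", read)

primer = "GTGYCAGCMGCCGCGGTAA"  # 515F; Y and M are ambiguous positions
read = "GTGTCAGCAGCCGCGGTAATACGGAGGGTGC"
print(strip_primer(read, primer))  # leaves only the biological sequence
```

Real tools like cutadapt also handle mismatches, partial matches, and 3′ adapters, so use those for actual data; this just shows the idea.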
These quality scores seem pretty good. It looks like something upstream is clipping the length (but that doesn’t happen to the vast majority of the reads), which is consistent with some basic quality control (probably the MiSeq/Casava?). To the best of my knowledge this shouldn’t be a problem for DADA2.
But I would probably set a trim-left for this data because there is a pretty noticeable dip in the beginning.
I’m sorry I should have mentioned that you can usually use trim-left for this since the length of your non-biological data at the start of each read is known ahead of time. It’s the reverse-primers that QIIME 2 has trouble with at the moment, in which case cutadapt is a great way to handle the issue (usually this matters for ITS).
Given that the difference in reads between max-ee=2 and max-ee=10 isn’t that large relative to the total number of reads, my guess is your data is getting caught up in chimera detection. So you should probably set a trim-left-f/r that covers your non-biological data.
It looks like setting trim-left to just after the quality dip basically doubled the number of features!
The last thing that we can check is if denoise-single results in a great many more features than its paired counterpart. You can run it with the same demux.qza; it will just only look at the forward reads. This lets us tell if the merge step is problematic (but it seems unlikely, since max-ee and trim-left seem to be controlling the number of features we see).
I don’t think there is a hard and fast rule. 10 seems pretty high, but both 6 and 10 have similar feature distributions. Since you have these tables already (denoising is the step that takes the longest), you might try running some preliminary analysis on each to see if they say different things. @benjjneb, do you have any suggestions on maxEE?
maxEE of 10 is pretty high. There is not much value in pushing more high-error reads through, it’s usually better to trim off a bit more of the tails while keeping maxEE lower.
I’ll also add: If adding trim-left increased the number of reads getting through by a lot, that probably means that you have primers at the start of your reads (trimming off primers will reduce the number of reads lost to spurious chimera detection due to the ambiguous nucleotides in the primers).
I can’t stress enough: Make sure your primers are removed! Primers are not biological nucleotides, and they usually contain ambiguous nucleotide positions. You could kind of get away with leaving primers on when making fuzzy OTUs. You can’t when you are calling exact sequences!
Yes, the drop-off of quality at the ends of the reads causes many reads to be lost to the quality filters. When you truncated earlier, you removed the worst parts of that tail and kept more reads.
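A toy illustration of that effect (the quality values are made up to mimic a read whose tail degrades): most of a read’s expected errors live in the low-quality tail, so truncating before the tail can flip a read from failing to passing the filter.

```python
# Sketch of why truncating earlier keeps more reads: chopping the
# low-quality tail removes most of a read's expected errors (EE).

def expected_errors(phred_scores):
    """EE = sum of per-base error probabilities, 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)

read = [37] * 220 + [10] * 30   # quality collapses over the last 30 nt

full = expected_errors(read)          # tail alone contributes EE of 3.0
truncated = expected_errors(read[:220])
print(f"EE full: {full:.2f}, truncated at 220: {truncated:.2f}")
# At max-ee=2 the full read is discarded, but the truncated read passes.
```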
In rough numbers, the size of your amplicon (from primer start positions) is ~425 nts. You need 30 nts of overlap to be safe, so you need 425 + 30 ≈ 455 nts after truncation. That is, trunc-len-f + trunc-len-r > 455.
As long as that condition holds, you can reduce trunc-len to get more reads through the filter, and that is often the right choice when these low-quality tails exist.
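That condition is easy to sanity-check before rerunning the denoiser. Here’s a one-liner using the numbers above (~425 nt amplicon plus 30 nt of safety overlap; swap in your own amplicon size):

```python
# Check whether a pair of truncation lengths still leaves enough
# overlap to merge, using this thread's numbers as an example.

AMPLICON = 425     # approximate amplicon length from primer starts
MIN_OVERLAP = 30   # comfortable safety margin for merging

def enough_overlap(trunc_len_f, trunc_len_r):
    """True if the truncated pair still spans amplicon + overlap."""
    return trunc_len_f + trunc_len_r > AMPLICON + MIN_OVERLAP

print(enough_overlap(240, 230))  # 470 > 455: OK
print(enough_overlap(220, 220))  # 440 > 455 fails: reads won't merge
```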