You've asked many questions in the posts above. Rather than attempt to answer them all, I'm going to keep things high-level.
What is "good enough"?
Rather than accept some arbitrary external threshold for "good enough", let's assume your goal is to achieve "the best that is possible" after taking the limitations of your data and tools into account. A naive first objective would be to try to keep all the data. Sequencing is expensive, so why throw out data, right?
We have to modify this goal, because amplification and sequencing introduce "noise" into our data. Noisy data contains false (not representative of the actual sequence) or untrustworthy (has a low q-score, so we think it might be wrong) bits. We attempt to correct this by "denoising" it. Our goal with denoising, then, is to keep as much of the "true" read data as possible, while getting rid of the "false" or untrustworthy data. After all, you're not interested in arbitrary strings of A's, C's, T's, and G's. You care about the biological sequences those letters represent.
Filtering
Filtering removes sequences that contain untrustworthy data. If a sequence has a sufficiently low q-score at position x, DADA2 removes the whole sequence. By trimming away the positions with low mean q-scores, we can prevent DADA2 from filtering out entire reads. This is a compromise - we are effectively deleting the untrustworthy parts of all of our sequences. By cutting off the bad bits, we save those reads from getting filtered out. By getting rid of some untrustworthy data, we keep more trustworthy sequences.
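In practice, that just means picking truncation positions from your quality plots. As a rough sketch (the file names and the 240/200 values below are placeholders, not recommendations - choose yours from your own data):

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```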
Joining paired-end reads
Joining paired-end reads allows us to truncate untrustworthy data from the "middle" of our reads, without losing the ends of those reads. It presents us with another compromise, though. As you're aware, if you cut too much out of the "middle" of your read, DADA2 won't have the information it needs to join it properly.
Remember, DADA2 is doing its best to give you the actual biological sequences from your samples. If you cut the middle out of a sequence, it can't reliably do that, so it again drops the whole sequence. Why? It's important that you can trust the data you have, even if that means you have less of it.
Truncation should be used to cut away untrustworthy data. Reads can only be joined if they contain trustworthy data (otherwise, they would have been dropped during filtering), and if the forward and reverse reads together are longer than the target amplicon. This means that joined reads are generally good data, and represent the full amplicon. A good first goal, then, is to maximize the number of successfully merged reads, by truncating just enough low-quality positions to keep as many true sequences as possible.
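To make that concrete with some made-up numbers: if your target amplicon is ~250 bp (roughly the 16S V4 region) and you truncate forward reads at 240 and reverse reads at 200, the reads overlap by about 240 + 200 - 250 = 190 bases, comfortably more than the ~12 nt of overlap DADA2 needs by default to merge them. Truncate at 130 and 100 instead and 130 + 100 = 230 < 250, so the reads don't even meet in the middle and nothing will join.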
trim-left
In an ideal world, your sequences begin at the beginning of the target amplicon, and end at its end. This lets you compare them to other sequences from the same amplicon easily, even across studies. For this reason, many people only use trim-left to cut away non-sequence data like primers. If you have enough low-quality positions at the 5' ends of your reads that you are losing many samples/sequences to filtering, you can also use trim-left to remove biological data. That's OK - just remember it, too, is a compromise.
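In command form, that's just two more flags on the same denoise call. The 19/20 below assume primer lengths like the common 515F/806R pair - substitute the lengths from your own protocol:

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 19 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --output-dir dada2-trimmed
```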
chimeras
Chimeras are not useful biological data. Increasing non-chimeric reads is good, because having more reads is good, and non-chimeric reads are potentially useful. Our goal is to get rid of the noise, so chimera removal is generally a good thing.
Despite this, if you are losing more than 25% of your input reads as chimeras, take that as an indicator that you are probably doing something wrong. Often this means that there are ambiguous nucleotides in your data, because primer sequences are still included. Remove them in DADA2 with trim-left, or with a tool like q2-cutadapt. There are tons of posts on this forum about trimming primers. Start there, and feel free to create a new topic if you have a specific question.
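For reference, a q2-cutadapt call for primer removal might look something like this - the 515F/806R primers shown are only an example, so use the sequences from your own protocol:

```
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-demux.qza
```

The --p-discard-untrimmed flag drops reads where no primer was found, which also helps keep non-target sequence out of DADA2.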
Note: The samples in your screenshots are not losing more than 20% to chimeras, so this is probably not an issue for you.
If you still have high levels of chimera loss after you have removed all primer data, then you may just have a lot of chimeras in your data. Adjusting p-min-fold-parent-over-abundance might be a good choice for you - just keep the values reasonable, or you run the risk of including a lot of "fake" reads in your data, which could skew your study results. This post covers the topic in wonderful detail.
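If you do go down that road, it's one extra parameter on the denoising call. The value of 4 below is purely illustrative, not a recommendation:

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-demux.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --p-min-fold-parent-over-abundance 4 \
  --output-dir dada2-minfold
```

Raising this value means a sequence is only flagged as a chimera if its candidate "parents" are at least that many times more abundant, so fewer reads get discarded - which is exactly why overly large values can let real chimeras slip through.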
IIRC, wet-lab processes (e.g. high PCR cycle counts) can increase the number of chimeras present in sequence data. If cleaning up primers and increasing min-fold can't recover useful numbers of non-chimeric sequences without accepting too many "false positives", you may need to reconsider your protocols.
What is "good enough" (again)?
Many people, myself included, go looking for "the best" solution, when a "good enough" solution will be adequate. Based on your screenshots, it looks like you're capturing ~10k reads from many of your samples after denoising. That might be enough sequencing depth to get you the statistical power you need. Every study is different, and only you know your study well enough to know if that's adequate. For many studies, optimizing sequencing depth is much less important than the results of downstream analysis. Once you've chosen parameters that get you a good number of merged reads, I'd just move on and see what you can learn from your data.
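One easy way to keep an eye on those numbers while you compare parameter choices is to tabulate the DADA2 stats artifact (file names here are just placeholders matching the earlier sketch):

```
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
```

The resulting visualization shows, per sample, how many reads survived filtering, denoising, merging, and chimera removal, which makes it straightforward to see whether a different truncation setting actually bought you more merged reads.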