Please Help! Truncation length and interpretation of denoising stats

Hey all,

Yesterday I imported my very first dataset into QIIME 2.
After reading quite a few threads about trimming and truncating, I tried it today with my data. The expected sequence length is 400-450 bp.

First try:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 1_16_1.qza \
  --p-trim-left-f 9 \
  --p-trim-left-r 7 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 210 \
  --o-table table_1.qza \
  --o-representative-sequences rep_1.qza \
  --o-denoising-stats denoise_1.qza

Second try:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 1_16_1.qza \
  --p-trim-left-f 9 \
  --p-trim-left-r 7 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 250 \
  --o-table table_2.qza \
  --o-representative-sequences rep_2.qza \
  --o-denoising-stats denoise_2.qza

Regarding the denoising stats, I would prefer the first try, because a higher percentage of the input passed the filter and a higher percentage of the input merged.
But I can't say whether the outcome is good enough to proceed, or how to interpret the percentage of input non-chimeric?!
Yes, I searched here for “interpretation of dada2 stats” but I couldn’t find sufficient basic information.

Also, at the top of the denoising-stats visualization there is an option to download a metadata.tsv file. Can I use that instead of creating the sample-metadata.tsv myself?

Thank you!

Update 1:
I have 1 control group and 3 treatment groups, with 2 technical replicates.
All of them show almost exactly the same Interactive Quality Plot. Why? Did I do something wrong?

Update 2:
  1. Trimming forward 9 and reverse 7, truncating forward 200 and reverse 200 (obviously has gap issues)
  2. Trimming forward 10 and reverse 10, truncating forward 240 and reverse 210
  3. Trimming forward 15 and reverse 15, truncating forward 240 and reverse 210
  4. Trimming forward 20 and reverse 20, truncating forward 240 and reverse 210
  5. Trimming forward 25 and reverse 25, truncating forward 240 and reverse 210
  6. Trimming forward 30 and reverse 30, truncating forward 240 and reverse 210
  7. Trimming forward 50 and reverse 50, truncating forward 240 and reverse 210
  8. Trimming forward 100 and reverse 100, truncating forward 240 and reverse 210
  9. Trimming forward 10 and reverse 10, truncating forward 220 and reverse 210

The last combination seems to be a good choice, but! Please, can someone explain to me what is going on here? It seems that the results get better the more I trim-left. That can't be right?! How should I decide where to trim, then?!

Is there no one who can help me with this?

Because I was so unsure where to trim, I simply went to my primer sequences and counted their lengths.

  Trimming forward 19 and reverse 20, truncating forward 220 and reverse 210

Are these numbers good enough to proceed? Do I have to be concerned about the increased percentage of non-chimeric reads compared to my earlier runs?

Thank you.

Hi @Lacona - please be patient, someone will be with you as soon as possible. In the meantime, I request that you re-familiarize yourself with the forum code of conduct:

https://forum.qiime2.org/faq#patience

Thanks so much for your understanding and support!


Hi, @Lacona :wave:

Congrats on importing your first data set into :qiime2:! :tada:

While I applaud your effort in testing out so many different combinations here, I think you really need to take a step back and make sure you understand how to choose trimming and truncation parameters based on your quality plot before you proceed.

Let’s start with trimming.

We can see by running qiime dada2 denoise-paired --help, or by viewing the documentation here, that trimming is used to trim off the first m bases of each sequence.

  --p-trim-left-f INTEGER
                         Position at which forward read sequences should be
                         trimmed due to low quality. This trims the 5' end of
                         the input sequences, which will be the bases that
                         were sequenced in the first cycles.      [default: 0]
  --p-trim-left-r INTEGER
                         Position at which reverse read sequences should be
                         trimmed due to low quality. This trims the 5' end of
                         the input sequences, which will be the bases that
                         were sequenced in the first cycles.      [default: 0]

With that in mind, can you see why systematically testing the DADA2 stats generated by trimming off more and more of the 5' end doesn't really make sense, given the quality plot you shared? Your goal should not be to generate "better" DADA2 stats, but to select the appropriate trim/trunc parameters based on your quality plot.
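As a plain-Python illustration (my own sketch, not QIIME 2 or DADA2 code), trim-left just discards the first m bases of every read:

```python
def trim_left(read: str, m: int) -> str:
    """Drop the first m bases, i.e. the 5' end sequenced in the first cycles."""
    return read[m:]

# Trimming more and more from the 5' end throws away real (often
# high-quality) biological signal, not just primer bases.
print(trim_left("ACGTACGTACGTACGT", 4))  # prints "ACGTACGTACGT"
```

Nothing about this operation looks at quality scores: it removes the same number of bases from every read, which is why it is mainly useful for cutting off primers of a known length.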

Please also take a look at the FAQ for dada2, especially the section about chimeras.

https://benjjneb.github.io/dada2/faq.html

I hope this helps!


Thank you!

That’s what I did after reading your comment:

I read through all the links and I also had a look at my sequences.
They show that the primers are still there, and that a percentage of around 75% non-chimeric reads should be the goal. Am I right?
I thought that one possibility to trim the primers would be trim-f 19 (= primer length) and trim-r 20 (= primer length) in the DADA2 denoising step. But then the percentage of non-chimeric reads is only around 50%.
I couldn't make trimLeft = c(19, 20) from Benjamin Callahan's FAQ work, so I don't know if that code would generate different results.
I will try cutadapt also, but that will take some time.

No, I don’t understand why I get “better” results by trimming 100 from both forward and reverse reads. I would expect a worse outcome from cutting off so much of the sequence.

I also read quite a lot about truncating, which ended in my choosing f220 and r210: that gives a combined length of 430 for an expected amplicon length of around 400, leaving about 30 bp for a good overlap. I also tried f210 and r210, but there was almost no merging any more, which means the overlap was not sufficient. Truncating at 240 and 210 generated longer pieces, but longer pieces mean a higher probability of errors, and a lower probability of error-free reads.
So, I would think my truncating should be fine?
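That overlap arithmetic can be checked with a quick sketch (plain Python, my own illustration; DADA2 needs roughly 12 bp of overlap to join reads by default, if I remember correctly, and this simple version ignores the extra bases removed by trim-left):

```python
def merge_overlap(trunc_f: int, trunc_r: int, amplicon_len: int) -> int:
    """Bases of overlap left after truncation; negative means a gap,
    so the forward and reverse reads cannot be joined."""
    return trunc_f + trunc_r - amplicon_len

# The expected amplicon length in this data set is 400-450 bp,
# so check both ends of that range.
for f, r in [(220, 210), (210, 210), (240, 210)]:
    print(f, r, merge_overlap(f, r, 400), merge_overlap(f, r, 450))
```

Note that for the longer 450 bp amplicons, none of these combinations leaves the ~12 bp of overlap DADA2 wants, which could be one source of reads lost during merging.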

Again, thank you so much! It was very helpful and I will try cutadapt, now.

cutadapt update:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences 16_1_1.qza \
  --p-adapter-f GTGCCAGCMGCCGCGGTAA \
  --p-adapter-r CCGYCAATTYMTTTRAGTTT \
  --o-trimmed-sequences cutadapt_1.qza

Based on the Quality Plot:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs cutadapt_1.qza \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 200 \
  --o-table table_15.qza \
  --o-representative-sequences rep_15.qza \
  --o-denoising-stats denoise_15.qza

Verbose output:

=== Summary ===

Total read pairs processed: 23,742
Read 1 with adapter: 22,526 (94.9%)
Read 2 with adapter: 22,510 (94.8%)
Pairs written (passing filters): 878 (3.7%)

Total basepairs processed: 14,292,684 bp
Read 1: 7,146,342 bp
Read 2: 7,146,342 bp
Total written (filtered): 517,919 bp (3.6%)
Read 1: 253,706 bp
Read 2: 264,213 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 3’; Length: 19; Trimmed: 22526 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1

Bases preceding removed adapters:
A: 0.0%
C: 0.0%
G: 0.3%
T: 0.0%
none/other: 99.7%

Overview of removed sequences
length count expect max.err error counts
3 4 371.0 0 4
4 1 92.7 0 1
6 1 5.8 0 1
12 1 0.0 1 0 1
133 1 0.0 1 1
152 1 0.0 1 1
194 3 0.0 1 3
195 1 0.0 1 1
196 3 0.0 1 3
197 34 0.0 1 31 3
198 2 0.0 1 1 1
199 14 0.0 1 10 4
200 3 0.0 1 3
299 1 0.0 1 0 1
300 6 0.0 1 5 1
301 22450 0.0 1 18777 3673

=== Second read: Adapter 2 ===

Sequence: CCGYCAATTYMTTTRAGTTT; Type: regular 3’; Length: 20; Trimmed: 22510 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Bases preceding removed adapters:
A: 0.0%
C: 0.1%
G: 0.0%
T: 0.1%
none/other: 99.8%

Overview of removed sequences
length count expect max.err error counts
3 10 371.0 0 10
4 14 92.7 0 14
5 3 23.2 0 3
269 1 0.0 2 0 0 1
300 18 0.0 2 8 8 2
301 22464 0.0 2 14428 6759 1277
=== Summary ===

Total read pairs processed: 17,794
Read 1 with adapter: 17,003 (95.6%)
Read 2 with adapter: 16,956 (95.3%)
Pairs written (passing filters): 581 (3.3%)

Total basepairs processed: 10,711,988 bp
Read 1: 5,355,994 bp
Read 2: 5,355,994 bp
Total written (filtered): 342,705 bp (3.2%)
Read 1: 167,872 bp
Read 2: 174,833 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 3’; Length: 19; Trimmed: 17003 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1

Bases preceding removed adapters:
A: 0.0%
C: 0.0%
G: 0.3%
T: 0.0%
none/other: 99.6%

Overview of removed sequences
length count expect max.err error counts
3 5 278.0 0 5
4 4 69.5 0 4
9 1 0.1 0 1
126 1 0.0 1 1
127 1 0.0 1 0 1
128 1 0.0 1 1
132 2 0.0 1 1 1
152 1 0.0 1 1
194 4 0.0 1 4
195 3 0.0 1 2 1
196 1 0.0 1 1
197 26 0.0 1 20 6
198 4 0.0 1 3 1
199 1 0.0 1 1
200 5 0.0 1 3 2
300 3 0.0 1 0 3
301 16940 0.0 1 14137 2803

=== Second read: Adapter 2 ===

Sequence: CCGYCAATTYMTTTRAGTTT; Type: regular 3’; Length: 20; Trimmed: 16956 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Bases preceding removed adapters:
A: 0.0%
C: 0.0%
G: 0.0%
T: 0.0%
none/other: 99.9%

Overview of removed sequences
length count expect max.err error counts
3 7 278.0 0 7
4 10 69.5 0 10
11 1 0.0 1 0 1
300 3 0.0 2 0 2 1
301 16935 0.0 2 10581 5279 1075
=== Summary ===

Total read pairs processed: 15,656
Read 1 with adapter: 14,964 (95.6%)
Read 2 with adapter: 14,894 (95.1%)
Pairs written (passing filters): 538 (3.4%)

Total basepairs processed: 9,424,912 bp
Read 1: 4,712,456 bp
Read 2: 4,712,456 bp
Total written (filtered): 318,505 bp (3.4%)
Read 1: 156,921 bp
Read 2: 161,584 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 3’; Length: 19; Trimmed: 14964 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1

Bases preceding removed adapters:
A: 0.0%
C: 0.0%
G: 0.2%
T: 0.0%
none/other: 99.7%

Overview of removed sequences
length count expect max.err error counts
3 4 244.6 0 4
5 1 15.3 0 1
127 1 0.0 1 1
128 1 0.0 1 1
147 1 0.0 1 1
194 6 0.0 1 6
195 2 0.0 1 2
197 15 0.0 1 14 1
198 1 0.0 1 1
199 1 0.0 1 1
200 6 0.0 1 6
300 5 0.0 1 1 4
301 14920 0.0 1 12796 2124

=== Second read: Adapter 2 ===

Sequence: CCGYCAATTYMTTTRAGTTT; Type: regular 3’; Length: 20; Trimmed: 14894 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Bases preceding removed adapters:
A: 0.0%
C: 0.0%
G: 0.0%
T: 0.1%
none/other: 99.8%

Overview of removed sequences
length count expect max.err error counts
3 13 244.6 0 13
4 6 61.2 0 6
5 1 15.3 0 1
7 1 1.0 0 1
300 4 0.0 2 0 1 3
301 14869 0.0 2 9463 4533 873

Oh, I am so sorry, and I have to apologize!
To be honest, I wasn’t expecting the QIIME developers to answer my very basic questions. I thought other users would help me… and after my thread had about 18 views and no answers, I thought maybe I hadn’t expressed myself clearly enough.
Thank you for the link, it was very helpful and eye-opening!!!


I think you are answering your own question here, @Lacona. Can you really say these results are better if you are trimming away so many (high quality) read sequences?

A few links to help troubleshoot issues with low non-chimeric reads were recently shared in a response here.


@andrewsanchez:

Thank you for not giving me the answers right away, but pushing me in the right direction! That will help a lot with my understanding of what is happening and the analysis of the rest of my data.

  1. I know that the results can’t really be better when I cut off half of the sequence.
    But why do I get these stats, then?
    Trimming forward 19 and reverse 20, truncating forward 220 and reverse 210
    vs.
    Trimming forward 100 and reverse 100, truncating forward 220 and reverse 210

  2. I went through all of these links, and even more, to understand what exactly I should do to reduce my chimeras. Although I am still not sure if I have to worry about my chimeras at all, because maybe my non-chimeric percentage is sufficient? The primers should be trimmed with dada2 denoise, --p-trim-left-f 19 and --p-trim-left-r 20.
    After all that research, I wanted to compare cutadapt + DADA2 stats with DADA2-only stats. Therefore I tried to cut the exact primer sequences with the cutadapt plugin, but then the DADA2 stats showed … nothing any more (see my last post).
    And I can’t figure out why, or what happened, or how to fix it, or whether it is even necessary to use cutadapt.

After adding --p-min-fold-parent-over-abundance 4 to my DADA2 command, I got better results, even if I don’t know whether that is the right way to get rid of chimeras.

Thanks for your patience, @Lacona. It’s been a busy time around here.
Before I respond to your question, I need to ask you to work on focusing your topics. It can be very difficult to figure out what your actual question is, because there is just so much going on here. What are the fewest words you can use, and the least information you can provide, that will make it absolutely clear what you have done, and what you are struggling to understand. The screen captures and commands you share help with this, but the whole picture is pretty overwhelming.

Working to restrict your topics to one question will help with this. In any good discussion, related things will naturally come up. This is great, but unless they’re closely tied to the original topic, please consider framing them as individual questions in a separate topic. For example, cutadapt questions should generally not be asked/answered in a topic about DADA2 denoising stats and trunc length unless they’re very relevant and rather brief.

In the interest of wrapping up this long/complicated thread, I’m going to attempt to answer your most relevant questions at the conceptual level. Please feel free to respond here if there’s anything you don’t understand about my response, and please post further questions in their own concise topics.


You’ve asked many questions in the posts above. Rather than attempt to answer them all, I’m going to keep things high-level.

What is “good enough”?

Rather than accept some arbitrary external threshold for “good enough”, let’s assume your goal is to achieve “the best that is possible” after taking the limitations of your data and tools into account. A naive first objective would be to try to keep all the data. Sequencing is expensive, so why throw out data, right?

We have to modify this goal, because amplification and sequencing introduce “noise” into our data. Noisy data contains false (not representative of the actual sequence) or untrustworthy (has a low q-score, so we think it might be wrong) bits. We attempt to correct this by “denoising” it. Our goal with denoising, then, is to keep as much of the “true” read data as possible, while getting rid of the “false” or untrustworthy data. After all, you’re not interested in arbitrary strings of A’s, C’s, T’s, and G’s. You care about the biological sequences those letters represent.

Filtering

Filtering removes sequences that contain untrustworthy data. If a sequence has a sufficiently low q score at position x, DADA2 removes the whole sequence. By trimming away the positions with low mean q-scores, we can prevent DADA2 from filtering out entire reads. This is a compromise - we are effectively deleting the untrustworthy parts of all of our sequences. By cutting off the bad bits, we save those reads from getting filtered out. By getting rid of some untrustworthy data, we keep more trustworthy sequences.
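One of DADA2’s filtering criteria can be sketched like this: truncate each read, sum the per-base error probabilities implied by the quality scores, and drop the read if that sum is too large (the “expected errors” or maxEE filter; the threshold of 2.0 below is my assumption here, so check qiime dada2 denoise-paired --help for the real parameter names and defaults):

```python
def expected_errors(quals) -> float:
    """Sum of per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, trunc_len: int, max_ee: float = 2.0) -> bool:
    """Reads shorter than trunc_len are dropped outright; otherwise a read
    survives only if its truncated expected errors stay under max_ee."""
    if len(quals) < trunc_len:
        return False
    return expected_errors(quals[:trunc_len]) <= max_ee

good = [38] * 240             # uniformly high quality
bad = [38] * 200 + [8] * 40   # quality crashes near the 3' end
print(passes_filter(good, 240), passes_filter(bad, 240))  # True False
print(passes_filter(bad, 200))  # True: truncating before the crash saves it
```

This is why truncating at the right position rescues reads: the low-quality tail is what pushes the expected-error sum over the limit, and cutting it off keeps the trustworthy part of the read in play.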

Joining paired-end reads

Joining paired-end reads allows us to truncate untrustworthy data from the “middle” of our reads, without losing the ends of those reads. It presents us with another compromise, though. As you’re aware, if you cut too much out of the “middle” of your read, DADA2 won’t have the information it needs to join it properly.

Remember, DADA2 is doing its best to give you the actual biological sequences from your samples. If you cut the middle out of a sequence, it can’t reliably do that, so it again drops the whole sequence. Why? It’s important that you can trust the data you have, even if that means you have less of it.

Truncation should be used to cut away untrustworthy data. Reads can only be joined if they contain trustworthy data (otherwise, they would have been dropped during filtering), and if the f and r sections together are longer than the target amplicon. This means that joined reads are generally good data, and represent the full amplicon. A good first goal, then, is to maximize the number of successfully merged reads, by truncating just enough bad positions that you keep the most true sequences.

trim-left

In an ideal world, your sequences begin at the beginning of the target amplicon, and end at its end. This lets you compare them to other sequences from the same amplicon easily, even across studies. For this reason, many people only use trim-left to cut away non-sequence data like primers. If you have enough bad-quality positions at the 5’ ends of your reads that you are losing many samples/sequences to filtering, you can use trim-left to remove biological data. This is OK; just remember that it, too, is a compromise.

chimeras

Chimeras are not useful biological data. Increasing non-chimeric reads is good, because having more reads is good, and non-chimeric reads are potentially useful. Our goal is to get rid of the noise, so chimera removal is generally a good thing.

Despite this, if you are losing more than 25% of your input reads as chimeras, take that as an indicator that you are probably doing something wrong. Often this means that there are ambiguous nucleotides in your data, because primer sequences are still included. Remove them in DADA2 with trim-left, or using a tool like q2-cutadapt. There are tons of posts on this forum about trimming primers. Start there, and feel free to create a new topic if you have a specific question.

Note: The samples you have screenshots of are not losing more than 20% to chimeras. This is probably not an issue for you.
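As a quick sanity check against that 25% rule of thumb, the loss can be computed directly from the counts in a denoising-stats table (the numbers below are hypothetical; the input/merged/non-chimeric counts correspond to the columns DADA2 reports):

```python
def chimera_loss(input_reads: int, merged: int, non_chimeric: int) -> float:
    """Fraction of the original input reads that were removed as chimeric."""
    return (merged - non_chimeric) / input_reads

# Hypothetical per-sample counts: (input, merged, non-chimeric)
samples = {"sample-1": (10000, 9500, 9000), "sample-2": (10000, 8000, 5000)}
for name, (inp, merged, non_chim) in samples.items():
    loss = chimera_loss(inp, merged, non_chim)
    flag = "check your primers!" if loss > 0.25 else "ok"
    print(f"{name}: {loss:.0%} of input lost to chimeras ({flag})")
```

Anything comfortably below the 25% line, like sample-1 here, is nothing to worry about; a sample like sample-2 is the kind that suggests leftover primers or ambiguous bases.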

If you still have high levels of chimera loss after you have removed all primer data, then you may just have a lot of chimeras in your data. Adjusting p-min-fold-parent-over-abundance might be a good choice for you - just keep the values reasonable, or you run the risk of including a lot of “fake” reads in your data, which could skew your study results. This post covers the topic in wonderful detail.

IIRC, wet lab processes (e.g. high cycle counts) can increase the number of chimeras present in sequence data. If cleaning up primers and increasing min-fold cannot recover useful numbers of non-chimeric sequences from your data without accepting too many “false positives”, you may need to reconsider your protocols.

What is “good enough” (again)?

Many people, myself included, go looking for “the best” solution when a “good enough” solution would be adequate. Based on your screenshots, it looks like you’re capturing 10k reads from many of your samples after denoising. That might be enough sequencing depth to get you the statistical power you need. Every study is different, and only you know your study well enough to know if that’s adequate. For many studies, optimizing sequencing depth is much less important than the results of downstream analysis. Once you’ve chosen parameters that get you a good number of merged reads, I’d just move on and see what you can learn from your data.
