Help with understanding demux.qzv output and selecting --p-trunc-len and --p-trim-left values

Hello,

Attached is the summary of my results after demultiplexing. demux.qzv (291.2 KB)

I am not sure which values to use for --p-trim-left and --p-trunc-len for running DADA2.

Can someone please give me some directions on that?

Thank you, Joao

Hello,

The data that I posted above was before trimming the adaptors. Please see the data after trimming off adaptors. trimmed-paired-end-demux.qzv (296.6 KB)

I would appreciate some comments regarding the values to use for --p-trim-left and --p-trunc-len in DADA2.
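For concreteness, this is the command I am preparing to run; the input artifact is the trimmed demux file above, the output names are placeholders of my own choosing, and `<F>`/`<R>` are the truncation values in question:

```shell
# Sketch of the DADA2 call these parameters belong to
# (<F>/<R> are the values I'm asking about):
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-paired-end-demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f <F> \
  --p-trunc-len-r <R> \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```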

Thank you, Joao

Hi @JoaoGabrielMoraes,
Picking DADA2 trim/truncation values is a very common question on the forum, and luckily there are lots of previous answers and explanations floating around. Please have a quick search of the forum first using keywords like ‘DADA2’ and ‘parameters’; there are some comprehensive answers there. Try your luck, and if you still have questions, please let us know.


Dear @Mehrbod_Estaki,

Thank you for pointing me toward more resources on DADA2. I’ve looked through the forum but still have some unanswered questions. For example, from what I’ve read, a quality score above 30 indicates high quality.

Looking at my plots (figure above), I can see that in the forward reads the quality score drops below 30 after about 180 bases. Because DADA2 needs 20+ bp of overlap, I am thinking about truncating my forward reads at 180 bases and my reverse reads at 200. Is that correct?

I am not sure if I am being too strict. In the moving pictures tutorial, reads were only truncated at 120, even though quality scores dropped below 30 at about 100 bases, and at 120 bases the quality scores were extremely low. Am I being too strict in keeping only bases with quality scores above 30?

Thank you, Joao

Hi @JoaoGabrielMoraes,
Quality scores of 20+ are generally considered good quality. Yes, 30 is high, but that doesn’t necessarily mean you need to restrict yourself to that cutoff.

This has actually been changed to 12 nt in more recent versions of q2-dada2.

We need more info to be able to suggest anything here, because this really depends on the overlap region of your amplicons. Search the forum again; there are lots of examples of how this overlap region is calculated, and other considerations you should be aware of (e.g., here).

While it is generally a good idea to keep only high-quality reads for downstream analyses (garbage in, garbage out), we have moved away from relying so heavily on raw quality scores. In your case I think you are perhaps being a bit too strict; you could eyeball where the median score drops to 20 as a starting point. You can obviously truncate more if your overlap region allows it, but I think you can certainly get away with a less strict cutoff.


Dear @Mehrbod_Estaki,

Thank you for your prompt response and for pointing me in the right direction.

To generate our 16S data, we used the 515F-806R primers, which according to the earthmicrobiome website (https://earthmicrobiome.org/protocols-and-standards/16s/) generate an amplicon of ~390 bp. As the plots I posted show, our amplicons were paired-end sequenced (2x250 cycles), so there is about 110 bp of overlap.

Based on your response, I think it would be better, then, to truncate both my forward and reverse sequences at 230. This would give me ~70 bp of overlap, which should be more than enough. Please let me know your thoughts.
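To spell out my arithmetic (the shell arithmetic here is just for illustration; the ~390 bp amplicon size is the estimate from the earthmicrobiome page):

```shell
# Overlap = bases retained after truncation minus amplicon length.
AMPLICON=390                        # assumed amplicon size from the EMP page
echo $(( 250 + 250 - AMPLICON ))    # untruncated 2x250 reads: 110 bp overlap
echo $(( 230 + 230 - AMPLICON ))    # truncating both at 230: 70 bp overlap
```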

Once again, thank you very much for your help!

Hi @JoaoGabrielMoraes,
That would certainly work. If you want to retain even more reads, though, I would truncate a bit more, especially from the reverse reads. The more of the poor-quality tails you remove, the less likely reads are to be discarded during the initial filtering step. Optimizing this step means truncating as much of the poor-quality tail as you can without compromising merging. So while, as you say, 70 bp is safe, if it were me I would truncate a bit more, leaving about 30-40 bp of overlap.
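One way to think about it is to work backwards from a target overlap; using the ~390 bp figure you quoted (an assumption, as before):

```shell
# Combined retained length needed = amplicon length + target overlap.
AMPLICON=390        # the ~390 bp estimate quoted above
TARGET_OVERLAP=35   # middle of the 30-40 bp range
echo $(( AMPLICON + TARGET_OVERLAP ))   # 425 bases to retain across F+R combined
# e.g. --p-trunc-len-f 215 and --p-trunc-len-r 210 retain exactly 425
```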


Understood, thank you @Mehrbod_Estaki!!

Dear @Mehrbod_Estaki,

Before our discussion, I had run DADA2 with --p-trunc-len-f set to 180 and --p-trunc-len-r set to 200. Because the amplicon generated by the primers I used (515F-806R) is ~390 bp, merging should have been really bad, as there was minimal overlap between the forward and reverse reads.

In my next try, following our conversation, I set --p-trunc-len-f to 215 and --p-trunc-len-r to 210, so there was an overlap of ~35 bp. Thus, I thought this second run would generate better results.

However, looking at the DADA2 denoising stats, the first run, which had essentially no overlap, actually generated better results. Please see the attached results from run 1 and run 2.

Run-1_denoising-stats-dada2.qzv (1.2 MB)

Run-2_denoising-stats-dada2.qzv (1.2 MB)

Any thoughts on that?

Thank you in advance, Joao

Hi @JoaoGabrielMoraes,
Taking a quick step back, something I should have noticed earlier: you mention using the 515F-806R primer set, which gives an amplicon size of 806-515 ≈ 291 nt, not 390. The 390 you saw on that page is the total amplicon size including all the non-biological sequences such as linkers, adapters, and barcodes; here we are only concerned with the biological sequence.
So with your 2x250 bp PE reads, you have 500-291 ≈ 209 bp of overlap.
In both of your runs, merging is not the issue: look between the “denoised” and “merged” columns and you’ll see almost all of your denoised reads are being merged. Where you see the biggest drop is at the initial filtering step; look at the difference between the “input” and “filtered” columns. Run 1 performs better because it trims more of the poor-quality tails, so fewer reads are discarded at the initial filtering step. This agrees with what I described earlier. Re-run DADA2, but truncate significantly more off the 3’ ends. Since you have about 209 bp of overlap, you can comfortably truncate, let’s conservatively say, a combined 260 off the 3’ ends. This should significantly improve the number of reads retained at the filtering step.
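The same arithmetic as a quick sanity check:

```shell
# Biological amplicon length for 515F-806R, and the resulting overlap
# with 2x250 PE reads (primer coordinates as given above):
echo $(( 806 - 515 ))         # ~291 nt biological amplicon
echo $(( 250 + 250 - 291 ))   # ~209 bp overlap before any truncation
```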
Try that and let us know how it goes.


Hi @Mehrbod_Estaki,

Thank you so much for clarifying that. I have re-run the analysis as suggested, with --p-trunc-len-f set to 110 and --p-trunc-len-r set to 150.

Please see the attached results denoising-stats-dada2.qzv (1.2 MB)

Could you please share your thoughts on what are good benchmarks for:

  1. percentage of input passed filter
  2. percentage of input merged
  3. percentage of input non-chimeric

Also, what are your suggestions for troubleshooting samples with low values?

In our data (attached), the average percentage of input that passed filtering was 80%; the average percentage of input merged was 32%; and the average percentage of input non-chimeric was also 32%.

Although the percentage of input that passed the filter was not bad, merging was really low (over 40% of the samples had less than 10% of their input merged).

I would appreciate any suggestions on how to proceed to troubleshoot our data.

Thanks again for your help!!

Hi @JoaoGabrielMoraes,
That is certainly an improvement, but there is still plenty of room for more. You should be able to merge almost all of the reads that survive the denoising step.

Since the issue now is insufficient merging, adjust for that parameter: truncate less to allow a larger overlap region. I think what has happened is that some bases were actually trimmed from both your 5’ and 3’ ends in the earlier cutadapt step, so your reads are not really 250 bp (your quality-score plot certainly suggests this too) and our calculations were off. Factor the length of those removed sequences into the equation.
You’ll have to do some fine-tuning to find the optimal parameters for your data, but I think you now have all the information you need. Let’s start by setting the truncation length to 170 for both F and R, and work from there.
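Concretely, a sketch of that re-run; I’m assuming the same input artifact and output names from your earlier attempts:

```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-paired-end-demux.qza \
  --p-trunc-len-f 170 \
  --p-trunc-len-r 170 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```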

Remember that you can always just use the forward reads, setting the truncation length to 150-170. This would give you the maximum number of retained reads, since there is no merging step. You can also use this as your “gold standard” to compare your PE attempts against, to see how close you can get to those values. As for the samples with very low reads: that may just be an artifact of sequencing, or perhaps there is a biological reason why some samples have lower depth? Low-biomass samples? Negative controls?


Hi @Mehrbod_Estaki,

Thanks again for your help, I’ll do as suggested.

How should I go about using only my forward reads? Should I start from the beginning with just the forward-read fastq files and treat my data as single-end sequencing?

I tried running qiime dada2 denoise-paired specifying only the --p-trunc-len-f parameter, but I get an error saying that --p-trunc-len-r is missing.

Thank you, Joao

Hi @JoaoGabrielMoraes,
No need to go back to the beginning; just use qiime dada2 denoise-single instead, and the reverse reads will be ignored.
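A minimal sketch, assuming the same demux artifact you’ve been using (denoise-single will accept the paired-end artifact and read only the forward sequences); the 150 here is just the low end of the 150-170 range I suggested:

```shell
qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed-paired-end-demux.qza \
  --p-trunc-len 150 \
  --o-table table-single.qza \
  --o-representative-sequences rep-seqs-single.qza \
  --o-denoising-stats denoising-stats-single.qza
```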


Thanks again @Mehrbod_Estaki!

I followed your instructions. After quite a few attempts, setting the truncation length to 135 for both F and R reads worked best for me (please see attached results).

135_135_denoising-stats-dada2.qzv (1.2 MB)

If I truncate more (130 F and 130 R), merging is compromised: 130_130-denoising-stats-dada2.qzv (1.2 MB)

When truncating less (140 F and 140 R), I get slightly worse results. 140_140_denoising-stats-dada2.qzv (1.2 MB)

In your experience, are the results I got truncating at 135 F and 135 R close to what you normally see?

Thanks again, Joao

Hi @JoaoGabrielMoraes,

All three results are slightly below what I would expect from this primer set and the quality plot you originally posted; however, you certainly still have sufficient reads to move forward (your lowest sample still has over 6.5k reads). It may just be the nature of your samples/sequences.
Did you try the forward-reads-only run I recommended, to compare against? If retaining more reads is your objective, that is probably your best bet.
The only caveat is that if you are operating at the limits of your overlap/merging calculations, then where you truncate more you risk removing taxa whose amplicons are naturally a bit longer, introducing bias. Say those longer taxa naturally occur more frequently in one of your groups: you could be removing them and thus introducing artificial signals. For that reason I like to stay on the safe side when doing PE and leave an additional 10-20 nt for this ‘natural variation’ in length. This is not an issue if you are using only the forward reads.
So you may want to compare your runs to make sure no significant groups are being removed in the 135-truncation run (since that seems to be the one with the best results).
