Confusion about selecting the truncation length and quality score for denoising

Hi everyone,

I have a doubt about selecting the truncation length for my samples. I imported the file using SequenceWithQuality. Is it good to select a trunc-len of 220 and a quality score of 20 for denoising? I have attached both the screenshot and the .qzv file for reference.


LogMPIE.qzv (307.6 KB)
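
For reference, the import step looked roughly like this (the manifest file name and input format below are placeholders, since the exact format depends on how the reads are organized):

# Sketch only: manifest.tsv and the input format are placeholders
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path Log_Results/LogMPIE.qza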

@Sreevatshan, thanks for sharing your data. The “best” denoising parameters will depend on your unique study.

There are many good topics about selecting trim lengths etc. on this forum, which I’d encourage you to read if you haven’t already done so.

As a starting place for discussion, why did you select 220 as your truncation length?

Thank you for the reply, @ChrisKeefe.
I selected a trunc-len of 220 because the quality of the sequences drops after that position. Yes, I have read other topics on the forum related to truncation length.

Good start, then. What is your overall goal in selecting that trunc length? In other words, what effect(s) will that trunc length have on your data, that you’re trying to optimize?

@ChrisKeefe, around 39 samples (out of 1004) were discarded and many sequences didn't pass the filter. I ran with these parameters to check the effect. I have attached the denoising stats for reference. denoised_stats.tsv (36.1 KB)

@Sreevatshan, based on those denoising stats, it looks like you should try some other truncation lengths. You’re losing most of your data.

Do you understand how to interpret DADA2’s denoising stats? If not, search the forum for “interpret DADA2 stats”. There’s a lot of good information here.

Once you understand the stats, you should be able to guess why you’re losing so many sequences. You can then look back at the mean quality scores on your interactive quality plot (pictured above), and choose a better truncation point.
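
If you need to regenerate that quality plot, a command along these lines should do it (assuming Log_Results/LogMPIE.qza is your demultiplexed artifact):

# Regenerates the interactive quality plot from the demultiplexed reads
qiime demux summarize \
  --i-data Log_Results/LogMPIE.qza \
  --o-visualization Log_Results/LogMPIE.qzv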

Good luck!
Chris


Hey @ChrisKeefe,
Update 1: I reduced my trunc-len to 188, kept the q-score at 20, and ran the denoising step again. I have attached the denoising stats below. 188-denoised-stats.tsv (37.8 KB)
As you can see, only very few sequences from each sample passed this filter.
So I tried again with a different trunc-len of 100, again with a q-score of 20. This time it was not as bad as the previous two runs, but still not good enough to go with. 100-denoised-stats.tsv (41.4 KB)
As you suggested, I looked through related answers on the forum, and I think it is better to go with a trunc-len of 0 and a q-score of 20, because that should give the maximum number of good-quality sequences. What do you think? Is it better to go with this? If not, can you give me some suggestions?

@Sreevatshan, can you paste the full command you’re running? Your sequence quality isn’t excellent, but I don’t think it’s so bad that you should be losing 90% of your sequences.

Hey @ChrisKeefe, this is the command and the parameters I used for denoising.
time qiime dada2 denoise-single \
  --i-demultiplexed-seqs Log_Results/LogMPIE.qza \
  --o-representative-sequences Log_Results/220-log-rep.qza \
  --o-table Log_Results/220-log-table.qza \
  --o-denoising-stats Log_Results/220-log-stat.qza \
  --p-trunc-len 220 \
  --p-trunc-q 20 \
  --verbose

Thanks, @Sreevatshan. The strategy people generally use with DADA2 is to select trim/trunc parameters in such a way that the lowest-quality positions are removed - often by truncating immediately before the first instance of a below-quality-threshold base. Try re-running without p-trunc-q, and see what you get.
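
For example, your command from above with --p-trunc-q dropped and everything else left as-is would look like this (you may want to rename the outputs so you don't overwrite your earlier results):

time qiime dada2 denoise-single \
  --i-demultiplexed-seqs Log_Results/LogMPIE.qza \
  --o-representative-sequences Log_Results/220-log-rep.qza \
  --o-table Log_Results/220-log-table.qza \
  --o-denoising-stats Log_Results/220-log-stat.qza \
  --p-trunc-len 220 \
  --verbose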

Big-picture, you might want to spend some time with the QIIME 2 tutorials, the DADA2 paper, and/or reading forum posts. It’s super useful to really understand why you’re using the parameters you have chosen.

Chris


Hey @ChrisKeefe,

Thanks for pointing me in the right direction; I just now understood why we use this step. Anyway, I ran the step with a trunc-len of 230, and the results were better compared to the other runs, but still not the best. Maybe I will also try it with a trunc-len of 220 and select whichever gives the better result. metadata.csv (42.3 KB)
This one is with a trunc-len of 230.

@Sreevatshan, you’re still losing the majority of your sequences. Can you look at the metadata.csv you shared recently and tell me why? If not, take your best guess, and tell me why you think that’s the case.

Hey @ChrisKeefe, yes, I am losing the majority of my sequences: only around 10 samples retain about 80%, others retain around 40% or 30%, and many retain even less than that. It may be due to the trunc-len I selected for this run. But since the quality only drops around 220-230, I still don't understand why it produced so little output.

Why?

Is that true? At what position does the mean quality score drop below 30?

forward-seven-number-summaries.tsv (18.5 KB) As we can see, the mean quality drops below 30 after positions 220-230 (and at some middle positions too), hence I selected 230 for this run to check it out.

Are you using --p-trim-left to deal with the poor-quality positions at the 5' end? Ideally, you keep those positions so that it's easier to do meta-analysis, but if they're causing a significant amount of sequence loss, it might be best to drop them.

The worst positions in your plots are clearly the early and late positions, but these "middle" positions might be relevant too. For example, position 206 also drops to Q=26.

Try trimming off the bad positions at the beginning, and truncating at 205, and let's see what happens.
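
For example, something along these lines; the --p-trim-left value below is only a placeholder, so choose it based on where the poor-quality 5' positions end in your quality plot, and rename the outputs however you like:

# NOTE: --p-trim-left 10 is a placeholder value; set it from your quality plot
time qiime dada2 denoise-single \
  --i-demultiplexed-seqs Log_Results/LogMPIE.qza \
  --o-representative-sequences Log_Results/205-log-rep.qza \
  --o-table Log_Results/205-log-table.qza \
  --o-denoising-stats Log_Results/205-log-stat.qza \
  --p-trim-left 10 \
  --p-trunc-len 205 \
  --verbose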

Going forward, please consider sharing .qza or .qzv files rather than .csv files. It often makes troubleshooting easier, because QIIME 2 artifacts include all the details of the commands you ran to produce them.

Hey @ChrisKeefe,

These were the results:
stats2.qzv (1.2 MB) rep-seqs2.qza (3.8 MB) table2.qzv (3.4 MB) table2.qza (4.6 MB) stats2.qza (30.7 KB)

I can't see any significant changes compared to the previous results.

If I'm doing my math right, these params netted 89,456,115 more reads than you captured in your metadata.csv above - a nearly 2x improvement. However, I agree that the outcome may still not be ideal.

What kind of data is this? These aren't 16S sequences, are they? It looks like you've got some pretty dramatic variation in sequence lengths. The type of data you're working with can be important to what you do with it.

Some 16S data:
[image]

Your data:
[image]