Very low denoising stats from DADA2 using Ion Torrent data

KMaki · March 25, 2020, 4:50pm

Hello,

@Jen_S and I were able to successfully run vsearch for taxonomic classification in our pipeline after running DADA2 with the denoise-pyro option, since we are using Ion Torrent data. When we examined the results using the taxa barplot, one of the regions looked decent (not great) and one barely classified anything. Since we are using mock samples, we are able to compare the expected composition with our results. See below:

V2 region (green is unassigned)

V4 region (much lower percentage of unassigned, but Staph genus-level taxa are hugely overrepresented)

This was concerning, so we worked backwards to see where the problem was. We had no issues importing the files as a QIIME artifact, and cutadapt was then run to remove adapters and separate the reads by V region.

We used DADA2 denoise-pyro, and when we looked at the denoising stats, we realized this is where our problem lay. For the V2 region (many unclassified taxa), less than 1% of input passed the filter. See the example below (this is for only 2 samples from 1 run, but other runs for V2 looked similar):

Our better performing region still only had about a 20-30% pass rate (see below):

Comparing this to the results from the Parkinson’s tutorial, it is clear why we have such a discrepancy in our taxonomic classification.

Out of curiosity, we also compared the results from qiime dada2 denoise-pyro and qiime dada2 denoise-single, and the denoising stats were the same.

Below is our dada2 script:

qiime dada2 denoise-pyro \
  --p-trim-left 15 \
  --p-trunc-len 250 \
  --i-demultiplexed-seqs ./v2f/run01_v2f_trimmed.qza \
  --o-table ./v2f/dada2_pyro/dada2_pyro_run01_v2f_table.qza \
  --o-representative-sequences ./v2f/dada2_pyro/dada2_pyro_run01_v2f_rep_seq.qza \
  --o-denoising-stats ./v2f/dada2_pyro/dada2_pyro_run01_v2f_stats.qza

Do you have any suggestions of changes we can make to improve what passes through the filter?

For our V2 region, since the percentage of input passing the filter is so low (we would have to improve it by >95% to match the Parkinson’s denoising stats), can we trust the data from this region?

Thank you!
Katherine

colinbrislawn · March 25, 2020, 9:07pm

Hello Katherine,

Good detective work! I concur with your conclusions. There has got to be a way to get more reads past the filter.

DADA2 first trims reads using the trim-* and trunc-* settings, and then filters them by expected errors. So by trimming off additional low-quality bases, or by relaxing the expected error threshold, you should keep more reads.
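As a rough illustration of why trimming helps: the expected number of errors in a read is commonly defined as the sum of the per-base error probabilities, 10^(-Q/10) for a Phred score Q. The sketch below uses made-up quality scores (not taken from these runs) and a hypothetical helper name, `expected_errors`, to show how dropping a low-quality tail lowers the total.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

# Made-up read for illustration: 150 bases at Q30, then 100 bases at Q10.
quals = [30] * 150 + [10] * 100

full_ee = expected_errors(quals)         # whole 250-base read
trunc_ee = expected_errors(quals[:150])  # same read truncated at 150 bases

print(f"Expected errors at 250 bases: {full_ee:.2f}")
print(f"Expected errors at 150 bases: {trunc_ee:.2f}")
```

With a threshold of 2 expected errors, the full-length version of this invented read would be filtered out, while the truncated version would be kept.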

Because the regions are different lengths and have different quality, you might want to trim at different levels for each of them. How do the quality score plots look for each of your regions?

Colin

KMaki · March 31, 2020, 7:26pm

Hey Colin,

Sorry for the delay here, and thank you for the response. Our initial workflow was to run the quality score plot on the full sequences. We then ran cutadapt and dada2 on the V2 segment, but used the quality information from the full sequences to guide our dada2 parameters. We had previously run the quality score visualization on the sequences after cutadapt, but the plots looked strange. We thought this might be because our mock samples had less sequence data than regular runs, hence our attempt to compile the sequence quality info across each whole run. Both the quality tables for all sequences in a run and the quality visualization after cutadapt are below.

This is the quality information from all sequences (before cutadapt) for 1 run. The quality stats are shown for base position 5 and base position 250 (where we have been truncating in dada2) because we were also curious why we were getting the minimum sequence length error after base position 25. All 14 runs have this error after base position 25 or 26. By the way, we only have 2 mock samples per run: one even and one staggered.

Below is run 1, position 5:

Run 1 quality table base position 254:

We are seeing the same type of quality table across all reads for other runs as well (see below)
Run 2

Run 3

If I compute quality statistics for just V2f, the quality score table is much noisier and the minimum sequence alert comes after 9 base pairs:

Run 1 v2f

Run 3 v2f

Is this where our problem is? Based on your post, I also tried adjusting the expected error rate to see how it changed the dada2 stats. As I decreased the expected error rate I got fewer sequences, and vice versa, but the change as I scaled the expected error from 2 to 4 was very modest.

All of the below stats are using: qiime dada2 denoise-pyro --p-trim-left 15 --p-trunc-len 250
run1_v2f default (error 2.0)

run1_v2f (error 1.0)

run1_v2f (error 4.0)

Does any of this info help indicate where we should adjust parameters or settings?

colinbrislawn · March 31, 2020, 11:45pm

Hello again,

Great! That's the right way to do this.

These are the first settings I would try adjusting, as shorter reads will have fewer expected errors.

How long are your reads? (I can't really read the x-axis labels on your graphs). Does 250 come before that large drop in quality we see in most runs? If those ending bases that are quite low quality are included, that will totally result in high expected errors and the reads will be removed.

You could also try truncating at a much lower number, say 100, 150, or 200, just to check if read length is the primary driver of high expected error rate and the filtered reads.
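One way to set up that comparison is to generate one denoise-pyro command per candidate truncation length. The sketch below only builds and prints the commands; it does not run them. The input path and `--p-trim-left 15` come from the script earlier in this thread, while the per-length output directories (`trunc_<N>`) and the helper name `denoise_command` are assumptions made purely for illustration.

```python
import shlex

def denoise_command(trunc_len, seqs="./v2f/run01_v2f_trimmed.qza", trim_left=15):
    """Build (but do not run) a denoise-pyro command for one truncation length."""
    # Hypothetical output layout; create the directory first if needed.
    out_dir = f"./v2f/dada2_pyro/trunc_{trunc_len}"
    return [
        "qiime", "dada2", "denoise-pyro",
        "--i-demultiplexed-seqs", seqs,
        "--p-trim-left", str(trim_left),
        "--p-trunc-len", str(trunc_len),
        "--o-table", f"{out_dir}/table.qza",
        "--o-representative-sequences", f"{out_dir}/rep_seq.qza",
        "--o-denoising-stats", f"{out_dir}/stats.qza",
    ]

# Candidate truncation lengths suggested above.
commands = [denoise_command(n) for n in (100, 150, 200)]
for cmd in commands:
    print(shlex.join(cmd))
```

Comparing the denoising stats from each run side by side should show whether the pass rate climbs as the truncation length drops.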

Let me know what you try next!

Colin

KMaki · April 1, 2020, 8:25pm

Hi Colin,

Thank you so much! Changing the truncation length made a huge difference. When I ran denoise-pyro with trim-left 15 and trunc-len 150, my percentage of input passing the filter increased dramatically, to 87% and 79%.

Is there a “rule of thumb” for the initial input passing filter number that we should be aiming for? If I increase the expected error to, say, 3, would that greatly increase the chance that my sequences would be denoised inappropriately? I know perfect is the enemy of good, but I want to aim for the highest input without sacrificing data quality.

Ion Torrent sequences are usually much longer, and in most of the interactive quality visualizations from my previous post the sequence quality drops off around bp 250 or so. I have seen others who use Ion Torrent data not truncating at all, or truncating at 250. Do you anticipate truncating at 150 causing an issue with Ion Torrent data specifically?

Thank you again for your help with this!!

Sincerely,
Katherine

colinbrislawn · April 2, 2020, 8:04pm

Hello Katherine,

Oh that's great news! I'm glad you are keeping the majority of your data.

Nope. This is a tradeoff between quality and quantity and this is your decision to make. This goes for both length and expected error.
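To make the tradeoff concrete, here is a simplified sketch of a length-then-expected-error filter. It is not DADA2's actual code: it only models two documented behaviours, that reads shorter than the truncation length are discarded and that truncated reads above the maximum expected errors are discarded. The reads, their lengths, and their quality scores are invented, and the names `expected_errors` and `passes` are hypothetical.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes(quals, trunc_len, max_ee):
    """Simplified filter: discard short reads, truncate, then apply the EE cap."""
    if len(quals) < trunc_len:  # too short to truncate, so discarded
        return False
    return expected_errors(quals[:trunc_len]) <= max_ee

# Invented reads of varying length: Q30 for the first 150 bases, Q15 afterwards.
lengths = (120, 180, 220, 260, 300)
reads = [[30] * min(n, 150) + [15] * max(n - 150, 0) for n in lengths]

for trunc_len in (100, 150, 250):
    for max_ee in (2.0, 4.0):
        kept = sum(passes(r, trunc_len, max_ee) for r in reads)
        print(f"trunc_len={trunc_len} max_ee={max_ee}: "
              f"kept {kept}/{len(reads)} reads, {kept * trunc_len} bases")
```

In this toy data, shorter truncation keeps more reads but fewer bases per read, and a looser expected-error cap only matters once the low-quality tail is included.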

Nope, I think the basic rules apply for both Illumina and Ion Torrent. If anything it's easier for Ion Torrent because you don't have to worry about read joining.

Colin
