@Jen_S and I were able to successfully run vsearch for taxonomic classification in our pipeline after running the DADA2 denoise-pyro option, since we are using Ion Torrent data. When we examined the results using the taxa barplot, one of the regions looked decent (not great) and one barely classified anything. Since we are using mock samples, we are able to compare the expected results against ours. See below:
This was concerning, so we worked backwards to see where the problem was. We had no issues importing the files as a QIIME artifact, and then cutadapt was run to remove adapters and separate the reads by V region.
We used DADA2 denoise-pyro, and when we looked at the denoising stats, we realized this is where our problem lay. For the V2 region (the one with many unclassified taxa), less than 1% of input passed the filter. See the example below (this is for only 2 samples from 1 run, but other runs for V2 looked similar):
We also compared the results from qiime dada2 denoise-pyro and qiime dada2 denoise-single just out of curiosity, and the denoising stats were the same.
Do you have any suggestions for changes we can make to improve what passes through the filter?
For our V2 region, since the percentage of input passing is so low and we would have to improve it by >95% to match the Parkinson’s denoising stats, can we trust the data from this region?
Good detective work! I concur with your conclusions. There has got to be a way to get more reads past the filter.
Dada2 first trims reads using the trim-* and trunc-* settings, and then filters them by expected error. So by trimming off additional low-quality bases, or by allowing a higher maximum expected error, you should keep more reads.
Because the regions are different lengths and have different quality, you might want to trim at different levels for each of them. How do the quality score plots look for each of your regions?
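To make the two filtering steps concrete, here is a minimal Python sketch of the idea. It is a simplified model and not the DADA2 implementation: it assumes the expected error of a read is the sum of the per-base error probabilities implied by the Phred scores, and that a read is kept when that sum does not exceed the threshold. The function names and the example read are made up for illustration.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, using p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_filter(phred_scores, trim_left=0, trunc_len=0, max_ee=2.0):
    """Simplified model: truncate, trim the left end, then filter on expected error.

    trunc_len=0 means no truncation. Reads shorter than trunc_len are discarded.
    """
    if trunc_len:
        if len(phred_scores) < trunc_len:
            return False
        phred_scores = phred_scores[:trunc_len]
    kept = phred_scores[trim_left:]
    return expected_errors(kept) <= max_ee


# Hypothetical read: 150 good bases (Q30) followed by 100 poor bases (Q12).
read = [30] * 150 + [12] * 100

for trunc_len in (250, 150):
    kept = read[:trunc_len][15:]
    print(trunc_len, round(expected_errors(kept), 3),
          passes_filter(read, trim_left=15, trunc_len=trunc_len))
```

In this toy read the low-quality tail contributes almost all of the expected error, so cutting it off changes the outcome far more than a moderate change to the threshold would.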
Sorry for the delay here. Thank you for the response. Our initial workflow was to generate the quality score plot for the full sequences. We then ran cutadapt and dada2 on the V2 segment, but used the quality information from the full sequences to guide our dada2 parameters. We had previously run the quality score visualization on the sequences after cutadapt, but the plots looked strange. We thought this was potentially because our mock samples had less sequence data than regular runs, hence our attempt to compile the sequence quality info across each whole run. Both the quality tables for all sequences in a run and the quality visualization after cutadapt are below.
This is the quality information from all sequences (before cutadapt) for 1 run. We included the quality stats at base position 5 and base position 250 (where we have been truncating in dada2) because we were also curious why we were getting the minimum sequence length error after base position 25. All 14 runs have this error after base position 25 or 26. By the way, we only have 2 samples per run for our mock sequences: one even and one staggered.
Is this where our problem is? Based on your post, I also tried adjusting the expected error rate to see how it changed the dada2 stats. As I decreased the expected error rate I got fewer sequences, and vice versa, but the change as I scaled the expected error from 2 to 4 was very modest.
All of the below stats are using: qiime dada2 denoise-pyro --p-trim-left 15 --p-trunc-len 250
run1_v2f default (error 2.0)
run1_v2f (error 1.0)
run1_v2f (error 4.0)
Does any of this info help identify where we should adjust parameters or settings?
The truncation length is the first setting I would try adjusting, as shorter reads will have fewer expected errors.
How long are your reads? (I can't really read the x-axis labels on your graphs). Does 250 come before that large drop in quality we see in most runs? If those ending bases that are quite low quality are included, that will totally result in high expected errors and the reads will be removed.
You could also try truncating at a much lower number, say 100, 150, or 200, just to check if read length is the primary driver of high expected error rate and the filtered reads.
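A quick way to run that check is to generate one denoise-pyro command per truncation length and compare the resulting denoising stats. The sketch below only builds and prints the commands. The input file name and the output directory names are hypothetical, and it uses --output-dir so that the individual output artifacts do not have to be named.

```python
import shlex


def denoise_pyro_command(demux_qza, trunc_len, trim_left=15, max_ee=2.0):
    """Build the argument list for one qiime dada2 denoise-pyro run."""
    out_dir = f"dada2_pyro_trunc{trunc_len}"  # hypothetical directory name
    return [
        "qiime", "dada2", "denoise-pyro",
        "--i-demultiplexed-seqs", demux_qza,
        "--p-trim-left", str(trim_left),
        "--p-trunc-len", str(trunc_len),
        "--p-max-ee", str(max_ee),
        "--output-dir", out_dir,
    ]


# "run1_v2f_demux.qza" is a placeholder for the artifact produced after cutadapt.
commands = [denoise_pyro_command("run1_v2f_demux.qza", n) for n in (100, 150, 200)]

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the lists can be passed to subprocess.run. As far as I know, QIIME 2 refuses to write into an output directory that already exists, so each run needs a fresh directory name.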
Thank you so much! Changing the truncation length made a huge difference. When I ran dada2-pyro with trim-15, trunc-150, my percentage of input passing filter increased dramatically, to 87% and 79%.
Is there a “rule of thumb” for the initial input passing filter number that we should be aiming for? If I increase the expected error to, say, 3, would that greatly increase the chance that my sequences would be denoised inappropriately? I know the enemy of good is perfect, but I want to aim for the highest input without sacrificing data quality.
The IonTorrent sequences are usually much longer, and the sequence quality dropoff in most of the interactive quality visualizations in my previous post is around bp 250 or so. I have seen others who use IonTorrent data not truncating at all, or truncating at 250. Do you anticipate truncating at 150 causing an issue with Ion Torrent data specifically?
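When comparing several parameter settings, it can help to recompute the percentage passing filter from an exported denoising stats table instead of reading it off each visualization. The sketch below assumes a tab-separated file with sample-id, input and filtered columns and a "#q2:types" row under the header. The sample names and read counts are made up, chosen only so that the output reproduces the 87% and 79% above.

```python
import csv
import io


def percent_passing_filter(stats_tsv_text):
    """Return {sample-id: percentage of input reads that passed the filter}."""
    reader = csv.DictReader(io.StringIO(stats_tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        if row["sample-id"].startswith("#q2:"):
            continue  # skip the column-type directive row
        n_input = int(row["input"])
        n_filtered = int(row["filtered"])
        result[row["sample-id"]] = 100.0 * n_filtered / n_input if n_input else 0.0
    return result


# Hypothetical table: the counts are illustrative, not real data.
example = (
    "sample-id\tinput\tfiltered\n"
    "#q2:types\tnumeric\tnumeric\n"
    "mock_even\t10000\t8700\n"
    "mock_staggered\t10000\t7900\n"
)

print(percent_passing_filter(example))
```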
Oh that's great news! I'm glad you are keeping the majority of your data.
Nope. This is a tradeoff between quality and quantity and this is your decision to make. This goes for both length and expected error.
Nope, I think the basic rules apply for both Illumina and Ion Torrent. If anything it's easier for Ion Torrent because you don't have to worry about read joining.