Quality process with DADA2 and number of features

Jibda · June 13, 2018, 2:37am

Hi Everybody

In my research group, we have decided to use QIIME 2 to analyze sequences from soil samples (Illumina MiSeq, paired-end, 16S rRNA gene V3-V4 region, 301 bp reads), but we have run into some trouble. First, choosing between DADA2 and Deblur wasn't easy, given the differences between the two algorithms. We are more inclined to work with DADA2 because in our preliminary test we got more OTUs (features) with it.

Next, we had to decide on the quality parameters. We ran DADA2 with --p-trunc-len-f 271 and --p-trunc-len-r 210 to truncate the 3' ends of the forward and reverse reads, leaving the other parameters at their defaults, and obtained 2,645 features. In another test with --p-trunc-len-f 270 and --p-trunc-len-r 209 we got 2,270 features. This puzzles us, because we are moving each setting by only 1 bp and the difference is quite big. We saw in the stats that we were losing reads during merging, so we ran another test with only the forward reads and got 4,698 features. That raises our other question: how often are only the forward reads used when the sequencing is paired-end? For us, the quality of the reverse reads is not so bad.

As you will notice, we don't have much experience in bioinformatics, so we would appreciate help with the quality-filtering parameters. I have attached the file for the raw sequences in case you can take a look.

Another important detail is that we previously worked with QIIME 1.9, and with a similar quality process we got around 16,000 OTUs (features). Any suggestions about this? We have compared the number of OTUs from QIIME 2 with the literature, and it seems low in comparison :S

Thanks!!!
16S-paired-end.qzv (284.7 KB)

Nicholas_Bokulich · June 13, 2018, 3:32pm

This is clearly a merging issue, as you say. How long should your amplicon be? Depending on the total length and the quality of the reads, it is not improbable that a 2 nt difference will lead to a major change in the number of features that successfully merge. Using only forward reads and obtaining many more features makes sense following this same logic: all features are being retained because you do not have merging issues.

If you use denoise-paired and the reads do not overlap sufficiently after quality filtering, both reads will be dropped. There is no way to have only the forward read retained. You need to aim to have at least 20 nt overlap to obtain successful merging. Length variation within the 16S can cause some shorter amplicons to still be dropped.

QIIME 1 does not have a similar quality process at all to QIIME 2. You are describing one of the major issues with OTU picking that dada2 and deblur seek to correct: OTU picking does not adequately control for sequencing error and other noise, leading to massively overinflated diversity, which must be aggressively filtered to obtain reasonable values. See this article for an example of this, and of how QIIME 1 OTU data should be further filtered.

The literature is most likely wrong. Older literature will have used OTU picking methods, and hence have inflated (wrong) alpha diversity estimates. It is possible that dada2 and deblur are overstringent and that some features that are filtered out (e.g., singletons) are real features, but these methods are much more likely to give accurate estimates of true diversity than old OTU-picked literature.

That's fine. Go with whatever makes you feel good, but keep in mind that unless you are running a true benchmark here (e.g., with a mock community or other sample with known composition), you are really just making an educated guess at which method seems to be the best for you; you don't actually know which is more accurate.

Good luck!

Jibda · June 13, 2018, 10:49pm

Thank you very much for your fast answer. Regarding the amplicon size, we use the 341f and 805r primers, so theoretically we have an amplicon of 464 bp, with natural variation of the 16S gene of +/- 20 bp.
Since the forward-only test gave us 4,698 features, we are changing the parameters to reach a greater number of features while using both the forward and the reverse reads. I know that the merging is carried out inside DADA2, but I'm not sure what minimum overlap length between forward and reverse reads and what maximum number of mismatches in the overlap DADA2 uses by default; these things aren't clear in the process or in the script. Do I have to join the reads with another tool, such as vsearch, to control these two parameters?

Thanks!

Nicholas_Bokulich · June 14, 2018, 6:59pm

Again, you need to aim for ~20 nt of overlap, and try to account for shorter amplicon variants when calculating this overlap.

You used these trim settings:
271 + 210 = 481; 481 - 464 = 17 nt overlap avg
270 + 209 = 479; 479 - 464 = 15 nt overlap avg

Neither of these is entirely sufficient. So it makes sense that you are losing a lot of seqs during merging, and it also makes sense that a 2 nt difference will cause many more seqs to drop!
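
The overlap arithmetic above can be sketched in Python as follows. This is only a rough illustration: `expected_overlap` is a hypothetical helper, not part of QIIME 2 or DADA2, and it assumes the simple model overlap = forward truncation + reverse truncation - amplicon length, using the 464 bp nominal length and the +/- 20 bp variation quoted earlier in the thread.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Nominal overlap (nt) between truncated forward and reverse reads.

    Hypothetical helper: assumes both reads span the amplicon from opposite
    ends, so anything beyond the amplicon length is shared between them.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


AMPLICON_LEN = 464   # nominal 341f/805r amplicon length quoted in the thread
LENGTH_SPREAD = 20   # +/- natural length variation quoted in the thread

for trunc_f, trunc_r in [(271, 210), (270, 209)]:
    for amplicon in (AMPLICON_LEN - LENGTH_SPREAD,
                     AMPLICON_LEN,
                     AMPLICON_LEN + LENGTH_SPREAD):
        overlap = expected_overlap(trunc_f, trunc_r, amplicon)
        print(f"trunc {trunc_f}/{trunc_r}, amplicon {amplicon} bp: "
              f"overlap {overlap} nt")
```

A negative result means the truncated reads would not reach each other at all. Under this arithmetic alone, the overlap shrinks as the amplicon gets longer, so it is worth checking both ends of the expected length range.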

Aim for ~20 nt of overlap on your shortest variants. So increase trim lengths, if you can.
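
As a rough guide to how far the truncation lengths would need to move, the sketch below computes the combined truncation length needed for a chosen overlap target. `required_total_truncation` and `truncation_shortfall` are hypothetical helpers, not QIIME 2 functions; the 20 nt target and the 464 bp amplicon come from this thread, and the 301 bp read length from the original post. Whether read quality actually permits longer truncation has to be judged from the quality plots.

```python
READ_LEN = 301        # read length reported in the original post
TARGET_OVERLAP = 20   # overlap target suggested in this thread


def required_total_truncation(amplicon_len: int,
                              target_overlap: int = TARGET_OVERLAP) -> int:
    """Combined forward + reverse truncation length needed for the target overlap."""
    return amplicon_len + target_overlap


def truncation_shortfall(trunc_len_f: int, trunc_len_r: int, amplicon_len: int,
                         target_overlap: int = TARGET_OVERLAP) -> int:
    """How many more nt (forward + reverse combined) are needed; 0 if none."""
    needed = required_total_truncation(amplicon_len, target_overlap)
    return max(0, needed - (trunc_len_f + trunc_len_r))


# Nominal amplicon (464 bp) and the long end of the quoted +/- 20 bp range.
for amplicon in (464, 484):
    needed = required_total_truncation(amplicon)
    fits_in_reads = needed <= 2 * READ_LEN  # cannot truncate beyond read length
    short_by = truncation_shortfall(271, 210, amplicon)
    print(f"amplicon {amplicon} bp: need {needed} nt combined "
          f"(fits in 2 x {READ_LEN}: {fits_in_reads}); "
          f"271/210 is short by {short_by} nt")
```

How the extra length is split between forward and reverse reads is a judgment call; the usual choice is to extend whichever read keeps acceptable quality for longer.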

@benjjneb may be able to comment on the number of mismatches allowed in the joining region.

No, do not merge prior to dada2. You could merge using q2-vsearch, then denoise with deblur instead if you want to have more control over merging parameters.
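
For illustration, a q2-vsearch joining command with explicit overlap and mismatch settings could be assembled like this. Treat it as a sketch under assumptions: the file names are placeholders, the values 20 and 10 are arbitrary examples rather than recommendations, and the option names (`--p-minovlen`, `--p-maxdiffs`) reflect the `join-pairs` action as I understand it from 2018-era releases. Newer releases rename the action, so check `qiime vsearch --help` for your installed version before running anything.

```python
import shlex


def join_pairs_command(demux: str = "demux.qza",
                       joined: str = "demux-joined.qza",
                       min_overlap: int = 20,
                       max_diffs: int = 10) -> list[str]:
    """Build (but do not run) a `qiime vsearch join-pairs` argument list.

    Hypothetical helper; file names and default values are placeholders.
    """
    return [
        "qiime", "vsearch", "join-pairs",
        "--i-demultiplexed-seqs", demux,
        "--p-minovlen", str(min_overlap),   # minimum overlap length
        "--p-maxdiffs", str(max_diffs),     # max mismatches in the overlap
        "--o-joined-sequences", joined,
    ]


# Print the command so it can be reviewed before being run in a shell.
print(shlex.join(join_pairs_command()))
```

Building the command as a list keeps each option paired with its value and makes it easy to vary the two merging parameters between test runs.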

I hope that helps!
