Hello, I just finished running our data through QIIME2 2017.4. I have several, perhaps very basic, questions about DADA2 trimming and training classifiers:
When we read the quality scores to decide where to trim the sequences, the interactive quality plot offers several options, such as the 2nd percentile, 9th percentile, ... 98th percentile. Which percentile, or combination of percentiles, is appropriate for choosing a quality-score cutoff? Are there other considerations we should take into account? Is there a reference that says what range of quality scores is acceptable?
I am running demultiplexed paired-end sequence data, and we use a protocol targeting the V3V4 region. The primer pairs were: (i) S-D-Bact-0341-b-S-17, 5'-CCTACGGGNGGCWGCAG-3' (32), and S-D-Bact-0785-a-A-21, 5'-GACTACHVGGGTATCTAATCC-3' (32). The length of our amplicon should be 406 (785 - 341 - 38, i.e., excluding the 38 nt of primers). When we calculate the overlap after trimming, do we count the trim-left value (10, for example)? Also, how much overlap do we need at minimum to run the next steps in QIIME2? I tried my best but only get about 80 overlapping base pairs and am not sure whether that is enough.
When we use the V3V4 region, do we need to train a classifier on V3V4 only, using the Greengenes database? What is the difference if I use the full-length Greengenes sequences for the analysis? Is it necessary to have a classifier trained on the V3V4 region?
Unfortunately there isn't a hard-and-fast rule here; it's going to depend on what you're studying and how much error you can "tolerate" in your conclusions. @benjjneb do you have any particular recommendations here, especially with respect to dada2?
For the F/R reads to be successfully merged, trunc-len-f + trunc-len-r must be greater than the length of the amplicon + 20 nucleotides (the 20 nts is the length of the overlap).
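As a rough sketch of that arithmetic (not part of QIIME 2 or DADA2), the overlap can be estimated as trunc-len-f + trunc-len-r minus the amplicon length. The function names are made up for illustration, the 406 comes from the question above, and the truncation lengths 250 and 236 are hypothetical values chosen only so that the overlap comes out to 80.

```python
MIN_OVERLAP = 20  # minimum overlap quoted in this thread


def merge_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Estimated number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=MIN_OVERLAP):
    """True if the estimated overlap reaches the minimum."""
    return merge_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


if __name__ == "__main__":
    amplicon_len = 406  # value from the question; adjust if primers are still on the reads
    for f, r in [(250, 236), (210, 210)]:  # hypothetical truncation lengths
        print(f, r, merge_overlap(f, r, amplicon_len), can_merge(f, r, amplicon_len))
```

Note that trim-left does not appear in the calculation, since it removes bases from the start of each read rather than from the overlapping ends.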
It looks like your overlap of 80 is more than adequate. My understanding is that trim-left doesn't really impact your overlap, it just gives you a chance to remove some of the low-quality base-calls that can happen at the beginning of sequencing a read.
Excuse the lame ASCII graphic, but my understanding is that your reads become oriented as so:
Where the x's are the beginning of the read (and what you might trim with trim-left) and the z's are the end of the read and have terrible quality (trimmed with trunc-len). The y's are the base-calls with high confidence and what you would like your final representative sequence variant to be mostly composed of.
My understanding is you will get better results from a classifier that has been specifically trained on your target region. @Nicholas_Bokulich or @BenKaehler can probably elaborate on that better.
Classification accuracy is improved by trimming your reference sequences to the primer region prior to classifier training. For example, see this article. In my experience (examining V4 domain reads, not V3V4), the increase in accuracy is only at species level, and is not dramatically greater than training on full-length 16S sequences. So if you are constrained, e.g., by memory requirements for fitting a new classifier, then the difference is not critical. Otherwise, it is worth the effort and this tutorial gives an example command.
If you have the inclination to compare classifications both with and without trimming, please post back here to share your results.
One addition to that: You need to have 20 y's, plus enough additional y's to handle the biological length variation in your amplicon. If you aim for the minimum, you will lose real variants that are a few nts longer than the average amplicon length (longer amplicons leave less overlap at fixed truncation lengths), and therefore systematically drop certain taxonomic groups.
20 is the bare minimum, but err on having more overlap. If you can get an overlap of 80 with acceptable quality, I would go for it.
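To see how length variation eats into the overlap, here is a small sketch that checks a range of amplicon lengths against a given pair of truncation lengths and reports which lengths would fall below the minimum. The helper names, the 400 to 430 nt range, and the truncation lengths are all hypothetical; substitute the length distribution you actually expect for your amplicon.

```python
def overlaps_by_length(trunc_len_f, trunc_len_r, amplicon_lengths):
    """Map each amplicon length to its estimated overlap after truncation."""
    return {n: trunc_len_f + trunc_len_r - n for n in amplicon_lengths}


def lengths_lost(trunc_len_f, trunc_len_r, amplicon_lengths, min_overlap=20):
    """Amplicon lengths whose estimated overlap is below the minimum."""
    overlaps = overlaps_by_length(trunc_len_f, trunc_len_r, amplicon_lengths)
    return sorted(n for n, ov in overlaps.items() if ov < min_overlap)


if __name__ == "__main__":
    expected_lengths = range(400, 431)  # hypothetical spread of real amplicon lengths
    # Aiming near the minimum for a 406 nt amplicon (overlap of 24):
    print(lengths_lost(220, 210, expected_lengths))
    # Aiming for an overlap of 80 at 406 nt:
    print(lengths_lost(250, 236, expected_lengths))
```

With these made-up numbers, the first setting fails for part of the range while the second covers all of it, which is the reason to err on the side of more overlap when quality allows.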