I’ve followed both methods, Dada2 and Deblur, on my Illumina MiSeq paired-end demultiplexed sequences (16S, v3-v4). These sequences had their adapters and barcodes trimmed by the company that sequenced my samples.
I’ve noticed that with Deblur there is no parameter to trim the first “n” bases (for example, primer sequences), while in Dada2 there is an option:
--p-trim-left-f INTEGER --p-trim-left-r INTEGER
My questions are:
How can this trimming step be applied during deblurring?
What are the consequences of not trimming primer sequences?
It looks like we have an open issue around exposing this parameter from deblur to the q2-deblur plugin. So, in theory, deblur supports this kind of trimming; it just isn’t possible to use that parameter yet in q2-deblur. Stay tuned!
I will defer to a deblur developer here, but my guess is that it would be wise to trim those primers.
Removing adapters is something that I’m not even sure anyone has actually benchmarked in a programmatic and extensive way. However, there are pros and cons. On the one hand, removing them lets you avoid some “biases” during clustering, taxonomic assignment, etc. (*); on the other hand, not removing them gives you a longer read, which can also bias some of those steps. Additionally, remember that the highest quality is normally at the beginning of the read. Thus, in practice, what works best depends on the protocol you follow. Note that we normally don’t have to remove them because we follow the EMP protocols, in which sequencing starts just after the adapter.
(*) Imagine that you have a 150 bp read and your adapter is 15 bp: during clustering this section is almost the same in every read, and in some cases this might bias the 97% similarity clustering. Just remember that DADA2 and Deblur don’t actually cluster, so these tools, in theory, are not affected by this.
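To make the footnote concrete, here is a rough Python sketch of the arithmetic. It assumes an ungapped comparison of two reads whose adapter portion is identical and whose mismatches all fall in the biological portion. The function names, the second scenario (250 bp read, 20 bp adapter) and the mismatch counts are hypothetical, chosen only for illustration.

```python
def identity(aligned_length, mismatches):
    """Fraction of matching positions in an ungapped pairwise comparison."""
    return (aligned_length - mismatches) / aligned_length


def compare(read_length, adapter_length, mismatches, threshold=0.97):
    """Identity with and without a shared adapter, plus the clustering call.

    Assumes the adapter is identical in both reads, so every mismatch
    lies in the biological part of the read.
    """
    with_adapter = identity(read_length, mismatches)
    without_adapter = identity(read_length - adapter_length, mismatches)
    return (
        with_adapter,
        without_adapter,
        with_adapter >= threshold,
        without_adapter >= threshold,
    )


# The 150 bp read / 15 bp adapter case from the footnote, with 5 mismatches
# (the mismatch count is an assumption).
print(compare(150, 15, 5))

# A hypothetical case where the shared adapter flips the 97% call:
# 243/250 = 0.972 passes, 223/230 is roughly 0.9696 and does not.
print(compare(250, 20, 7))
```

In both cases the shared adapter inflates the apparent identity; whether that changes a clustering decision depends on the read length, the adapter length and the number of real differences.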
Just to add to @antgonza’s explanation: primer sequences should be trimmed prior to denoising with dada2, as explained here and noted in the dada2 FAQ. In dada2, this can be accomplished easily with the trim-left parameter.
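As a rough illustration of what a positional trim-left does to a single read, here is a minimal Python sketch. It is not DADA2's implementation; the `trim_left` helper and the example read are hypothetical. The point is that the trim removes a fixed number of leading bases without checking that they match the primer, so the value should equal the primer length and the primer must start at the first base of every read.

```python
def trim_left(sequence, quality, n):
    """Drop the first n bases and their quality scores from one read."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return sequence[n:], quality[n:]


# 515F primer (19 nt, with two ambiguous positions: Y and M).
primer = "GTGYCAGCMGCCGCGGTAA"

# Hypothetical read: one concrete primer variant followed by made-up
# biological sequence, with a placeholder quality string of equal length.
read_seq = "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGTGCAAGCG"
read_qual = "I" * len(read_seq)

trimmed_seq, trimmed_qual = trim_left(read_seq, read_qual, len(primer))
print(trimmed_seq)        # TACGGAGGGTGCAAGCG
print(len(trimmed_qual))  # 17
```

For paired-end data the forward and reverse primers usually differ in length, which is why there are separate forward and reverse trim-left values.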
As far as I know, deblur does not have the same requirement.
Other downstream steps could still theoretically be affected. For example, if you train a feature classifier trimmed to a specific primer region, the trimmed reference sequences will not include that primer sequence. This probably does not have a significant impact on classification, but this is untested.
We have not really benchmarked these effects in any step because, as @antgonza pointed out, we (QIIME developers) are usually working with the EMP sequencing protocol, in which the primers are already trimmed from the reads.
The biggest problem arises from the ambiguous nucleotides in many primers. For example, the 515F primer is GTGYCAGCMGCCGCGGTAA. The two ambiguous nucleotides (Y=C or T, M=A or C) show up in equal proportions (technically, a mixture of the 4 possible primers is used).
As a result, if reads have the 515F primer on them, each real biological sequence will show up in a 25/25/25/25% mixture of the 4 possible primer sequences + real biological sequence. This is very bad when using ASV methods, as they will distinguish those differences and call 4 types for every 1 real type!
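A small Python sketch of this effect, under stated assumptions: the biological sequence below is made up, and only the IUPAC codes that occur in 515F are handled. It expands the degenerate primer into its concrete variants, then counts distinct sequences before and after removing the primer.

```python
from itertools import product

# IUPAC codes needed for 515F only (Y = C or T, M = A or C).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "M": "AC"}


def expand_primer(primer):
    """Return every concrete sequence a degenerate primer can represent."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


PRIMER_515F = "GTGYCAGCMGCCGCGGTAA"
variants = expand_primer(PRIMER_515F)

# Hypothetical biological sequence, invented for illustration.
biological = "TACGGAGGGTGCAAGCGTTAATCGG"

# One real sequence, amplified with each primer variant.
untrimmed_reads = {variant + biological for variant in variants}

# Removing the primer-length prefix collapses them back to one sequence.
trimmed_reads = {read[len(PRIMER_515F):] for read in untrimmed_reads}

print(len(variants))         # 4
print(len(untrimmed_reads))  # 4 distinct sequences an ASV method could call
print(len(trimmed_reads))    # 1 real biological sequence
```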
It’s less of an issue if there are no ambiguous nucleotides, but still, primer/adapter sequence isn’t DNA from the sequenced organism. That is why it is a plus for methods (e.g. EMP) that don’t sequence the adapters/primers/etc., as they don’t waste bases on non-biological DNA.