Hi @Adam_Rivers, I ran some samples through the dada2 pipeline in R after trimming with either unmodified itsxpress or a version of itsxpress modified to trim 3’ ends as described above. I thought I would post some of the results here in case they are useful for potential future itsxpress updates. Thanks again for this nice tool! I’ll definitely use it as part of my future workflows.
It looks like the trimming strategy makes a small difference in the number of ASVs assigned (272 vs. 276 ASVs for “standard” vs. 3’ trimmed sequences, respectively, across 24 samples). The length of the ASV sequences also differs between the two methods. A few plots below.
1. Length of sequences after standard vs. 3’ trimming with itsxpress (one sample as an example). These were the input to the dada2 pipeline.
The variation in sequence length between methods is in R2 (i.e., the read from the LSU primer in this case), whereas the majority of R1 reads were trimmed to about 90 bp with either method.
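For anyone who wants to check the same thing on their own trimmed files, here is a minimal sketch of how the read-length distribution could be tabulated. It is not what I used for the plots; it assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function names and the file names in the usage comment are hypothetical.

```python
from collections import Counter
import gzip


def read_lengths(fastq_path):
    """Count reads by sequence length in a FASTQ file (plain or gzipped).

    Assumes strict 4-line records, so the sequence is always line 2 of 4.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lengths = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence line of each record
                lengths[len(line.strip())] += 1
    return lengths


def summarize(lengths):
    """Return read count plus min, max and most common length."""
    mode_len, _ = lengths.most_common(1)[0]
    return {
        "reads": sum(lengths.values()),
        "min": min(lengths),
        "max": max(lengths),
        "mode": mode_len,
    }


# Usage (hypothetical file names):
# for path in ["standard_R2.fastq.gz", "trim3_R2.fastq.gz"]:
#     print(path, summarize(read_lengths(path)))
```

Comparing the summaries for R1 and R2 under each trimming method should show whether the length variation sits in the reverse read, as it did here.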
2. Number of unique sequences after dada2 dereplication, denoising (i.e., sequence variant assignment), merging for ASV assignment, and chimera removal across 24 samples.
At the derep and denoise steps, the number of unique sequences varies between methods in the reverse reads but not the forward reads. At the derep step the standard method generally has more unique sequences (likely due to variation in read-through length), whereas at the denoise step the trend switches for some samples (due to including conserved sequence?) and is generally less pronounced. There is also a small amount of variation in the number of ASVs assigned between methods.
3. Length of unique ASV sequences. The standard method has a spike of ASVs at about 180 bp, whereas trimming at the 3’ end appears to redistribute those ASVs around 150 bp with a much larger spread in length (note the log10-scale y-axis).
To me this suggests that variable amounts of the conserved region (5.8S in this case) are included in the final ASVs when the 3’ end is not trimmed, which can affect downstream taxonomic assignment or any additional sequence-similarity clustering steps performed after dada2 ASV assignment. This would also impact the reproducibility of dada2 ASVs between datasets (i.e., systematic differences in read-through length between studies or sequencing runs could result in different ASV sequences).
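To make the reasoning concrete, here is a toy sketch of why read-through can inflate or shift the set of unique sequences. Everything in it is invented for illustration: the sequences are not real ITS or 5.8S data, and the exact-match trimming is a stand-in I made up, not how itsxpress or dada2 actually work.

```python
# Toy sequences, invented for illustration only (not real ITS or 5.8S data).
ITS_VARIANTS = ["ACGTTGCAAGGCT", "ACGTTGCTAGGCT"]  # two "true" variants
CONSERVED = "GGATCCTTAGCA"  # stands in for the flanking conserved region


def with_read_through(its, n):
    """Append the first n bases of the conserved region to an ITS variant."""
    return its + CONSERVED[:n]


def trim_3prime(seq, conserved=CONSERVED):
    """Strip any read-through into the conserved region from the 3' end.

    Simplification: exact prefix matching, longest match first.
    """
    for n in range(len(conserved), 0, -1):
        if seq.endswith(conserved[:n]):
            return seq[:-n]
    return seq


# Each variant is "sequenced" with three different read-through lengths.
reads = [with_read_through(its, n) for its in ITS_VARIANTS for n in (0, 3, 7)]

untrimmed_unique = set(reads)
trimmed_unique = {trim_3prime(r) for r in reads}

print(len(untrimmed_unique), len(trimmed_unique))  # prints "6 2"
```

In this toy case the same two variants show up as six distinct sequences before 3’ trimming and as two afterwards, which is the kind of effect I suspect is behind the length differences above.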