I am analyzing data obtained with the EMP protocol for the fungal ITS1 region (soils). I have 3 runs. Two runs were sequenced with the Illumina V3 kit (2x300 bp), but the third came from a run with the V2 kit (2x250 bp). The problem is that if I ignore the shorter read length of the third run, the same biological sequence could yield different ASVs in the third run than in the other two.
Also, the highly variable length of the ITS region complicates this further. A solution for this particular problem is implemented in dada2 in R, in the function “collapseNoMismatch”, which “collapse together variants with no mismatches or internal indels but that could differ by terminal gaps, i.e. variants that differ by their lengths and nothing else”. This would be great! Is there something equivalent in qiime2?
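To make clear what I mean by collapsing, here is a minimal Python sketch of the idea. It is not the dada2 implementation: the function name `collapse_prefix_variants` and the toy sequences are made up for illustration, and it only handles the simplest case, where a shorter variant is an exact prefix of a longer one (which is what I would expect when runs differ only in read length).

```python
def collapse_prefix_variants(counts):
    """Merge variants that differ only by length at the 3' end.

    counts: dict mapping sequence -> abundance.
    A sequence is folded into the most abundant already-kept sequence
    when one of the two is an exact prefix of the other.
    Simplified illustration only (hypothetical helper, not dada2 code).
    """
    # Visit sequences from most to least abundant; ties broken by sequence
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    collapsed = {}
    for seq in ordered:
        for kept in collapsed:
            if kept.startswith(seq) or seq.startswith(kept):
                # Same sequence apart from length: add its reads to the kept one
                collapsed[kept] += counts[seq]
                break
        else:
            # No length-only match found: keep it as its own variant
            collapsed[seq] = counts[seq]
    return collapsed


# Toy data: the second sequence is a truncated copy of the first,
# the third has a real mismatch and must stay separate.
toy = {"ACGTACGTAC": 10, "ACGTACGT": 4, "ACGTTCGT": 3}
print(collapse_prefix_variants(toy))
```

In the toy table the truncated copy is merged into the longer sequence, while the variant with a real mismatch is kept apart.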
Two alternative approaches that I tried with qiime2 are:
- Trim the 300 bp raw sequences (R1 and R2) to 250 bp before DADA2, but I can't find a tool to do this on multiple fastq.gz files (I tried seqkit), and I am also not 100% sure whether this would introduce any bias. Is there something in the qiime2 tools for this?
- Denoise the three runs (I process the runs separately and later merge the ASV tables) using the option --p-trunc-len. In this example, I use only the first 175 bp of R1 (after cutadapt):
qiime dada2 denoise-single
The rationale is that by using only the first 175 bp of the reads, every run is truncated to the same length, so the difference in read length between kits no longer matters. But I am also not 100% sure about this.
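For the first approach (trimming the 300 bp reads to 250 bp), this is roughly what I am trying to do, sketched in plain Python with only the standard library. The function names `trim_fastq_gz` and `trim_directory` and the directory layout are hypothetical, and the sketch assumes standard 4-line FASTQ records (no wrapped sequence lines), as Illumina produces.

```python
import gzip
from pathlib import Path


def trim_fastq_gz(src, dst, max_len=250):
    """Copy src to dst, cutting every read to at most max_len bases."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            # In a 4-line record, index 1 is the sequence and index 3 the
            # quality string; both must be cut to the same length.
            if i % 4 in (1, 3):
                line = line[:max_len]
            fout.write(line + "\n")


def trim_directory(in_dir, out_dir, max_len=250):
    """Trim every *.fastq.gz file in in_dir, writing results to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.fastq.gz")):
        trim_fastq_gz(src, out / src.name, max_len)
```

Calling `trim_directory("raw_run1", "trimmed_run1")` (hypothetical directory names) would process R1 and R2 files alike, keeping the original file names. Reads already shorter than 250 bp are left unchanged.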
Any suggestions? Thanks!