Just a thought on this topic:
Depending on your read lengths, I don’t think you should use the --p-trunc-len parameters in DADA2 for ITS analysis without some thought.
A recent paper in PLOS ONE (https://doi.org/10.1371/journal.pone.0206428) ran an in silico analysis to determine whether ITS1 or ITS2 should be used as the barcode for fungal studies. Using the UNITE database, the group found:
" The length of the extracted ITS1 portions ranged from 9 bp to 1181 bp, with an average length of 177 bp, and the length of the extracted ITS2 portions ranged from 14 bp to 730 bp, with an average length of 182 bp, among the fungi."
If you are using 250 or 300 bp reads, your read length could far exceed the ITS fragment in question for particular organisms. Cutadapt removes the primers and any nucleotides that come before or after them (i.e. your adapters). The DADA2 --p-trunc-len parameter is then used to cut reads at a fixed position, which removes the low-quality region towards the end of the reads. However, all reads shorter than --p-trunc-len are discarded. Therefore, if your sample contains any organisms whose ITS region is shorter than your --p-trunc-len value, you would be biasing your analysis towards organisms whose ITS region is at least as long as that truncation length.
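To make the bias concrete, here is a toy sketch in Python. It is not DADA2 code; the function name, the read names, and the lengths are all made up for illustration. It only mimics the rule described above: reads shorter than the truncation length are dropped, and the rest are cut to that length.

```python
def truncate_or_discard(reads, trunc_len):
    """Mimic fixed-length truncation on a dict of {name: sequence}.

    Reads shorter than trunc_len are discarded; all others are cut to
    exactly trunc_len bases. (Hypothetical helper, not part of DADA2.)
    """
    kept = {}
    for name, seq in reads.items():
        if len(seq) < trunc_len:
            continue  # too short: this read never reaches denoising
        kept[name] = seq[:trunc_len]
    return kept


# Hypothetical reads after primer/adapter removal, one per organism.
# Lengths are invented, loosely spanning the range reported in the paper.
reads = {
    "short_ITS_organism": "A" * 120,
    "average_ITS_organism": "C" * 180,
    "long_ITS_organism": "G" * 260,
}

kept = truncate_or_discard(reads, trunc_len=200)
print(sorted(kept))
```

With a truncation length of 200, only the organism with the 260 bp fragment survives in this toy example, even though all three were present in the sample.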
I see a couple of options to deal with this problem:
- Use cutadapt, skip the removal of low-quality regions with the --p-trunc-len parameter, and proceed
- Use cutadapt, then pass the reads through Phred-based filtering, then use DADA2 without the --p-trunc-len parameter (though this would inflate your ESV count, because sequences with unique lengths are considered to be unique ESVs)
- Accept the bias and proceed
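The concern in the second option can be sketched with another toy example in Python. The trimming rule here (cut at the first base below a quality threshold) and all names and values are assumptions for illustration only; this is not what any specific filtering tool or DADA2 does internally.

```python
def trim_at_first_low_quality(seq, quals, min_q=20):
    """Cut a read just before the first base whose Phred score is below min_q.

    Hypothetical, simplified trimming rule used only for illustration.
    """
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq  # no low-quality base found: keep the full read


# Two reads from the same (made-up) template, differing only in quality.
template = "ACGTACGTAC"
read_a = trim_at_first_low_quality(template, [38] * 10)
read_b = trim_at_first_low_quality(template, [38] * 7 + [12] * 3)

# Compared as exact strings, the two trimmed reads are no longer identical.
unique_sequences = {read_a, read_b}
print(len(read_a), len(read_b), len(unique_sequences))
```

Both reads come from the same template, but after quality trimming they have different lengths, so any step that compares sequences as exact strings would count them separately.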
If anyone has thoughts on this, or other workarounds, that would be great!