ITS analysis: does it make sense to use dada2 trunc-len parameters?

jsogin574 · September 12, 2019, 7:14pm

Just a thought on this topic:

Depending on your read lengths, I don't think you should use the p-trunc-len parameters in DADA2 by default for ITS analysis.

In a recent PLOS One paper, "Evaluation of the ribosomal DNA internal transcribed spacer (ITS), specifically ITS1 and ITS2, for the analysis of fungal diversity by deep sequencing", the authors ran an in silico analysis to determine whether ITS1 or ITS2 should be used as the barcode for fungal studies. Using the UNITE database, the group found:
"The length of the extracted ITS1 portions ranged from 9 bp to 1181 bp, with an average length of 177 bp, and the length of the extracted ITS2 portions ranged from 14 bp to 730 bp, with an average length of 182 bp, among the fungi."

If you are using 250 or 300 bp reads, your read length could far exceed the ITS fragment in question for particular organisms. Cutadapt removes the primers and any nucleotides that come before or after them (i.e. your adapters). The dada2 p-trunc-len parameter is then used to remove low-quality regions towards the end of reads. However, all sequences shorter than p-trunc-len are discarded. Therefore, if your sample contains any organisms with ITS regions shorter than your p-trunc-len setting, you would be biasing your analysis to only catch organisms with ITS regions at least as long as that truncation length.
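To make the concern concrete, here is a minimal Python sketch of fixed-length truncation as I understand it from the DADA2 documentation (reads shorter than the truncation length are discarded, longer reads are cut to that length). The function name and the read lengths are made up for illustration; this is not DADA2 code.

```python
def truncate_fixed_length(reads, trunc_len):
    """Mimic fixed-length truncation on a list of read strings.

    Reads shorter than trunc_len are discarded; the rest are cut to
    exactly trunc_len bases. trunc_len == 0 means "no truncation".
    """
    if trunc_len == 0:
        return list(reads)
    return [r[:trunc_len] for r in reads if len(r) >= trunc_len]


# Hypothetical primer-trimmed reads: two short ITS amplicons that were
# trimmed by cutadapt, and two full-length reads.
reads = ["A" * 120, "C" * 177, "G" * 250, "T" * 300]

kept = truncate_fixed_length(reads, 200)
print([len(r) for r in kept])  # [200, 200]
```

With a truncation length of 200, the 120 bp and 177 bp reads disappear entirely, which is the bias described above.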

I see a couple options to deal with this problem:

1. Use cutadapt, don't remove low-quality regions with the p-trunc-len parameter, and proceed
2. Use cutadapt, then pass through phred-based filtering, then use DADA2 without the p-trunc-len parameter (though this would inflate your ESV count, because sequences with unique lengths are considered to be unique ESVs)
3. Accept the bias and proceed

If anyone has thoughts on this or other workarounds that would be great!

Nicholas_Bokulich · September 12, 2019, 7:49pm

I think this is a really good point. After all, we recommend using q2-cutadapt or q2-itsxpress to avoid read-through on short ITS amplicons. Setting truncate parameters with dada2 will then cause these trimmed reads to be dropped.

But as you say it depends on read length. In the case of the tutorial data (the focus of the original question), the reads are all evidently shorter than the total amplicon length and there is no read-through, judging from the cutadapt results. In cases like these, and especially when using dada2 denoise-single, setting a truncation length can be useful for simplifying quality control and downstream processing.

The minimums in those ranges seem extremely short, much shorter than I have seen reported elsewhere in the literature, and I suspect they may be errors in their simulation.

I would add a 4th option to your proposals:

1. Use q2-itsxpress and/or cutadapt, and examine read lengths before and after (to assess how much trimming occurred). Examine the read length distributions to see (a) what truncation lengths are acceptable and (b) whether the length distributions make sense (e.g., very short reads could be junk!)
2. Use dada2.
   2a. If denoise-paired, definitely don't use truncation unless it is needed for read quality purposes.
   2b. If denoise-single, test out a reasonable truncation length and pay close attention to the dada2 stats output to make sure you are not losing too many reads during the initial filtering step (if you are, and the truncation setting is lower than the minimum trimmed read length, the loss is not related to pre-trimming).
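As a rough illustration of step 1, the sketch below tabulates a read-length distribution and the fraction of reads that would survive a few candidate truncation lengths. It assumes you have already extracted the trimmed read lengths into a plain list; the helper names and the example lengths are hypothetical.

```python
from collections import Counter


def length_histogram(lengths):
    """Return {read_length: count}, sorted by read length."""
    return dict(sorted(Counter(lengths).items()))


def fraction_retained(lengths, trunc_len):
    """Fraction of reads at least trunc_len long (0 means no truncation)."""
    if not lengths:
        return 0.0
    if trunc_len == 0:
        return 1.0
    return sum(1 for n in lengths if n >= trunc_len) / len(lengths)


# Hypothetical read lengths after cutadapt / ITSxpress trimming.
lengths = [95, 150, 150, 180, 220, 250, 250, 250]

print(length_histogram(lengths))
for candidate in (0, 100, 150, 200, 250):
    print(candidate, fraction_retained(lengths, candidate))
```

If the retained fraction falls off sharply at a candidate length, that length is probably cutting into real short amplicons rather than just removing junk.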

llenzi · September 13, 2019, 12:41pm

Hi both,
Adding my 2 cents on that.

@Nicholas_Bokulich when you say don't use truncation in your option 2a, are you referring to the truncation length parameters only?
What about keeping '--p-trunc-len-f 0' and '--p-trunc-len-r 0' but also increasing the setting for '--p-trunc-q'?
This is what I'm doing in my pipeline, so as to have a more dynamic approach rather than applying a fixed length to all the reads. Would this make sense to you?
Luca

jsogin574 · September 13, 2019, 1:50pm

@llenzi

Two questions for you:

1. If you use cutadapt, do you do any length-based filtering of the reads obtained after primer removal (even 25 nt or so)? And if you’ve done any optimization with this, do you find that DADA2 behaves differently when you give it a few reads that are really short?
2. I know this is extremely run-dependent, but what have you found to be a reasonable range for p-trunc-q to still get a good number of reads passing filtering and merging?

Nicholas_Bokulich · September 13, 2019, 2:04pm

Yes, you are correct: I was only referring to the --p-trunc-len* params.

For single-end reads I would discourage raising --p-trunc-q in place of a fixed truncation length, as it would lead to a large number of ASVs that may be genetically identical but are treated as separate ASVs because they are of different lengths (even 1 nt different!)
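A small Python sketch of why this happens, assuming quality-based truncation cuts each read at the first base whose quality score is less than or equal to the threshold (which is how I read the DADA2 documentation for truncQ). The function name, sequence, and quality scores are invented for illustration.

```python
def truncate_at_quality(seq, quals, trunc_q):
    """Cut a read at the first base with quality <= trunc_q.

    seq is the base string and quals the per-base quality scores.
    Returns the unmodified read if no base falls at or below trunc_q.
    """
    for i, q in enumerate(quals):
        if q <= trunc_q:
            return seq[:i]
    return seq


# Two reads of the same template whose quality drops at different positions.
seq = "ACGTACGTAC"
read_a = truncate_at_quality(seq, [38, 38, 37, 36, 35, 30, 25, 12, 8, 2], trunc_q=10)
read_b = truncate_at_quality(seq, [38, 38, 37, 36, 35, 30, 25, 20, 15, 9], trunc_q=10)

print(read_a, len(read_a))  # ACGTACGT 8
print(read_b, len(read_b))  # ACGTACGTA 9
print(read_a == read_b)     # False
```

The two reads come from the same sequence, but after quality-based truncation they differ in length by 1 nt and would be counted as distinct sequences in single-end data.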

For paired-end reads I think it is fine. You may lose some reads because the read is being trimmed too much, but if it's because the Q-scores are garbage, then that's the intention!

Nor would it be inconsistent with @jsogin574's advice: @jsogin574's concern is that trunc-len filtering would cause cutadapt-trimmed reads to be dropped, but q-score based filtering would not do that (since it is based on extant read length, not an arbitrary value).