Dear all,
Hello, I’m a researcher who is just beginning work in the microbiome field, and I’m currently facing some challenges during the initial preprocessing steps. I’m writing this message in the hope of getting advice from those with more experience.
The dataset we’re working with was generated using iSeq with single-end reads (300 bp), targeting the V4 region. Since it's not paired-end, we’ve encountered some issues during preprocessing.
So far, we have removed the primers and are planning to apply filterAndTrim using a Q30 quality threshold. However, I was wondering: when working with ASVs, is there a recommended minimum read length after filtering? I’ve had difficulty finding clear guidance in the literature, so if there are any papers or official recommendations on this topic, I would greatly appreciate it if you could share them.
Thank you very much in advance for your help and advice.
Best regards,
Good morning! Welcome to the QIIME 2 forums!
Illumina sequencers and single-end reads are both well supported by QIIME 2 and other software. Of course, it's good to graph the quality scores to see how this specific run turned out!
ASVs can be any length, but longer reads and ASVs give more information that can improve taxonomy resolution. So longer is better.
However, when working with ASVs, a biologically consistent length is more important than a longer length. I recommend this article on the benefits of global trimming.
“I’ve had difficulty finding clear guidance in the literature, so if there are any papers or official recommendations on this topic, I would greatly appreciate it if you could share them.”
This all depends on the pipeline being used! The manuscript on the software should justify its pipeline and trimming settings!
For example, in the Moving Pictures tutorial (Option 2: Deblur), they use a q-score filter, trim to a consistent length, and cite Bokulich et al. (2013).
DADA2 works differently and so different filtering methods are needed.
Thank you for your detailed response. It was very helpful!
To maintain consistent read lengths, I used plotQualityProfile to determine the position where quality reaches Q30 and set that value for truncLen in dada2::filterAndTrim. At the same time, I set truncQ = 0 to avoid variability in read length caused by quality-based truncation.
I’m wondering, could setting truncQ = 0 cause any issues? And what would be the general guideline or best practice for deciding on this setting?
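The approach described above (find the position where the quality profile falls below Q30 and use it as truncLen) can be sketched in a few lines. This is a plain Python illustration, not the DADA2 R API: the function names are hypothetical, the input is a toy list of FASTQ quality strings, Phred+33 encoding is assumed, and "reaches Q30" is interpreted here as the first position whose mean quality is below 30.

```python
def mean_quality_by_position(quality_lines, offset=33):
    """Return the mean Phred score at each read position.

    quality_lines: FASTQ quality strings (assumed Phred+33 encoded).
    Positions past the end of a shorter read are averaged over the
    reads that actually cover them.
    """
    if not quality_lines:
        return []
    max_len = max(len(line) for line in quality_lines)
    sums = [0] * max_len
    counts = [0] * max_len
    for line in quality_lines:
        for i, ch in enumerate(line):
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def suggest_trunc_len(mean_q, threshold=30):
    """Number of leading positions whose mean quality is >= threshold."""
    for i, q in enumerate(mean_q):
        if q < threshold:
            return i
    return len(mean_q)


# Toy data: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
toy_quals = ["IIIIIIII5555", "IIIIIIIIII55", "IIIIIIII5555"]
profile = mean_quality_by_position(toy_quals)
trunc_len = suggest_trunc_len(profile)
print(trunc_len)
```

On this toy input the mean quality stays at Q40 for the first eight positions and then drops below Q30, so the suggested truncation length is 8. A value chosen this way is only a starting point; it says nothing yet about how many reads would survive filtering at that length.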
Maybe! Any reads with quality that low will probably be removed by DADA2 during its filter step, so it may be okay but unneeded.
EDIT: I misread your comment and thought the reads were trimmed at differing lengths when they reached Q30! DADA2 recommends applying a uniform truncation length to your input reads as a quality control step, which is what you are doing!
I think this is likely to cause more issues, but I suppose you can see for yourself in the DADA2 stats file. After running DADA2 with these settings, how many reads pass filter?
What kind of issues could be caused by this process? Most of the samples retain around 6,000–7,000 reads each.
Thanks for providing those numbers. 6,000-7,000 reads per sample may be an okay number, so there may not be a problem at all.
Choosing a truncLen using average Q score might work okay (or not!).
My goal when setting truncLen is to maximize the number of reads that pass the quality filter and can still join, so Q30 might be a little high or a little low, depending on the run. I run DADA2 with multiple different settings and choose the one that works best!
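The "try several settings and compare" idea can be sketched as follows. This is a simplified Python illustration, not DADA2 itself: the function names are hypothetical, Phred+33 encoding is assumed, and `max_ee=2.0` is just an illustrative threshold. It models two rules: reads shorter than the truncation length are discarded, and after truncation a read is discarded if its expected errors, EE = sum(10^(-Q/10)), exceed the threshold.

```python
def expected_errors(quals):
    """Expected number of errors implied by a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def reads_passing(quality_lines, trunc_len, max_ee=2.0, offset=33):
    """Count reads that survive truncation to trunc_len and the EE filter."""
    passed = 0
    for line in quality_lines:
        if len(line) < trunc_len:
            continue  # too short to truncate to this length: discarded
        quals = [ord(ch) - offset for ch in line[:trunc_len]]
        if expected_errors(quals) <= max_ee:
            passed += 1
    return passed


def compare_trunc_lens(quality_lines, candidates, max_ee=2.0):
    """Map each candidate truncation length to the number of reads kept."""
    return {t: reads_passing(quality_lines, t, max_ee) for t in candidates}


# Toy data: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
toy_quals = [
    "I" * 12,            # high quality throughout
    "I" * 8 + "#" * 4,   # quality collapses at the 3' end
    "I" * 9,             # short read
]
kept = compare_trunc_lens(toy_quals, candidates=[8, 10, 12])
print(kept)
```

Here the shortest candidate keeps all three toy reads, while longer candidates lose the short read and then the read with the low-quality tail. On real data the trade-off runs the other way too, since longer ASVs carry more taxonomic information, which is why comparing the filter statistics across several settings is useful.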