When using deblur to denoise ITS1 data (variable length) generated from a MiSeq 2x300 bp run, should I set --p-trim-length to a lower value to keep species with shorter ITS1 (e.g. 150, at the cost of taxonomic resolution), or could I avoid trimming (-1)? What is the downside of not trimming to the same length in deblur?
Also, how can I view the length distribution of quality-filtered data? I only found the quality plot produced by the following:
qiime demux summarize
I would try this both ways to see how it impacts the results. See below.
The downside is just that sequences of different lengths can wind up dereplicating as unique sequence variants even if they should technically bin together as a single variant (when the lengths differ, it is hard to know whether they really are the same variant).
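To make the dereplication point concrete, here is a toy sketch in Python. It is not deblur's algorithm, only exact-match counting on made-up reads, and the `dereplicate` helper and its `trim_length` argument are hypothetical names chosen to mirror the behaviour of --p-trim-length (reads shorter than the trim length are dropped, the rest are cut to that length).

```python
from collections import Counter

# Made-up reads: the same underlying sequence observed at three different lengths.
reads = [
    "ACGTACGTACGTACGT",
    "ACGTACGTACGTAC",
    "ACGTACGTACGT",
    "ACGTACGTACGTACGT",
]


def dereplicate(seqs, trim_length=-1):
    """Count exact-match sequences, optionally trimming to a fixed length first.

    With trim_length > 0, reads shorter than trim_length are dropped and the
    remaining reads are cut to trim_length. With -1, reads are left untouched.
    """
    if trim_length > 0:
        seqs = [s[:trim_length] for s in seqs if len(s) >= trim_length]
    return Counter(seqs)


untrimmed = dereplicate(reads)
trimmed = dereplicate(reads, trim_length=12)

print(len(untrimmed), "unique sequences without trimming")
print(len(trimmed), "unique sequences after trimming to 12")
```

Without trimming, the three lengths count as three separate variants. After trimming to a common length they collapse into one, but any read shorter than that length would be lost.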
With ITS1 I am always a bit nervous about paired-end data, just because it is a hypervariable region and I believe some clades do have ITS1 > 600 bp long. Dropping long variants for lack of sufficient overlap will bias against these clades (length is variable but not randomly distributed across clades). I would pay close attention to how many sequences fail to merge. Of course it could be due to low-quality sequences at the tips failing to overlap, but it could also be very long variants that fail to overlap. If you have some way to figure out what’s what, it would be beneficial (to everyone who has this problem!).
We do not currently have a good way to look at length distribution in QIIME2, but it is on our radar and should be available in a future release. We will post back here when that feature is available.
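As a rough back-of-the-envelope check on the overlap concern, the longest amplicon that two reads can span is twice the read length minus the overlap the merger requires. The sketch below assumes a minimum overlap of 20 bp purely for illustration (the real value depends on the merging tool and its settings), and the function names are hypothetical.

```python
def max_mergeable_length(read_length, min_overlap):
    """Longest amplicon two paired reads can cover while still overlapping enough."""
    return 2 * read_length - min_overlap


def can_merge(amplicon_length, read_length, min_overlap):
    """True if a fragment of this length leaves at least min_overlap bases shared."""
    return amplicon_length <= max_mergeable_length(read_length, min_overlap)


# Assumed values: 300 bp reads (2x300 run) and a 20 bp minimum overlap.
READ_LENGTH = 300
MIN_OVERLAP = 20

limit = max_mergeable_length(READ_LENGTH, MIN_OVERLAP)
print("Longest mergeable amplicon:", limit)

for amplicon in (250, 580, 620):
    print(amplicon, can_merge(amplicon, READ_LENGTH, MIN_OVERLAP))
```

In practice the limit is lower than this, because quality trimming shortens the usable part of each read before merging.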
For now, you could export those sequences and use something like stand-alone vsearch to get quick stats on sequence length (vsearch is included in the QIIME2 installation so that should be easy).
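If you prefer not to use vsearch, a few lines of plain Python can also tabulate sequence lengths. This is only a minimal sketch: it assumes the exported sequences are in an uncompressed FASTA file (FASTQ would need different parsing), and the function names and the file name in the usage note are hypothetical.

```python
from collections import Counter


def fasta_lengths(path):
    """Yield the length of each sequence in a FASTA file (multi-line records allowed)."""
    length = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if length is not None:
                    yield length
                length = 0
            elif line and length is not None:
                length += len(line)
    if length is not None:
        yield length


def length_distribution(path):
    """Return a Counter mapping sequence length to number of sequences."""
    return Counter(fasta_lengths(path))


def print_distribution(path):
    """Print one line per observed length, shortest first."""
    for length, count in sorted(length_distribution(path).items()):
        print(f"{length}\t{count}")
```

Calling `print_distribution("exported.fasta")` on your own file prints a two-column table of length and count that you can paste into a spreadsheet or plot.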