I have a question regarding sequence length after denoising. I amplified the V4 region using 515F-806R primers. I trimmed primers and adapter sequences, then denoised using truncation parameters chosen from the quality scores in the demux summary, all in QIIME 2 2022.8.
These parameters resulted in a fairly high retention of my sequences after filtering, denoising, and merging (~86%). I then tabulated the sequences to see what the lengths were and was surprised to see them range from 231 to 420 bp. I'm under the impression they should be around 253 bp. Granted, my mean length is 253. Is there a parameter I did not include to trim merged sequences to the correct length? Am I joining forward and reverse reads in incorrect places?
This is a great question, and your impression is correct: your sequence lengths should be around 253 bp. A couple of good things to point out about your data set: your mean length is 253.11, which is right where it should be! Additionally, the standard deviation is only about 3 bp, which means that most of your data is sitting right around that 253 bp mark.
However, you do seem to have some outliers on both ends (the min and max lengths), especially that max length, which is much longer than the V4 region. The short answer is that these lengths can be related to non-target DNA, which you will most likely want to filter out unless they are of interest to you. You could try filtering out anything that's shorter than 240 bp or longer than 255 bp and see what the statistics look like after that; it should remove the outliers you're seeing.
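To make the length filter concrete, here is a rough Python sketch of the idea, not a QIIME 2 command. It assumes you have exported your representative sequences and loaded them into a plain dict of ID to sequence; the function names and the toy data are made up for illustration, and the 240/255 cutoffs are just the ones suggested above.

```python
from statistics import mean, pstdev


def filter_by_length(seqs, min_len=240, max_len=255):
    """Keep sequences whose length falls within [min_len, max_len] (inclusive)."""
    return {sid: s for sid, s in seqs.items() if min_len <= len(s) <= max_len}


def length_stats(seqs):
    """Summarize sequence lengths: count, min, max, mean, population SD."""
    lengths = [len(s) for s in seqs.values()]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "sd": pstdev(lengths),
    }


# Toy data (hypothetical): three V4-like lengths plus one short and one long outlier.
toy = {
    "asv1": "A" * 253,
    "asv2": "C" * 252,
    "asv3": "G" * 254,
    "asv4": "T" * 231,
    "asv5": "A" * 420,
}

kept = filter_by_length(toy)
print("before:", length_stats(toy))
print("after: ", length_stats(kept))
```

Comparing the two summaries before and after filtering is a quick way to check that only the outliers were dropped. Keep in mind that the same IDs would also need to be removed from your feature table so the two stay in sync.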
This is a great forum post that goes into more detail on this situation, for your reference.