I do, however, still have a question related to this, since I can't seem to get my head around this truncation.
Here's an image of paired-end, 2x250 MiSeq reads whose forward and reverse primers were stripped with Cutadapt in both read directions. The amplicon size is 158 bp for this COI dataset (mite-specific primers, so there is no variation in amplicon length).
- Remove 3' nucleotides from every read longer than X, bringing that read's length down to X.
- Drop any read shorter than X: you can't truncate something that is already shorter than the target length.

Keep in mind that for paired-end sequences the truncation length is set separately per read direction (--p-trunc-len-f and --p-trunc-len-r).
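The two rules above can be sketched in a few lines of Python (this is just an illustration of the behavior, not DADA2's actual implementation):

```python
def truncate_reads(reads, trunc_len):
    """Truncate each read to trunc_len at the 3' end; drop shorter reads."""
    kept = []
    for read in reads:
        if len(read) < trunc_len:
            continue  # shorter than the target length -> dropped entirely
        kept.append(read[:trunc_len])  # 3' nucleotides removed down to trunc_len
    return kept

reads = ["ACGTACGTAA", "ACGT", "ACGTACG"]
print(truncate_reads(reads, 7))  # -> ['ACGTACG', 'ACGTACG']
```

Note that nothing shorter than the truncation length survives, which is why an overly aggressive trunc-len can wipe out most of a run.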
My first suggestion is to run this through DADA2 with whatever trim/trunc parameters make sense for removing the noise, then check what the output read-length distribution looks like; merging may take care of most of the length discrepancies for you. If not, we can apply some additional filtering after DADA2. Let's check in then.
Keep us posted!
PS - I moved this out into its own topic, that way it's more easily searchable for future readers. Thanks!
I've first tried to get rid of the noise using the following trunc parameters in DADA2:
Not entirely sure if this is what you meant, but it does produce only features of lengths >156 bp, though running up to 299 bp (here's a screenshot of a subset of two samples I'm running just to test this):
If I use any longer trunc-lengths, all my features are removed and nothing is left.
I just came across this thread where you show how to filter out sequences based on length without having to export. I tried this command using --p-where 'length(sequence) < 160' on the features, which works fine (now I only have ASVs of 157-159 bp), but it just feels like I'm working around the truncation step somehow..
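Conceptually, that length filter just keeps the ASVs whose sequence length falls under the cutoff. A minimal Python sketch of the idea (the feature IDs and lengths below are made-up examples, not from this dataset):

```python
def filter_by_length(seqs, max_len):
    """Keep only sequences strictly shorter than max_len,
    mirroring a 'length(sequence) < max_len' predicate."""
    return {fid: s for fid, s in seqs.items() if len(s) < max_len}

# Hypothetical ASVs: two near the expected 158 bp, one overlong at 299 bp.
asvs = {"asv1": "A" * 157, "asv2": "A" * 159, "asv3": "A" * 299}
kept = filter_by_length(asvs, max_len=160)
print(sorted(kept))  # -> ['asv1', 'asv2']
```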
Please let me know your thoughts, I highly appreciate the feedback.
No, not quite. I was referring to the trim/trunc params and what they do to the sequences in DADA2, prior to denoising and joining. Post-joining (which is what you just shared in your screenshot) is a wholly different beast. Read joining ideally produces sequences that are precisely your target region, but sometimes things go haywire, which is why we have trim/trunc params: removing messy/noisy/problematic nucleotides goes a long way toward improving joining.
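One quick sanity check you can do on paper: after truncation, the forward and reverse reads still need to overlap enough to join (DADA2's merge step requires a minimum overlap, 12 nt by default in the R package). A back-of-the-envelope sketch, using this dataset's 158 bp amplicon and some hypothetical trunc lengths:

```python
MIN_OVERLAP = 12  # DADA2's default minimum overlap for merging

def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Overlap between truncated forward and reverse reads spanning the amplicon."""
    return trunc_len_f + trunc_len_r - amplicon_len

for f, r in [(100, 100), (85, 90), (80, 85)]:
    ov = expected_overlap(f, r, 158)
    verdict = "joins" if ov >= MIN_OVERLAP else "too short to join"
    print(f"trunc-len-f={f}, trunc-len-r={r}: overlap={ov} -> {verdict}")
```

With a 158 bp amplicon and 2x250 reads you have plenty of room, so the truncation lengths are mostly about trimming away the noisy 3' tails rather than preserving overlap.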
Perfect! That's what I was going to suggest next!
Hopefully what I just shared above clarifies that you're now looking at what is basically all new data - this isn't a workaround at all. Before you get too far past your denoising though, let's double check the "Seven-Number Summary of Sequence Lengths". Hopefully you don't have too many reads that are longer than your expected sequence length. If you do, though, that might be an indication that your trim/trunc params need to be tweaked a bit further before proceeding.