2021-11-03 trimming/trunc question

marcelpolling · November 3, 2021, 9:49am

Hi all,

Thanks for the useful explanation @lizgehret!

I do, however, still have a question related to this, since I can't seem to get my head around this truncation.

Here's an image of paired-end 2x250 MiSeq reads that were stripped of their forward and reverse primers using Cutadapt, in both the forward and reverse reads. The amplicon size is 158 bp for this COI dataset (using specific primers for mites, no variation in amplicon length).

Now if I use the trunc-len parameters, I can only specify the minimum length (I could use p-trunc-len 157). But how do I get rid of the longer reads in this case?

edit: Maybe I should add that my problem would probably be solved if I could specify a maximum length in Cutadapt, although then I would have to account for the minimum overlap in the next step.

Thanks so much for letting me know,

Best regards,

Marcel

thermokarst · November 3, 2021, 2:27pm

Hi there @marcelpolling!

Specifying a trunc-len of X will do two things:

Remove 3' nts from every read longer than X in order to bring the read's length down to X
Drop any read shorter than X - you can't truncate something that is shorter than the target length.

Keep in mind that for PE seqs the trunc-len parameter is separated by read direction.
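
As a rough illustration of the two rules above, applied separately to each read direction, here is a minimal Python sketch. It is not DADA2's or QIIME 2's code: the function names `truncate_read` and `truncate_pairs` are hypothetical, and the sketch assumes that a pair is discarded as soon as either of its reads is shorter than its target length.

```python
def truncate_read(read, trunc_len):
    """Cut a read down to trunc_len from the 3' end, or return None if it is too short."""
    if len(read) < trunc_len:
        # A read shorter than the target length cannot be truncated to it.
        return None
    return read[:trunc_len]


def truncate_pairs(pairs, trunc_len_f, trunc_len_r):
    """Apply separate truncation lengths to forward and reverse reads.

    Assumption for this sketch: a pair is kept only if both reads survive.
    """
    kept = []
    for fwd, rev in pairs:
        f = truncate_read(fwd, trunc_len_f)
        r = truncate_read(rev, trunc_len_r)
        if f is not None and r is not None:
            kept.append((f, r))
    return kept


if __name__ == "__main__":
    toy_pairs = [
        ("ACGTACGT", "TTGGCCAA"),  # both long enough: truncated and kept
        ("ACGTAC", "TTGG"),        # reverse read too short: pair dropped
        ("ACG", "TTGGCCAA"),       # forward read too short: pair dropped
    ]
    print(truncate_pairs(toy_pairs, trunc_len_f=6, trunc_len_r=5))
```

Note that truncation only sets a lower bound on the length of each input read; it says nothing about the length of the merged sequence that comes out after joining.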

My first suggestion is to run this through DADA2 with whatever trim/trunc params make sense for removing the noise - then check and see what the output read length distribution looks like - perhaps merging will take care of most of the length discrepancies for you. If not, we can apply some additional filtering after DADA2 - let's check in then.

Keep us posted!

PS - I moved this out into its own topic, that way it's more easily searchable for future readers. Thanks!

marcelpolling · November 3, 2021, 4:52pm

Hi @thermokarst, and thanks so much for your quick response!

I've first tried to get rid of the noise using the following trunc parameters in DADA2:

--p-trunc-len-f 157
--p-trunc-len-r 157

I'm not entirely sure this is what you meant. It does produce only features longer than 156 bp, but their lengths run up to 299 bp (here's a screenshot of a subset of two samples I'm running just to test this):

If I use any longer trunc-lengths, I remove all my features and nothing is left.

I just came across this thread where you show how to filter out sequences based on length without having to export. I tried this command using --p-where 'length(sequence) < 160' on the features, which works fine (now I only have ASVs of 157-159 bp), but it just feels like I'm working around the truncation step somehow.
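
To make the length filter concrete, here is a small Python sketch that evaluates the same expression with the standard-library `sqlite3` module. It assumes the where expression follows SQLite WHERE-clause syntax; the table name `features`, the helper `filter_by_where`, and the toy sequences are hypothetical and only serve the illustration.

```python
import sqlite3


def filter_by_where(sequences, where):
    """Return the IDs of sequences matching a SQLite WHERE expression.

    sequences: dict mapping feature ID -> sequence string.
    where: expression such as "length(sequence) < 160".
    The expression is interpolated directly into the query, so this is
    suitable only for trusted, hand-written input.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, sequence TEXT)")
        conn.executemany("INSERT INTO features VALUES (?, ?)", sequences.items())
        rows = conn.execute(
            f"SELECT id FROM features WHERE {where} ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Toy features with lengths 157, 158 and 299 (not real data).
    toy = {"asv1": "A" * 157, "asv2": "A" * 158, "asv3": "A" * 299}
    print(filter_by_where(toy, "length(sequence) < 160"))
```

The filter acts on the merged sequences, so it is a separate step from truncation, which acts on the individual reads before merging.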

Please let me know your thoughts, I highly appreciate the feedback.

Best regards,

Marcel

thermokarst · November 3, 2021, 5:00pm

No - not quite - I was referring to the trim/trunc params, and what they do to the sequences in DADA2, prior to denoising and joining. Post-joining (which is what you just shared in your screenshot) is a wholly different beast. The read joining ideally joins in a way that produces sequences that are precisely your target region, but sometimes things go haywire (which is why we have trim/trunc params - removing messy/noisy/problematic nts really goes a long way with improving joining).

Perfect! That's what I was going to suggest next!

Hopefully what I just shared above clarifies that you're now looking at what is basically all new data - this isn't a workaround at all. Before you get too far out from your denoising, though, let's double check the "Seven-Number Summary of Sequence Lengths". Hopefully you don't have too many reads that are longer than your expected sequence length. If you do, though, that might be an indication that your trim/trunc params need to be tweaked a bit further before proceeding.
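
For readers who want to run this check on a plain list of sequence lengths, here is a minimal Python sketch. It assumes the seven-number summary reports the 2nd, 9th, 25th, 50th, 75th, 91st and 98th percentiles, and the function names are hypothetical; the percentile interpolation may differ slightly from what the visualization reports.

```python
from statistics import quantiles

# Assumed percentile set for a seven-number summary.
PERCENTILES = (2, 9, 25, 50, 75, 91, 98)


def seven_number_summary(lengths):
    """Map each assumed percentile to its value for a list of lengths (needs >= 2 values)."""
    cuts = quantiles(lengths, n=100, method="inclusive")  # 99 cut points: 1st..99th
    return {p: cuts[p - 1] for p in PERCENTILES}


def fraction_longer_than(lengths, expected):
    """Fraction of sequences strictly longer than the expected amplicon length."""
    return sum(1 for n in lengths if n > expected) / len(lengths)


if __name__ == "__main__":
    # Toy lengths, not real data: mostly 158 bp with a few long outliers.
    toy_lengths = [158] * 95 + [299] * 5
    print(seven_number_summary(toy_lengths))
    print(fraction_longer_than(toy_lengths, expected=158))
```

If the fraction above the expected length is small, a length filter after denoising is probably enough; if it is large, that points back at the trim/trunc parameters.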

This step of an analysis can be a bit cyclical!

marcelpolling · November 3, 2021, 5:32pm

OK, I think I got it, and indeed the Seven-Number Summary shows that only a small percentage is longer than the expected 158 bp:

So that means the merging/joining is going well, and there are just some erroneous ASVs that I can filter out, either at a later stage or using the '--p-where' option?

system · December 4, 2021, 11:33pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.