Where to trim reads

Sounds reasonable. And running just the forward reads is a great idea just in case (and if nothing else, it is a useful comparison).

:+1:

Based on all the above discussion, I would think yes, subtract the primer lengths from the total amplicon length you have cited, unless you aren't sure whether that length includes the primers (to be on the safe side, perhaps we should just assume that it does include the primers, and so you would not subtract them).

It is lower than I would like personally, but it really does not sound bad. I would personally opt for fewer, longer sequences over more, shorter sequences, so long as I have enough reads.

So it is definitely worth running it all ways and comparing the stats output to decide what works best for you.

You could also experiment with the --p-trunc-q parameter instead; since you are joining paired-end reads, it may prove useful here for getting more length out of higher-quality reads, instead of trimming them all at the same length.
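For illustration, a minimal sketch of what that might look like in a denoise-paired call; the file names and the quality threshold of 20 are placeholders, not values from this thread. Setting the truncation lengths to 0 disables fixed-length truncation, so only the quality threshold decides where each read is cut:

```
# Placeholder file names; the point is --p-trunc-q with the trunc-len settings at 0.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-q 20 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```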

Sorry this is such a difficult trial-and-error process! The good news is that all steps after denoising are usually easier.

I hope that helps!

Okay, thank you. I also tried one cut at 293, which includes two bases with a quality score of 13 and one with a 14.

I guess I'm wondering: if I can get enough reads out of this, is using lower-quality + paired-end still better than just using the forward reads?

In terms of --p-trunc-q:

"Reads are truncated at the first instance of a quality score less than or equal to this value. If the resulting read is then shorter **than trunc_len_f or trunc_len_r (depending on the direction of the read) it is discarded.**"

I'm wondering what you think would be a useful way to use this parameter in this context.

That's very subjective. I'd say no: if you don't get enough reads, then shorter is better (and I have gone that route when in your position). But if it is a matter of losing some samples that maybe are not critical... then it becomes a balance of priorities.

You could just omit the trunc_len parameters, and I think this should still work (relying on the read-merging stage to tell you whether you have trimmed too much). But you could also set them to values that you KNOW will work, with wiggle room for length variation and for the possibility that one read is high quality and truncated less while its paired read is truncated more (within reason)... so maybe 200 nt in this case?
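Concretely, the "wiggle room" option might look something like this; the file names and the quality threshold are again placeholders, and the 200 nt floor is just the ballpark suggested above. Per the quoted docstring, any read that --p-trunc-q cuts down below 200 nt would then be discarded rather than carried into merging:

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-q 20 \
  --p-trunc-len-f 200 \
  --p-trunc-len-r 200 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```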

I hope that helps! Please share the stats results when all runs are done and we can help make a decision.

Hello! Excitingly, it worked a lot better with 300f and 273r. The person who did the sequencing didn't think there was any one length of the region that could be calculated, but noted that because it's a variable region it can vary quite a bit, even up to 550 nt. So even in the case of 550 nt, it looks like this cutoff would still give 26 nt of overlap.

What do you think about the fact that several lower-quality bases had to be used with this cutoff? It seems to me that this is just what had to be done with this particular dataset? And it seems fairly high quality overall. I'm moving forward with this one.

Even with this cutoff, if I want to use only samples with, say, > 10,000 reads, I would have to discard a couple hundred, and about 100 if I want to use samples with > 5,000 reads, though this may be okay.
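If you do end up imposing a per-sample read-count floor, one way to do it downstream is feature-table filter-samples; the table name here is a placeholder and 10,000 is just the example threshold from above:

```
# Drop samples whose total read count in the feature table is below 10,000.
qiime feature-table filter-samples \
  --i-table table.qza \
  --p-min-frequency 10000 \
  --o-filtered-table table-min10k.qza
```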

Interestingly, I also tried cutting at 300f and 292r, and despite the additional overlap it actually did not merge as well as 300f and 273r; in fact, it was much worse, presumably because the sequence quality really drops off after 273.
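For reference, a sketch of what the winning 300f/273r run would look like as a denoise-paired call; the file names are placeholders, and any other parameters actually used (such as trim-left for primers) are not shown here:

```
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 273 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```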

The forward-only attempt also worked well, and there I would need to discard fewer samples. However, I'd like to use paired-end if possible, since my reading indicates that it is better able to align to reference sequences and to detect repetitive elements and rearrangements.

Thank you again for all of your help!!

p.s. I also tried several different quality cutoffs (20, 24, and 26) with 200 specified for both f and r, but none of those worked: they all had mostly 0, and a couple had only 2, merged non-chimeric reads. Anyway, the forward and reverse cuts I did use worked great; I just wanted to share that as well.

:tada: That's fantastic! This does look a lot better. You are still losing some sequences during merging, but this is a small fraction so probably doesn't matter.

If you can, it may be interesting to run all the same downstream analyses (or at least look at taxonomic composition and beta diversity) on both the single-end and these paired data, to make sure the choice does not matter. But don't sweat it.
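A rough sketch of that comparison, assuming the single-end and paired runs each produced their own table and representative sequences; the metadata file, pre-trained classifier, and sampling depth are all placeholders here:

```
# Beta diversity on each feature table; use the same sampling depth so the runs are comparable.
qiime diversity core-metrics \
  --i-table table-paired.qza \
  --p-sampling-depth 5000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-paired

qiime diversity core-metrics \
  --i-table table-single.qza \
  --p-sampling-depth 5000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-single

# Taxonomic composition for the paired run (repeat with the single-end artifacts).
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs-paired.qza \
  --o-classification taxonomy-paired.qza

qiime taxa barplot \
  --i-table table-paired.qza \
  --i-taxonomy taxonomy-paired.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-barplot-paired.qzv
```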

It is fairly high quality. And put it this way: if including the lower-quality bases were a problem, dada2 would let you know (by dropping reads at the filtering step)! You have lots of reads in the output, so you can move on.

Longer sequences contain more information, so it is always better when possible.

Thanks for sharing! Very glad you were able to get this working in the end (it is almost always a maddening trial-and-error process!)

