Trim/trunc length for ITS

DADA2 QZVs: do these work?

denoising-stats.qzv (1.2 MB)
table.qzv (375.6 KB)

I will check those out. I will also update QIIME 2 so I can get all the information required.

Thanks! That works.

So my hunch was wrong. This is not a merging problem:

Most sequences are being filtered out at the "filter" step, which seems a little weird because your quality plots looked so nice. Very few are dropped at merging. So you should adjust your truncation parameters to cut out more noise! Still, let's discuss merging, because the more you truncate, the more you risk turning this into a merging problem.

Yep, looks like you figured it out. You could even consult an agarose gel if you ran these amplicons on one (and I might trust that more... 230 sounds really short from my foggy recollection of where ITS1F is positioned in the 18S :older_man:).

Looks like my recollection is not too foggy after all: 230 is shorter than the average amplicon size you should expect (I wonder how they derived that number! Probably the average of their amplicons, which is organism-specific; see below). ITS1F is positioned near ITS1F_KYO1 according to this:

According to my own dusty old research (see Table 2), that region has an amplicon length of 275.3 ± 103.2 nt for Ascomycota and 285.3 ± 50.1 nt for Basidiomycota.

Note also that ITS is hypervariable (see the SDs in those numbers above!), so whatever average we decide on is just that: an average. Depending on the species in your samples, amplicon lengths could run much higher, and truncating even at full read length could cause some species to drop out. That would bias your results against particular organisms, since amplicon length is taxon-specific. So, if you can, check the amplicon length distribution in your own samples and pick a reasonable upper limit rather than an average; use that to decide how much truncation you can afford. One way to eyeball that distribution is sketched below.
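
If you already have a denoised trial run to inspect, tabulating the representative sequences gives you a length summary you can use for that upper limit. A minimal sketch, assuming your representative-sequences artifact is named rep-seqs.qza (a placeholder):

```bash
# Sketch: view the length distribution of your denoised amplicons.
# "rep-seqs.qza" is a placeholder for your representative-sequences artifact.
qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv
```

The resulting .qzv includes per-sequence lengths and summary statistics.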

You need a minimum of 20 nt for reasonable overlap. And let's say you want to aim for amplicons of ~480 nt as an upper bound (Ascomycota mean length of 275.3 nt + 2 standard deviations × 103.2 nt ≈ 481.7 nt). So 480 + 20 = 500 nt must be your minimum combined paired-end read length after truncation (excluding whatever you trim from the 5' ends).
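
To make that arithmetic explicit, here is a quick sketch; the mean and SD come from that dusty old table, so treat them as assumptions rather than measurements from your run:

```bash
# Sketch: truncation budget from the Ascomycota numbers above.
# mean/sd are assumptions from the old table; overlap is the minimum
# recommended for merging.
awk 'BEGIN {
  mean = 275.3; sd = 103.2; overlap = 20
  upper = mean + 2 * sd   # ~481.7 nt amplicon upper bound
  printf "amplicon upper bound: %.1f nt\n", upper
  printf "min combined trunc length: %.1f nt\n", upper + overlap
}'
```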

This gives you lots of wiggle room for truncating your sequences, but it's based on that dusty old table! Check your actual amplicon length distribution if you can (or, if you are not studying anything weird like Glomeromycota, feel free to just use that dusty old table).

So you could, say, truncate at 260 forward and 240 reverse (260 + 240 = 500 nt combined) and be okay.

You could also experiment: truncate at different lengths and look at your DADA2 stats to see where reads begin dropping out at the merging stage; one way to script that sweep is sketched below.
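
A minimal sketch of such a sweep with the QIIME 2 CLI, using the truncation pairs discussed in this thread; the input filename (trimmed-demux.qza) is a placeholder for your primer-trimmed reads:

```bash
# Sketch: sweep a few trunc-len pairs and collect the denoising stats.
for lens in 260,240 200,200 150,150; do
  f=${lens%,*}; r=${lens#*,}
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs trimmed-demux.qza \
    --p-trunc-len-f "$f" \
    --p-trunc-len-r "$r" \
    --o-table "table-${f}-${r}.qza" \
    --o-representative-sequences "rep-seqs-${f}-${r}.qza" \
    --o-denoising-stats "stats-${f}-${r}.qza"
  # Tabulate each stats artifact so you can compare filtered vs. merged counts
  qiime metadata tabulate \
    --m-input-file "stats-${f}-${r}.qza" \
    --o-visualization "stats-${f}-${r}.qzv"
done
```

Open each stats .qzv and compare the filtered and merged columns across runs.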

There is a balance to strike here: the more you truncate, the fewer sequences you will lose to filtering, but the more you will lose at merging. Make sense?

I hope that all makes sense, and I certainly hope it helps. :smile:

You are amazing thanks for all the information :smile:

I am only looking to find the ectomycorrhizal fungi in my samples, but I am not sure what I will find, so I want to conserve as much of my data as possible.

Last night, after doing some online research, I noticed that the filtering step removed a lot of my data, but I could not figure out how to deal with it. I will update both the VM and QIIME 2 versions, rerun my demux summary, and then rerun DADA2.

I will keep you updated. I've asked a million questions and you've given me such valuable information that I'm sure it will be helpful for other users too.

Thanks again :grinning:

Hi Nicholas,

Quick update; I hope you get this before you leave for the day.

Based on this, my average appears to be 301 nt, but I am a little confused about how to use this value to get the trim/trunc parameters. I am still researching and trying to understand your post above, but I wanted to send this over in case you were still in your office. (These are the demultiplexed samples, with the 5' end adapters and primers removed and the 3' ends still attached.) I am trimming the 3' ends and can submit that summary in a bit.

301 nt is the read length, not the amplicon length.

This summary will be useful for figuring out whether cutadapt successfully trims the reads and how much trimming occurs, but it is not useful for figuring out the amplicon length.
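
For anyone following along, this is roughly the pattern that produces that summary. A sketch only: the thread mentions ITS1F, so I am assuming the standard ITS1F/ITS2 pair here; substitute your actual primers and filenames:

```bash
# Sketch: trim 5' primers with cutadapt, then summarize read lengths.
# Primer sequences are the standard ITS1F/ITS2 pair (an assumption).
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CTTGGTCATTTAGAGGAAGTAA \
  --p-front-r GCTGCGTTCTTCATCGATGC \
  --o-trimmed-sequences trimmed-demux.qza

qiime demux summarize \
  --i-data trimmed-demux.qza \
  --o-visualization trimmed-demux.qzv
```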

I hope that helps clarify! Let me know if anything else is unclear.

Hi Nicholas,

So I had a fun weekend: I performed several DADA2 runs with different trim/trunc lengths, starting with 299/290 and 290/280, going down to your suggestion of 260/240, then 200/200, and lastly 150/150 (I found a tutorial by the Kennedy Lab (Univ. of Minnesota) where the reads are truncated at 150 bp for both F/R ends). Based on this, I looked over my trimmed-demux summary data, and the average length of my samples was 137 bp / 149 bp. I just wanted to run the numbers by you and see what you thought.

I am currently running 250/250 just in case, but the runs above 200 (prior to the one currently running) failed and looked very similar to the one above. Any comments are highly appreciated.

150/150 Trim/trunc (Edit: I had originally said these were the 150/150 values, but I mislabelled them; the 150/150 values are at the bottom, and these are the 200/200 values.)

200/200 Trim/Trunc

Thanks

150/150 seems overly stringent here, but it does not result in that many fewer merged reads, so most of your amplicons are probably on the short side (i.e., centering around that mean amplicon length of 275.3 nt, which would be fully covered by overlapping 2 × 150 nt paired-end reads).

Since your data appear to be reasonably high quality, I would personally stick with less truncation (i.e., longer sequences) just in case there are some longer amplicons in there; I don't want those to drop out due to insufficient overlap at merging.

In any case, 200/200 looks good and seems to output more merged reads than 150/150 (consistent with my worry that some of the longer amplicons may be failing to merge at 150/150). If your other tests yield more merged reads, go with that run! (A quick way to tally merged reads across runs is sketched below.)
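
If you want to tally those merged read counts outside of the .qzv viewer, something like this works. The filenames are placeholders, and the "merged" column name reflects the DADA2 stats format in recent QIIME 2 releases, so check your own file's header first:

```bash
# Sketch: export each denoising-stats artifact and sum its "merged" column.
# Exported stats.tsv has two header lines (column names, then "#q2:types"),
# hence NR > 2 below.
for stats in stats-150-150.qza stats-200-200.qza; do
  dir=${stats%.qza}
  qiime tools export --input-path "$stats" --output-path "$dir"
  awk -F'\t' 'NR == 1 {for (i = 1; i <= NF; i++) if ($i == "merged") c = i}
              NR > 2  {sum += $c}
              END     {printf "%s: %d merged reads\n", FILENAME, sum}' \
      "$dir/stats.tsv"
done
```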

Sounds like you’ve done a good little benchmark here to figure out what works best for your data! The good news is all steps after denoising should get “easier” (i.e., less fiddling with data required :violin:)

I hope that helps!

Perfect, thank you for explaining the results again. I definitely do not want to lose any data, so as soon as the other run goes through, I will compare the merged reads and make my decision based on your recommendation.

Thanks again :smile:
