Trim/trunc length for ITS

Nicholas_Bokulich · July 20, 2018, 9:08pm

Thanks! That works.

So my hunch was wrong. This is not a merging problem:

Most sequences are being filtered out at the "filter" step, which seems a little weird because your quality plots looked so nice. Very few are dropped at merging. So you should adjust your truncation parameters to cut out more of the noisy 3' ends! Still, let's discuss merging, because the more you truncate, the more you risk turning this into a merging problem.

Yep, looks like you figured it out. You could also consult an agarose gel if you ran these amplicons on one (and I might trust that more... 230 sounds really short from my foggy recollection of where ITS1F is positioned in the 18S).

Looks like my recollection is not too foggy after all: 230 is shorter than the average amplicon size you should expect. (I wonder how they derived that number! Probably the average of their own amplicons, which is organism-specific; see below.) ITS1F is positioned near ITS1F_KYO1 according to this:

Which, according to my own dusty old research (see table 2), has an amplicon length of 275.3 ± 103.2 nt for Ascomycota and 285.3 ± 50.1 nt for Basidiomycota.

Note also that ITS is hypervariable (see the SD in those numbers above!), so whatever average we decide on is just that: an average. Depending on the species in your samples, the amplicons could be much longer, and truncating so that your reads only just span the average length could cause species with longer amplicons to drop out at merging. That would bias against particular organisms, since amplicon length is taxon-specific. So check the amplicon length distribution in your own samples if you can, and decide on a reasonable upper limit rather than an average. Use that upper limit to decide how much truncation you can afford.
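To make the "upper limit rather than an average" idea concrete, here is a minimal sketch. It assumes you already have a list of amplicon lengths from your own samples, however you obtained them; the function name `length_upper_limit`, the 95th-percentile cutoff, and the example lengths are all made up for illustration, not something from the table above.

```python
import math

def length_upper_limit(lengths, percentile=95.0):
    """Return the nearest-rank percentile of a list of amplicon lengths (nt)."""
    if not lengths:
        raise ValueError("need at least one amplicon length")
    ordered = sorted(lengths)
    # Nearest-rank method: the smallest value with at least `percentile`
    # percent of the observations at or below it.
    rank = math.ceil(percentile / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Made-up lengths, purely illustrative.
example_lengths = [180, 210, 230, 250, 260, 275, 290, 310, 350, 420]

mean_length = sum(example_lengths) / len(example_lengths)
upper_limit = length_upper_limit(example_lengths, percentile=95.0)
print(f"mean = {mean_length:.1f} nt, 95th percentile = {upper_limit} nt")
```

In this toy example the mean sits far below the upper limit, which is exactly why planning truncation around the mean would lose the longest amplicons.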

You need a minimum of 20 nt for reasonable overlap. And let's say you want to aim for ~480 nt long amplicons as an upper bound (Ascomycota mean length 275.3 nt + 2 standard deviations × 103.2 nt ≈ 481.7 nt). So 480 + 20 = 500 nt must be your minimum combined paired-end read length after truncation (excluding anything trimmed from the 5' ends).
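The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical helpers, not part of any QIIME 2 or DADA2 API, and the choice of 2 standard deviations is just the rule of thumb used in this post. Note that the unrounded result is 501.7 nt; the post rounds the upper bound down to ~480 nt to get 500.

```python
def min_combined_read_length(mean_len, sd, n_sd=2.0, min_overlap=20):
    """Upper-bound amplicon length (mean + n_sd * sd) plus the merge overlap."""
    upper_bound = mean_len + n_sd * sd
    return upper_bound + min_overlap

def expected_overlap(trunc_f, trunc_r, amplicon_len):
    """Overlap left between truncated forward and reverse reads.

    Ignores any 5' trimming, which would shorten the reads further.
    """
    return trunc_f + trunc_r - amplicon_len

# Ascomycota values quoted in this post: 275.3 +/- 103.2 nt.
required = min_combined_read_length(275.3, 103.2)
print(f"minimum combined read length: {required:.1f} nt")

# Overlap for a 480 nt amplicon when truncating at 260 (forward) / 240 (reverse).
print(expected_overlap(260, 240, 480))  # 20
```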

This gives you lots of wiggle room for truncating your sequences, but it's based on that dusty old table! Check your actual amplicon length distribution if you can (or, if you are not studying anything weird like Glomeromycota, just feel free to use that dusty old table).

So you could, say, truncate at 260 forward and 240 reverse and be okay.

You could also experiment: truncate at different lengths and look at your dada2 stats to see where reads begin dropping out at the merging stage.
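If you do run that experiment, a small helper like the one below can make the comparison easier to read. This is only a sketch: the `step_losses` function is hypothetical, the counts are made up, and the keys (`input`, `filtered`, `denoised`, `merged`) are assumed to match the per-sample columns in your DADA2 stats, so check them against your own output.

```python
def step_losses(counts):
    """Fraction of input reads lost at filtering and at merging."""
    n_input = counts["input"]
    if n_input <= 0:
        raise ValueError("input read count must be positive")
    lost_at_filter = (n_input - counts["filtered"]) / n_input
    lost_at_merge = (counts["denoised"] - counts["merged"]) / n_input
    return lost_at_filter, lost_at_merge

# Made-up counts for two hypothetical (forward, reverse) truncation settings.
runs = {
    (280, 260): {"input": 10000, "filtered": 3000, "denoised": 2900, "merged": 2850},
    (260, 240): {"input": 10000, "filtered": 7000, "denoised": 6800, "merged": 6500},
}

for (trunc_f, trunc_r), counts in runs.items():
    lost_at_filter, lost_at_merge = step_losses(counts)
    print(
        f"trunc {trunc_f}/{trunc_r}: "
        f"{lost_at_filter:.1%} lost at filtering, "
        f"{lost_at_merge:.1%} lost at merging"
    )
```

Comparing the two fractions across settings shows where the losses shift from the filter step to the merge step as you truncate more.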

There is a balance to strike here: the more you truncate, the fewer sequences you will lose to filtering, but the more you will lose at merging. Make sense?

I hope that all makes sense, and I certainly hope it helps.