Trim/trunc length for ITS

I see; sorry, I guess I took the warning as an error, since not everything was “removed”. After adding the extra T, the exact same warning came up, suggesting I add another T.

So do you think I should contact them and see what they think? Or can I assume the reads are technically okay, and that this will be removed further down the pipeline when I trim/trunc?

Once again, thank you

No need to contact them — the warning is just a suggestion. You know what your adapters were, so you have much better information than the computer in this case.

Yes, everything is okay. That warning makes complete sense here — the 16S rRNA gene is fairly well conserved and the primers are nestled in highly conserved sites, so having highly conserved bases next to the primer/adapter sequence is not a surprise.

As far as I know, cutadapt is not specific to microbiome studies, so this warning may be there for more general uses, e.g., whole genome sequencing where highly conserved bases next to the adapter would rightly set red flags flying! :triangular_flag_on_post:

I hope that helps!

2 Likes

Perfect. I am actually working with the ITS1F/ITS2 primer pair.

As previously mentioned, I did want to run my choice of trunc/trim lengths by you, just to feel better about my decision. I have attached the relevant information below and would really appreciate some input.

Based on these graphs and the Phred scores, if I am understanding them correctly, I would choose the following values. I feel confident about my forward reads, but for my reverse reads I am not 100% sure. The quality scores are dropping, but at a steady pace, up until ~298, so I want to truncate somewhere between 290 and 298; I just don't know whether I need to be more conservative.
Again, your input is highly appreciated.

Forward
--p-trim-left = 6
--p-trunc-len = 299

Reverse
--p-trim-left = 6
--p-trunc-len = 290-298?
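
In full command form, that would be something like this (file names are placeholders, and I have put 290 for the reverse trunc-len pending a decision):

# trimmed.qza: primer-trimmed paired-end reads (placeholder name)
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed.qza \
  --p-trim-left-f 6 \
  --p-trim-left-r 6 \
  --p-trunc-len-f 299 \
  --p-trunc-len-r 290 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza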




Output of summary.qzv
Demultiplexed sequence counts summary

Minimum: 1 (only one sample; the next lowest is 4844)
Median: 19601.0
Mean: 21193.5654206
Maximum: 55062
Total: 4535423

Forward reads

Zoom in on forward reads

Reverse reads

Zoom in on reverse reads

Thanks again :slight_smile:

Hi Nicholas,

Additional question.

I decided to run the trim/trunc as 6/299 for the forward reads and 6/290 for the reverse reads, and after looking at table.qzv, I noticed that I have terribly low frequencies (see below).

I am currently running two other parameter sets to see if maybe I needed to truncate more to remove added noise, but in the meantime I wanted to ask if you had any idea what this could mean.

Note: I did read another post which states that for ITS data, after DADA2, I would run feature-table filter-features (is this correct?), and I wanted to ask whether this would be okay to do with my samples. Link here (ITS) and here (using feature-table filter-features on a low-frequency table.qzv).

Based on this information, could you tell me the best step for me to take, and/or refer me to another link?

Oh right, of course! Reading/writing too fast. :man_facepalming:

Same idea, though: the ITS primers are couched in highly conserved sections of the 18S and 5.8S rRNA genes that flank the ITS.

Those look like very well-chosen settings, given the high quality of your data.

There is no reason to do that with your workflow. The link you provided was specific to that user's workflow — they are exporting their data and running it through an external program before re-importing, so the filtering was done to remove dropped features, if I understand correctly. (And an ITSx-like method, ITSxpress, is also available as a third-party plugin now, so there is no reason to follow that workflow any more.)

Could you please share the actual QZV files? That will make them easier to peruse. Please also send a QZV of the DADA2 stats. I suspect you have a read-merging issue; the second link you sent above is probably the most relevant for troubleshooting, though your data quality looks very high.

One problem is that (if I recall correctly) ITS1F sits pretty far back in the 18S (I'm assuming this is the fungi-specific ITS1F, and that the F does not just stand for "forward" on the standard ITS1 primer), so your pairs might not be quite long enough to bridge the full ITS region. Do you want to figure out the expected amplicon length?

ITSxpress is now available as a QIIME 2 plugin.

To clarify, ITSxpress is not the same software as ITSx.

The two programs work a bit differently. ITSxpress merges reads, then clusters them temporarily at 99.5% identity by default (exact dereplication is also an option) to identify the start and stop sites. Those sites are then applied to all reads in the cluster. Importantly, it yields FASTQ files that can be used by Deblur, DADA2, etc., rather than FASTAs.

On a typical 4-core machine, ITSxpress is about 23x faster than ITSx for an ITS2 soil sample and 14x faster for an ITS1 soil sample. With the default 99.5% clustering, ITSx and ITSxpress trim 99.8% of ITS1 sequences and 99.1% of ITS2 sequences within 2 bases of each other. With 100% clustering, ITSxpress is 6-9x faster and trims 99.99% of ITS1 and 99.86% of ITS2 sequences within 2 bases of each other.
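
If you want to run it from within QIIME 2, the invocation looks roughly like this (a sketch; double-check the action and parameter names against the plugin's --help, and note the file names are placeholders):

# Trim paired-end reads to the ITS1 region for fungi (taxa code F),
# keeping reads unmerged so they can go straight into DADA2
qiime itsxpress trim-pair-output-unmerged \
  --i-per-sample-sequences demux.qza \
  --p-region ITS1 \
  --p-taxa F \
  --o-trimmed trimmed-its.qza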

6 Likes

Thanks @Adam_Rivers! I was sort of aware from reading the GitHub page that ITSxpress is intended for FASTQ data and does much more than ITSx, but I did not know how much more! I have edited my response above.

1 Like

Hi Nicholas,

I am attaching the QZV files requested: the demux, trimmed, and DADA2 summaries.

Demultiplexed Illumina files:
demux-summary.qzv (300.8 KB)

Trimmed summary (removal of 3' end primers):
trimmed_summary.qzv (306.2 KB)

DADA2 QZV files using the 6/299 and 6/290 trim parameters:
table.qzv (375.6 KB)
rep-seqs.qzv (355.0 KB)
denoising-stats.qzv (1.2 MB)

How can I figure out the expected amplicon length, and would this then be my trunc length? I do have an article, here, which gives me an idea of the amplicon length: ~230 bp.

Thank you again

Thank you for the update. I wasn't sure if I needed to do this, since the tutorial did not mention it, but thanks to your response and Nicholas's I know better now.

It looks like the denoising stats did not attach properly (I cannot view them). Could you try re-sending, or upload them to Dropbox?

It looks like you are running an earlier version of :qiime2:, so you don't have the length summaries in demux-summary.qzv. Maybe you found another way to sort out your earlier question, but I just wanted to clarify, in case you were wondering, that my advice about length distributions is based on the most recent release of QIIME 2, where this feature was added.
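
Once you upgrade, regenerating the summary is a one-liner (file names are placeholders):

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux-summary.qzv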

DADA2 QZVs: do these work?

denoising-stats.qzv (1.2 MB)
table.qzv (375.6 KB)

I will check those out. I will also update QIIME 2 so I can get all the information required.

Thanks! That works.

So my hunch was wrong. This is not a merging problem:

Most sequences are being filtered out at the "filter" step, which seems a little weird because your quality plots looked so nice. Very few are dropped at merging. So you should adjust your truncation parameters to cut out more noise! Still, let's discuss merging, because the more you truncate, the more you risk turning this into a merging problem.

Yep, looks like you figured it out. You could also consult an agarose gel, if you ran these amplicons on one (and I might trust that more... 230 sounds really short, given my foggy recollection of where ITS1F is positioned in the 18S :older_man:).

Looks like my recollection is not too foggy after all: 230 is shorter than the average amplicon size you should expect. (I wonder how they derived that number! Probably the average of their own amplicons, which is organism-specific; see below.) ITS1F is positioned near ITS1F_KYO1, according to this:

That primer, according to my own dusty old research (see Table 2), yields an amplicon length of 275.3 ± 103.2 nt for Ascomycota and 285.3 ± 50.1 nt for Basidiomycota.

Note also that ITS is hypervariable (see the SDs in those numbers above!), so whatever average we decide on is just that — an average — and depending on the species in your samples the true lengths could be much higher... and truncating even at full read length could cause some species to drop out, which would bias against particular organisms, since amplicon length is taxon-specific. So check the amplicon length distribution in your own samples if you can, and decide on a reasonable upper limit rather than an average. Use that to decide how much truncation you can afford.

You need a minimum of 20 nt for reasonable overlap. Let's say you want to aim for ~480 nt amplicons as an upper bound (the Ascomycota mean of 275.3 nt + 2 standard deviations × 103.2 ≈ 482 nt). So 480 + 20 nt = 500 nt must be your minimum combined paired-end read length (excluding trimming from the 5' ends).

This gives you lots of wiggle room for truncating your sequences, but it's based on that dusty old table! Check out your actual amplicon length distribution if you can (or, if you are not studying anything weird like Glomeromycota, just feel free to use that dusty old table).

So you could, say, truncate at 260 forward and 240 reverse (260 + 240 = 500 nt combined, right at that minimum) and be okay.

You could also experiment: truncate at different lengths and look at your DADA2 stats to see where reads begin dropping out at the merging stage.

There is a balance to strike here: the more you truncate, the fewer sequences you will lose to filtering, but the more you will lose at merging. Make sense?
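
If it helps, you could run that experiment with a little loop like this (just a sketch; file names are placeholders), then compare the "merged" counts across the stats visualizations:

# Sweep a few truncation lengths and collect DADA2 stats for each run.
# trimmed.qza is a placeholder for your primer-trimmed paired-end reads.
for len in "260 240" "200 200" "150 150"; do
  set -- $len
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs trimmed.qza \
    --p-trim-left-f 6 --p-trim-left-r 6 \
    --p-trunc-len-f $1 --p-trunc-len-r $2 \
    --o-table table-$1-$2.qza \
    --o-representative-sequences rep-seqs-$1-$2.qza \
    --o-denoising-stats stats-$1-$2.qza
  # Tabulate each stats artifact so the filter/merge columns are easy to compare
  qiime metadata tabulate \
    --m-input-file stats-$1-$2.qza \
    --o-visualization stats-$1-$2.qzv
done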

I hope that all makes sense, and I certainly hope it helps. :smile:

1 Like

You are amazing thanks for all the information :smile:

I am only looking to find the ectomycorrhizal fungi in my samples, but I am not sure what I will find, so I want to conserve as much of my data as possible.

Last night, after doing some online research, I did notice that the filtering step removed a lot of my data, but I could not figure out how to deal with it. I will update both the VM and QIIME 2 versions, rerun my demux summary, and then rerun DADA2.

I will keep you updated. I've asked a million questions and you've given me such valuable information, which I'm sure will be helpful for other users.

Thanks again :grinning:

1 Like

Hi Nicholas,

Quick update, I hope you get this before you leave for the day.

Based on this, my average appears to be 301 nt, but I am a little confused about how to use this value to get the trim/trunc parameters. I am still researching and trying to understand your post above, but I wanted to send this over in case you were still in your office. (These are the demultiplexed samples, with the 5' end adapters and primers removed and the 3' end still attached.) I am trimming the 3' end now and can submit that summary in a bit.
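
For reference, I am doing the 3' trimming with something along these lines (assuming the standard ITS1F and ITS2 primer sequences; the file names are placeholders):

# Trim read-through primer sequence from the 3' ends of paired reads:
# --p-adapter-f is the reverse complement of ITS2 (the reverse primer),
# --p-adapter-r is the reverse complement of ITS1F (the forward primer)
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f GCATCGATGAAGAACGCAGC \
  --p-adapter-r TTACTTCCTCTAAATGACCAAG \
  --o-trimmed-sequences trimmed.qza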

301 nt is the read length, not the amplicon length.

This summary will be useful for figuring out whether cutadapt successfully trims the reads, and how much trimming occurs, but it is not useful for figuring out the amplicon length.

I hope that helps clarify! Let me know if anything else is unclear.

Hi Nicholas,

So I had a fun weekend: I performed various DADA2 runs with different trim/trunc lengths, starting with 299/290 and 290/280, going down to your suggestion of 260/240, then 200/200, and lastly 150/150 (I found a tutorial by the Kennedy lab (Univ. of Minnesota) where the ends are truncated at 150 bp for both forward and reverse reads). Based on this, I looked over my trimmed-demux summary data, and the average length of my samples was 137 bp/149 bp. I just wanted to run the numbers by you and see what you thought.

I am currently running a 250/250, just in case, but the runs above 200 (prior to the one currently running) failed and looked very similar to the one above. Any comments are highly appreciated.

200/200 trim/trunc (Edit: I had originally labelled these as the 150/150 values, but I mislabelled them; the 150/150 values are the bottom ones, and these are the 200/200.)

150/150 trim/trunc

Thanks

2 Likes

150/150 seems overly stringent here — but it does not result in many fewer merged reads, so most of your amplicons are probably on the short side (i.e., centering around that mean amplicon length of 275.3 nt, which would be fully covered by overlapping 150 nt paired-end reads: 150 + 150 = 300 nt, leaving ~25 nt of overlap).

Since your data appear to be reasonably high quality, I would personally stick with less truncation (i.e., longer sequences) just in case there are some longer amplicons in there — you don't want those to drop out due to insufficient overlap at merging.

In any case, 200/200 looks good and seems to have more merged reads output than 150/150 (consistent with my worries that some of the longer amplicons may be failing to merge with 150/150). If your other tests yield more merged reads, go with that run!

Sounds like you’ve done a good little benchmark here to figure out what works best for your data! The good news is that all steps after denoising should get “easier” (i.e., less fiddling with data required :violin:).

I hope that helps!

1 Like

Perfect, thank you for explaining the results again. I definitely do not want to lose any data, so as soon as the other run goes through, I will compare the merged reads and make my decision based on your recommendation.

Thanks again :smile:

1 Like