Trim/trunc length for ITS

Fabs · July 16, 2018, 12:52am

Hello,

I have a quick question regarding the trim/trunc length parameters when working with Illumina paired-end sequences for the ITS regions (ITS1F & ITS2). My data was demultiplexed and I believe that the primers and adapters have been removed, but I am not 100% sure. (Can you tell me how I can verify this info?).

I was able to import and do the quality control for my data, but I have questions in terms of understanding the proper trimming parameters. I have looked at a few threads and the tutorials, but there is still not enough information for me to understand how to properly make the decision and why such lengths are chosen.

I have attached a copy of my quality graphs, and per these graphs, I considered trim = 5 and trunc = 240 (I did consider 300 since the forward reads were pretty good, but the reverse ones dip down quite a bit, so I decided on 240). Some forums online state that for ITS data, the trunc value should be left out, so that we avoid removing values for certain clades. Is this correct?

Can anyone please help me and/or guide me on how to make this decision? I also want to understand this decision-making process, so if there are any links to papers, resources or forums that can help me understand the trimming decision for ITS data, I would really appreciate it.

Demultiplexed sequence counts summary
Minimum: 11735
Median: 18472.0
Mean: 19693.375
Maximum: 31266
Total: 315094

Forward reads

Reverse reads

Lastly, besides the tutorials, can I get some links to where I can read on understanding the feature table summaries?

Thank you guys for all your help

Mehrbod_Estaki · July 17, 2018, 8:44pm

Hi @Fabs,

One easy way is to simply ask your sequencing facility what has been done to the reads and whether or not primers have been removed from both directions. Usually with 16S data you could just use something like head/tail to look at your fastq files and look for your primers at the beginning of your reads; however, with ITS data this is a bit trickier. Due to the large size variation of the ITS region, it is common to have read lengths (300 in your case) that are longer than the amplicon size (e.g. 150 bp), which leads to your reads running into the reverse primers on the 3' end. You'll want to deal with those by removing the opposing primers using cutadapt; just make sure that you use the reverse complement of your primers.
There’s a good discussion/workflow for a similar case here that might be useful and another one here as well.
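As a rough way to check for read-through yourself, you can count how many reads contain the reverse complement of the opposing primer. The sketch below is only an illustration: `count_reads_with` is a made-up helper, the two reads are invented, and the primer shown is the commonly published ITS2 sequence (GCTGCGTTCTTCATCGATGC), so please confirm it against your own order sheet.

```shell
# Hypothetical helper: count reads in a FASTQ stream that contain a given sequence.
# FASTQ records are 4 lines each; line 2 of every record is the sequence.
count_reads_with() {
  # $1 = sequence to search for; FASTQ text is read from stdin
  awk 'NR % 4 == 2' | grep -c -- "$1"
}

# Tiny invented example so the snippet runs on its own.
# The second read ends in GCATCGATGAAGAACGCAGC, the reverse complement of the
# commonly published ITS2 primer (an assumption; check your own primer sheet).
example_fastq='@read1
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@read2
TTGACCAGCATCGATGAAGAACGCAGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIII'

hits=$(printf '%s\n' "$example_fastq" | count_reads_with GCATCGATGAAGAACGCAGC)
echo "reads containing the reverse-complemented ITS2 primer: $hits"

# On real data (file name is a placeholder):
# gzip -dc sample_R1.fastq.gz | count_reads_with GCATCGATGAAGAACGCAGC
```

If a noticeable share of your forward reads match, read-through is likely and cutadapt is worth running.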

The reason why it may seem this way is that quite a few factors are involved in this decision-making process, which prevents a one-size-fits-all approach; at the end of the day it really depends on your data and what your end goal is. The dada2 tutorial and moving pictures tutorial do provide some guidance, but ultimately you'll have to select the values based on your data. For example, a general rule for minimal quality is to truncate where the median score drops below 20 in your quality plots. See here for the definition of these phred-like scores. An additional consideration for paired-end reads is to ensure there is enough overlap between your reads after truncating for proper merging; a minimum of 20 bp is adequate overlap.
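The overlap rule can be checked with simple arithmetic. This is a sketch only: the truncation lengths and the amplicon length are illustrative assumptions, not values measured from this dataset, and ITS amplicon lengths vary a lot between taxa.

```shell
# All numbers here are illustrative assumptions.
trunc_f=240        # forward truncation length
trunc_r=240        # reverse truncation length
amplicon_len=300   # hypothetical amplicon length after primer removal
min_overlap=20     # minimum overlap mentioned above

# Overlap left between the two truncated reads for this amplicon length.
overlap=$(( trunc_f + trunc_r - amplicon_len ))

# Longest amplicon that would still keep the minimum overlap.
max_amplicon=$(( trunc_f + trunc_r - min_overlap ))

if [ "$overlap" -ge "$min_overlap" ]; then
  echo "overlap of ${overlap} bp: enough to merge"
else
  echo "overlap of ${overlap} bp: too short to merge reliably"
fi
echo "longest amplicon that still merges: ${max_amplicon} bp"
```

With these example numbers the overlap is 180 bp, and amplicons longer than 460 bp would no longer merge.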

Remember you don't have to set the truncating length to the same for Forward and Reverse reads. So for example in your case I would set the Forward reads to ~295 and the Reverse reads to ~ 280. Reverse reads dipping in quality is very typical of Illumina runs. The good news is that your quality plots look excellent! I'm very jealous! So you probably don't need to worry about truncating and merging issues all that much.
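For reference, different forward and reverse truncation lengths could be passed to DADA2 roughly like this. It is a sketch, not a tested recipe for this dataset: all file names are placeholders, and the lengths are just the approximate values suggested above.

```shell
# Truncation lengths suggested above; revisit them after re-checking the
# quality plots of the cutadapt-trimmed reads.
trunc_f=295
trunc_r=280

# File names below are placeholders, not files from this thread.
if command -v qiime >/dev/null 2>&1; then
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs trimmed-seqs.qza \
    --p-trunc-len-f "$trunc_f" \
    --p-trunc-len-r "$trunc_r" \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats denoising-stats.qza
else
  echo "qiime not found on PATH; activate your QIIME 2 environment first"
fi
```

One thing to keep in mind: DADA2 discards reads shorter than the truncation length, and a value of 0 disables truncation, so it is worth looking at the read-length distribution after primer removal before settling on these numbers.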

Could you provide us with the links to these discussions? This is somewhat of an untested territory so we wanted to look into this a bit more in detail.

There isn’t currently a separate tutorial regarding the feature-table summaries, is there a particular question you have or area where you’d like more clarification?

Hope this answers some of your questions.

Fabs · July 17, 2018, 10:11pm

Hi Mehrbod

Thank you for your in depth explanation.

I found out from the sequencing lab that the 3' primers have in fact not been removed, so I do have to run cutadapt trim-paired on my samples before continuing with DADA2. I'm assuming that the large dip will not be as prominent once this is done, but would you mind if I message you with my complete data (graphs) once I process them, so that I can verify the trimming parameters I chose? (Note: this is only a very small portion of my samples; I just wanted to get the coding done properly and make sure it worked before running 216 samples.)

In terms of the forum where it says not to trim, I will look for it and send it to you with my next post, if that is okay, as this was, in fact, confusing for me to read as I was sure I had to include such parameters.

Fabs · July 17, 2018, 10:19pm

Sorry, one last question.

You mentioned that for the cutadapt portion I have to use the reverse complement. I thought I was supposed to determine such values myself (as you can tell, I am a first-time user of QIIME, and it is also my first time doing sequencing), but I looked at the information I provided the sequencing lab, and the reverse complement barcode was provided. Can you verify for me that this is, in fact, the information I need to run cutadapt?

Please see image below.

Once again, thank you for your help!

Mehrbod_Estaki · July 18, 2018, 12:41am

Glad you found them useful @Fabs!

Not at all! But if you're ok with sharing these on this thread it would be helpful for others in a similar situation. It also has the added benefit of other experts :qiime2:ing in. If it's a privacy issue though I understand.

If I'm not mistaken, these are the barcodes used for multiplexing and are likely already removed during demultiplexing, whereas you want to be feeding cutadapt the PCR primer sequences.

Fabs · July 18, 2018, 1:04am

Thanks again! Okay, so I am a little confused. On the data I submitted for sequencing I have the following info:
barcode (n/a)
Pad
Linker
Gene Primer (ITS2) reverse primer and the complete sequence (w/o pad)

as well as another line containing a value called RC Gn Primer, RC lin, RC pad and complete sequence w/o pad.

Based on this info, can you possibly guide me?

Thanks again. I did email the lab I worked out of, but I have yet to hear from her and I would really like to get this info processed.

Fabs · July 18, 2018, 12:06am

Hi Thermokarst,

Quick question,

I just found out I have to trim the adapters on the 3' end of both my forward and reverse reads. I want to make sure that I have the correct information to run the code. Do you mind taking a look at it and letting me know what you think? I have yet to run it, but I figured I'd ask first.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-ends.qza \
  --p-cores (NOT SURE HOW TO DETERMINE THIS VALUE) \
  --p-adapter-f [forward adapter sequence] \
  --p-adapter-r [reverse adapter sequence] \
  --p-error-rate 0 \
  --verbose

For --p-adapter-f (forward) and --p-adapter-r (reverse), would I use the same reverse-complement string? Per the sequencing facility, the primer and adapter sequences have been removed from the beginning of the reads, so I do not need further trimming on the 5' end.

Please let me know as I am a bit confused and have not found a forum containing an example (tutorial) using cutadapt trim paired ends for me to follow.

Nicholas_Bokulich · July 18, 2018, 3:47pm

No. Unless I misunderstand, you are concerned about read-through in both forward and reverse directions. So --p-adapter-f should be the reverse complement of the reverse primer, and --p-adapter-r should be the reverse complement of the forward primer.

Make sure that it matches the orientation of the primer as it would appear on each read. Just to be safe, I would manually inspect the sequences to make sure you are feeding in the primer sequences in the correct orientation. To see the first few sequences and inspect, use this command:

head sequences.fasta
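If it helps, the reverse complements can be generated on the command line instead of by hand. A minimal sketch: `revcomp` is a made-up helper, and the primers below are the commonly published ITS1F and ITS2 sequences, which is an assumption you should check against what you actually ordered.

```shell
# Hypothetical helper: reverse-complement a DNA string.
# Handles plain A/C/G/T only; degenerate IUPAC codes would need a longer mapping.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Commonly published primer sequences (assumption: verify against your order sheet).
its1f=CTTGGTCATTTAGAGGAAGTAA   # forward primer
its2=GCTGCGTTCTTCATCGATGC      # reverse primer

# Read-through in forward reads shows the reverse primer reverse-complemented,
# and vice versa for the reverse reads.
adapter_f=$(revcomp "$its2")
adapter_r=$(revcomp "$its1f")

echo "--p-adapter-f $adapter_f"   # GCATCGATGAAGAACGCAGC
echo "--p-adapter-r $adapter_r"   # TTACTTCCTCTAAATGACCAAG
```

It is still worth searching a few raw reads for these strings to confirm the orientation before running cutadapt.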

If you know how many cores you have available on your system, you can put that number there to run this command in parallel. If you do not know, just ignore that parameter.
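In case it is useful, one way to look up the core count from the shell is sketched below. It assumes a Linux or macOS machine where `nproc` or `getconf` is available.

```shell
# Number of logical cores currently online.
# nproc is common on Linux; getconf is used as a fallback (for example on macOS).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "logical cores available: $cores"

# The value can then be passed along, for example: --p-cores "$cores"
```

You may want to use fewer than the total if the machine is shared or you are running other jobs at the same time.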

I hope that helps!

Fabs · July 18, 2018, 6:40pm

I see, okay I will leave that parameter blank as I am unsure.

In terms of the other 2 parameters, I just know that the 3' primers have not been removed, so then I would only use --p-adapter-r (reverse), correct?

Additionally, I am still trying to verify the reverse primer that I need to use. Per what I submitted to the company, I used ITS1F and ITS2, but I am not sure which values I need. I haven't gotten a response from the person who helped me, so I figured I'd ask you. Could you please take a look and maybe help me out?

Thanks a million, again, and sorry I am bombarding you with so many questions.

Nicholas_Bokulich · July 18, 2018, 6:54pm

Your earlier comment made it sound as though the adapters/primers have been removed from the 5' ends of both forward and reverse sequences. But there could be read-through in either read. So use --p-adapter-r AND --p-adapter-f to look for the other primer within each sequence (looking for ITS2 in your forward reads and ITS1f in your reverse reads).

You want the sequences in the "Gene Primer" column: just ITS1f and ITS2. Both may need to be reverse complemented to match the orientation in which they would be found in the opposite sequence reads. Manually inspect the first few sequences to be sure.

It actually looks like you can input multiple values for both --p-adapter-f and --p-adapter-r (cc: @thermokarst), so you can just put in both the primer sequences and their reverse complements to be sure that both are caught. So do something like the following:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-ends.qza \
  --p-adapter-f [put ITS2 sequence here] \
  --p-adapter-f [put RC ITS2 sequence here] \
  --p-adapter-r [put ITS1f sequence here] \
  --p-adapter-r [put RC ITS1f sequence here] \
  --p-error-rate 0 \
  --o-trimmed-sequences [put output file name here].qza \
  --verbose

Fabs · July 18, 2018, 7:06pm

Thank you for your quick response Nick.

I will try that now. Is there a way or a command I can run to verify that the primers were, in fact, removed?

To clarify on the first question, per the sequencing lab, I was told the following: "The primer and adapter sequences have been removed from the beginning of your reads, so the reads should start with exactly the first base downstream of the gene priming sequence. No further trimming from the 5' ends of the sequences should be necessary." So it's only the 3' end that I have to deal with.

Thanks again, I really appreciate all your help.

Nicholas_Bokulich · July 18, 2018, 7:34pm

Yes! Use qiime demux summarize; a summary of sequence length distributions appears on the second tab (titled "Interactive Quality Plot"; scroll to the bottom). Just run that on your sequences pre- and post-trimming.

Note that this distribution is sampled randomly from a subset of sequences, so may not reflect the total distribution and some variation will occur anyway — but it should be representative.

If you do not see any trimming, either you don't have read-through problems, or something went wrong.
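As a command-line cross-check of the same idea, you could tabulate read lengths directly from the FASTQ files before and after trimming. This is only a sketch: `length_histogram` is a made-up helper, the three reads are invented, and it assumes you have the per-sample FASTQ files exported or unzipped from the artifact.

```shell
# Hypothetical helper: print "<read length> <number of reads>" for a FASTQ stream.
length_histogram() {
  # Line 2 of every 4-line FASTQ record is the sequence.
  awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' | sort -n
}

# Tiny invented example: two 20 bp reads and one 12 bp read.
example_fastq='@r1
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@r2
TTTTGGGGCCCCAAAATTTT
+
IIIIIIIIIIIIIIIIIIII
@r3
ACGTACGTACGT
+
IIIIIIIIIIII'

hist=$(printf '%s\n' "$example_fastq" | length_histogram)
echo "$hist"

# On real data (file name is a placeholder):
# gzip -dc sample_R1.fastq.gz | length_histogram
```

If cutadapt removed read-through primers, the post-trimming table should show reads spread over shorter lengths rather than all sitting at the full read length.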

Fabs · July 18, 2018, 8:58pm

Perfect, thank you.

Last question, I imported my data, prior to realizing that the 3' end needed to be trimmed, with 'SampleData[PairedEndSequencesWithQuality]', is this still correct or do I need to use MultiplexedPairedEndBarcodeInSequence as your tutorial suggested?

Note: My samples are Illumina paired-end w quality scores.

Nicholas_Bokulich · July 18, 2018, 9:01pm

$ qiime cutadapt trim-paired --help
Usage: qiime cutadapt trim-paired [OPTIONS]

  Search demultiplexed paired-end sequences for adapters and remove them.
  The parameter descriptions in this method are adapted from the official
  cutadapt docs - please see those docs at https://cutadapt.readthedocs.io
  for complete details.

Options:
  --i-demultiplexed-sequences ARTIFACT PATH SampleData[PairedEndSequencesWithQuality]
                                  The paired-end sequences to be trimmed.
                                  [required]

Looks like you imported as the correct type. That tutorial shows demultiplexing then trimming with cutadapt; your data are already demultiplexed and only need to be trimmed, correct?