trimming/truncation of ITS sequences

ljiraska · April 26, 2019, 12:08pm

Hi!

Thanks @RCM for starting this thread and thanks @ebolyen for answering.

I'm new to QIIME 2 as well and I'm trying to figure out the correct parameters, so I also have a question about trim and trunc. I'm using Illumina tagged primers, as follows:

Forward overhang: 5’ TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG‐[locus-specific sequence]

Do I trim only the tag, or should I cut the whole primer (including the locus-specific sequence)?

And one question about trunc-len:
How much can I truncate for 16S, ITS and COI, given that I have 2x300 reads and that their quality is good up to roughly position 280?

And, should I cut primers when processing ITS sequences if I want to use UNITE to assign taxonomy? (I read somewhere on GitHub or here on the forum, sorry, I cannot find it again, that the UNITE classifier will not benefit from cutting the primers... not sure what to think about that.)
Or should I maybe use ITS-extractor?

So many options...

Thanks!!

Nicholas_Bokulich · April 26, 2019, 1:40pm

You want to trim all of that before denoising. You do not want PCR primers present in your reads during denoising or OTU clustering.

So if you are using dada2 or deblur, trim-left the full length.
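As a rough sketch of what "the full length" works out to, the snippet below adds the overhang quoted in the question to a locus-specific primer. The ITS1F sequence is only an illustrative assumption (the question does not say which primer was used), and the helper names are hypothetical. It also assumes the whole construct really is present at the start of the reads, which is worth checking on a few raw reads first.

```python
# Illumina forward overhang quoted in the question above.
FORWARD_OVERHANG = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"

# Illustrative assumption: ITS1F as the locus-specific part.
# Substitute the primer actually used in your library prep.
LOCUS_SPECIFIC_FWD = "CTTGGTCATTTAGAGGAAGTAA"


def trim_left_length(*segments: str) -> int:
    """Total number of leading bases covered by the given 5' segments."""
    return sum(len(segment) for segment in segments)


def starts_with_construct(read: str, *segments: str) -> bool:
    """Exact-match check that a read begins with the concatenated segments.

    Degenerate bases and sequencing errors are not handled here.
    """
    return read.upper().startswith("".join(segments).upper())


if __name__ == "__main__":
    n = trim_left_length(FORWARD_OVERHANG, LOCUS_SPECIFIC_FWD)
    print(f"trim-left value for forward reads: {n}")
```

The same calculation applies to the reverse reads with the reverse overhang and reverse primer. If `starts_with_construct` fails on most reads, the overhang is probably not in the reads at all and only the locus-specific part should be counted.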

If you are using qiime cutadapt trim-* you can trim using the full sequence (as long as there are no sample-specific barcodes in the middle) or use just the DNA-binding section of the PCR primer, since cutadapt will trim away the front adapter plus everything 5' of it, and the 3' adapter plus everything 3' of it. (I would advise using q2-cutadapt particularly for your ITS sequences, so that you can also trim the reverse primer out of your reads; search for read-through ITS issues on this forum to see more discussion.)
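For the read-through case, the reverse primer shows up at the 3' end of the forward read as its reverse complement (and vice versa). A minimal sketch for working out which sequences to hand to cutadapt is below. The function names and dictionary keys are hypothetical; the keys are meant to mirror the front/adapter options of `qiime cutadapt trim-paired`, so check `--help` for the exact option names in your version. The primers in the usage lines (ITS1F and ITS2) are only examples.

```python
# IUPAC-aware complement table (uppercase only; input is upper-cased first).
_COMPLEMENT = str.maketrans("ACGTURYSWKMBDHVN", "TGCAAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA primer, including IUPAC degenerate codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def cutadapt_sequences(fwd_primer: str, rev_primer: str) -> dict:
    """Sequences expected at each end of each read in a paired-end run.

    'front' sequences sit at the 5' end of a read; 'adapter' sequences are
    what appears at the 3' end when the read runs through a short amplicon.
    """
    return {
        "front_f": fwd_primer.upper(),
        "adapter_f": reverse_complement(rev_primer),
        "front_r": rev_primer.upper(),
        "adapter_r": reverse_complement(fwd_primer),
    }


if __name__ == "__main__":
    # Example primers only (ITS1F / ITS2); use your own.
    for name, seq in cutadapt_sequences(
        "CTTGGTCATTTAGAGGAAGTAA", "GCTGCGTTCTTCATCGATGC"
    ).items():
        print(name, seq)
```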

This is a bit of a trial-and-error process, especially for hypervariable-length amplicons like ITS. Truncate wherever seems reasonable based on your quality profile (search the forum archive for specific tips), and look at the output results to make sure you are getting enough joined ("merged") reads. Losing reads at the joining/merging step is bad; losing reads at the filtering step is okay (I am using the parlance found in the dada2 stats output here). Adjust the truncation parameters if you are losing too many reads at filtering (truncate more) or at merging (truncate less).
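The trade-off can be put into rough numbers. The sketch below assumes a single known amplicon length (which is exactly what ITS does not give you, so treat it as a per-length check), measured on the same footing as the truncation positions, i.e. including primers if they are still in the reads when you truncate. The 12 nt minimum overlap is the default of DADA2's `mergePairs`; the function names and the two fraction thresholds are hypothetical placeholders, not recommended values.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Bases shared by the truncated forward and reverse reads.

    A value of zero or less means the reads no longer meet in the middle.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f: int, trunc_len_r: int, amplicon_len: int,
              min_overlap: int = 12) -> bool:
    """True if the truncated reads still overlap by at least min_overlap."""
    return expected_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


def suggest_adjustment(n_input: int, n_filtered: int, n_merged: int,
                       min_filtered_frac: float = 0.5,
                       min_merged_frac: float = 0.8) -> str:
    """Map read counts from a denoising stats table to the advice above.

    The thresholds are hypothetical; pick values that suit your data.
    """
    if n_input <= 0 or n_filtered <= 0:
        raise ValueError("need positive input and filtered read counts")
    # Merging losses are the worse problem, so check them first.
    if n_merged / n_filtered < min_merged_frac:
        return "truncate less"
    if n_filtered / n_input < min_filtered_frac:
        return "truncate more"
    return "ok"


if __name__ == "__main__":
    print(expected_overlap(240, 200, 400))
    print(suggest_adjustment(n_input=10000, n_filtered=9000, n_merged=3000))
```

For variable-length amplicons it makes sense to run the overlap check against the longest amplicon you expect to keep, since that is the one most likely to fail to merge.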

Always cut primers from your own reads.

Wherever you read it, I probably wrote it. The UNITE database contains a mixture of different sequences amplified with different primers, and different domains: ITS1, ITS2, or full ITS. The standard database uses ITSx to extract only the ITS domain(s), with no primers. The developer database contains the full sequences as deposited. So any attempt to extract reads with your primers must be done on the developer database, and you may run into a few different problems. You will lose many reads because the primers do not match. This may be a good thing (e.g., you do not want ITS2-domain-only sequences if you are using ITS1-domain primers) or it may be a bad thing (e.g., your primers are not hitting true matches because they bind sites external to the reads deposited in the database). In my tests, I have found that training classifiers on the full UNITE database does better than training on extracted reads, but I am certain there is room for improvement.

Yes! This is another option. See the q2-itsxpress plugin (there is a tutorial for it on this forum). You can use that to extract the ITS domain from your own sequences. You can then use these with the UNITE standard sequences, though you should filter these to use only the ITS domain that matches your amplicons.

Good luck!

ljiraska · April 29, 2019, 12:55am

Thanks a lot for the detailed answer @Nicholas_Bokulich!

That helped me a lot!