ITS data in QIIME2

mycol · May 9, 2018, 3:25pm

I am hoping for some feedback on my current methods for processing fungal ITS data in QIIME2, as stated below. In particular, I am curious if my application of ITSx for removing non-fungal sequences is appropriate. Any other suggestions/discussion regarding the workflow would also be appreciated and useful to the community, I think. I'm hoping this post can also serve to compile some of the information I've found in other posts.

I am currently working with a paired-end Illumina (2 x 250) data set targeting the ITS2 region of fungi (ITS3-ITS4 primer set). My data were supplied by the sequencing facility as already-demultiplexed fastq files (one forward and one reverse per sample). Illumina adapters are not present at the 5' end of my forward or reverse reads (only the locus-specific PCR primer sequences remain), but reads from shorter ITS sequences do have 'read-through' and therefore carry PCR primer and Illumina adapter sequences at the 3' end.

Workflow:

1. Import data using qiime tools import
2. Remove PCR primer sequences from both ends of the forward and reverse reads using qiime cutadapt trim-paired. I have not been able to find a firm answer on whether this is necessary, but I remove the primer sequences because they may not reflect the true biological sequence: a primer can anneal to the target DNA with a mismatch or two. This also addresses read-through, since the 3' end of each read is truncated at the start of the reverse complement of the opposing PCR primer (if found), which removes the Illumina adapter and anything else downstream of it.
3. Denoise using qiime dada2 denoise-paired
4. Run ITSx on the representative sequences to generate a list of feature IDs associated with representative sequences identified as fungal ITS (this involves using awk to extract feature IDs from the ITSx output)
5. Filter the feature table against the list of fungal ITS feature IDs obtained from ITSx using qiime feature-table filter-features
6. Proceed with downstream analysis (assign taxonomy, diversity analysis) using the filtered feature table
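
The ID-extraction part of the ITSx step can also be sketched in Python instead of awk. This is only a minimal sketch under assumptions: it assumes ITSx FASTA headers start with the original feature ID followed by a `|` (e.g. `>featureID|F|ITS2 ...`), and that the ID list file for qiime feature-table filter-features only needs a `feature-id` header line followed by one ID per line. The function names and the example records are hypothetical; check the header layout in your own ITSx output first.

```python
from pathlib import Path


def itsx_feature_ids(fasta_text):
    """Return unique feature IDs from FASTA headers, in first-seen order.

    Assumption: each header looks like '>featureID|F|ITS2 free text',
    so the ID is everything before the first '|' or whitespace.
    """
    ids = []
    seen = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # skip empty headers
        feature_id = fields[0].split("|")[0]
        if feature_id and feature_id not in seen:
            seen.add(feature_id)
            ids.append(feature_id)
    return ids


def write_id_list(ids, path):
    """Write the IDs as a one-column, tab-separated-style file.

    Assumption: a 'feature-id' header is accepted as the ID column name.
    """
    lines = ["feature-id"] + list(ids)
    Path(path).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Made-up records standing in for an ITSx ITS2 FASTA output file.
    demo = (
        ">feat_a|F|ITS2 Extracted ITS2 sequence\nACGTACGT\n"
        ">feat_b|F|ITS2 Extracted ITS2 sequence\nGGCCTTAA\n"
    )
    for feature_id in itsx_feature_ids(demo):
        print(feature_id)
```

In practice the text of the ITSx FASTA output would be read from disk and passed to `itsx_feature_ids`, and the file written by `write_id_list` would be the list used for filtering.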

Additional thoughts/information:

I do lose about two-thirds of my reads through dada2. Others have suggested merging paired ends using PEAR prior to denoising in dada2; however, it is unclear whether this affects error handling or other features of dada2. It has also been suggested that the maxEE filtering parameter can be relaxed, which may help to retain some of my longer reads, where the end of the reverse read is likely to be of lower quality. I will try this in addition to pre-merging, but was wondering if anyone has additional thoughts/experiences beyond what is contained in the linked post.
I am using ITSx mainly as a tool to eliminate non-fungal sequences from my data set (although it has the additional function of trimming 5.8S head and LSU tail from the representative sequences). I have seen that qiime quality-control exclude-seqs may also be useful for excluding non-target DNA. However, in the case of fungal data, I am hesitant to use this approach because there may be novel fungi in my system that are not represented in the UNITE database and I want to avoid excluding them. Maybe the use of ITSx vs exclusion by similarity to exclude non-target DNA is a larger question here?
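
On the maxEE point above: the "expected errors" of a read are the sum of the per-base error probabilities implied by its quality scores, EE = sum(10^(-Q/10)), and a read is discarded when EE exceeds maxEE. The sketch below is only an illustration of that formula, not dada2's own code; it assumes Phred+33 quality encoding, and the function name and the example quality strings are made up.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities for one read.

    Assumption: qualities are ASCII-encoded with the given Phred offset
    (33 for current Illumina fastq files).
    """
    return sum(
        10 ** (-(ord(char) - phred_offset) / 10)
        for char in quality_string
    )


if __name__ == "__main__":
    # Hypothetical reads: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
    clean_read = "I" * 230
    noisy_tail = "I" * 200 + "+" * 30
    for name, qual in [("clean", clean_read), ("noisy tail", noisy_tail)]:
        print(name, round(expected_errors(qual), 3))
```

A short low-quality tail dominates the total, which is why a longer reverse read can fail a strict maxEE threshold even when most of the read is clean.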

SoilRotifer · May 11, 2018, 4:00pm

Hi @mycol,

Your overall strategy appears fine.

I assume that, to account for read-through into the opposing primer at the 3' end of the reads, you are feeding cutadapt the reverse complement of that opposing primer sequence? I ask because taking the reverse complement is often forgotten. If you are not, this may explain why so few reads are merged/returned via DADA2, as the opposing primer would still be contained within both the R1 and R2 reads. Also, make sure you've not trimmed too much off the ends of your reads. I've had problems where many of my ITS reads were too long to merge, regardless of whether the data were clean. I refer you to this paper to help you decide if you should use only the forward reads or attempt to merge them.
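
As a quick check on the reverse-complement point, here is a minimal Python sketch. The ITS3 and ITS4 sequences below are the commonly cited ones and are an assumption on my part; substitute the exact primers used for your library. The function name is hypothetical, and the translation table covers IUPAC degenerate bases in case your primers contain them.

```python
# Complement table for IUPAC nucleotide codes (e.g. A<->T, R<->Y, K<->M, B<->V, D<->H).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (IUPAC-aware)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Assumed primer sequences; replace with the ones actually used.
    its3 = "GCATCGATGAAGAACGCAGC"  # forward primer, 5' end of R1
    its4 = "TCCTCCGCTTATTGATATGC"  # reverse primer, 5' end of R2

    # With read-through, R1 ends in the reverse complement of ITS4,
    # and R2 ends in the reverse complement of ITS3.
    print("5' of R1:", its3)
    print("3' of R1:", reverse_complement(its4))
    print("5' of R2:", its4)
    print("3' of R2:", reverse_complement(its3))
```

To my knowledge, the 5' sequences correspond to the `--p-front-f` / `--p-front-r` options of qiime cutadapt trim-paired and the 3' sequences to `--p-adapter-f` / `--p-adapter-r`, but check the plugin help for your QIIME 2 version.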

One thing of note: a long while ago, one of the ITSx authors recommended that I try leaving the primer sequences in the data prior to running ITSx. In this particular case, that means you'd trim only the opposing primer that has been read through at the 3' end, and you'd leave the 5' primer in each of the reads. Then merge them.

The primer sequences make it easier for ITSx to find and extract the region of interest, potentially retaining more reads from ITSx. I am thinking it may be possible to merge the reads via vsearch after you trim the 3' primer sequences, and then run ITSx on the resulting FASTA output. Anyway, I only wanted to mention this as an added means of sanity-checking other steps should you need to. Otherwise, continue as you've done: trim all the primers, then use ITSx. Then use vsearch and deblur.

In short, I would suggest remaining within qiime for now (i.e. use the vsearch plugin to merge reads) and then making use of deblur. Then determine how many reads are retained or lost compared to DADA2.

-Hopefully this makes sense.
-Best