I am hoping for some feedback on my current methods for processing fungal ITS data in QIIME2, as stated below. In particular, I am curious if my application of ITSx for removing non-fungal sequences is appropriate. Any other suggestions/discussion regarding the workflow would also be appreciated and useful to the community, I think. I’m hoping this post can also serve to compile some of the information I’ve found in other posts.
I am currently working with a paired-end Illumina (2 x 250) data set targeting the ITS2 region of fungi (ITS3-ITS4 primer set). My data was supplied from the sequencing facility as already-demultiplexed fastq files (one forward and one reverse per sample). Illumina adapters are not present at the 5’ end of my forward or reverse reads (only locus specific PCR primer sequences remain), but reads for shorter ITS sequences do have ‘read-through’ and, therefore, PCR primer and Illumina adapter sequences at the 3’ end.
Import data using
qiime tools import
Remove PCR primer sequences from both ends of the forward and reverse reads using
qiime cutadapt trim-paired. I have not been able to find a firm answer on this, but I am removing the PCR primer sequences because they may not represent actual biological sequence: a primer can anneal to my target DNA with a mismatch or two. This also addresses read-through, as the 3’ end of each read is truncated at the beginning of the reverse complement of the opposite PCR primer (if found), thereby eliminating the subsequent Illumina adapter, etc.
Denoise and merge the paired reads using
qiime dada2 denoise-paired
Run ITSx on the representative sequences to generate a list of feature IDs associated with representative sequences identified as fungal ITS (this involves using awk to extract feature IDs from the ITSx output)
Filter feature table against the list of fungal ITS feature IDs obtained from ITSx using
qiime feature-table filter-features
Proceed with downstream analysis (assign taxonomy, diversity analysis) using filtered feature table
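For anyone wanting to reproduce the primer-trimming and ITSx-filtering steps above, here is a rough sketch of the two small helpers involved. It is an illustration under assumptions, not my exact commands: the ITS3/ITS4 sequences are the commonly cited ones (check them against your own primers), all file names are hypothetical, and the ID extraction assumes the ITSx FASTA headers start with the original feature ID, optionally followed by `|` or whitespace.

```shell
# Commonly cited primer sequences (assumption: verify against your own primers).
ITS3="GCATCGATGAAGAACGCAGC"
ITS4="TCCTCCGCTTATTGATATGC"

# Reverse-complement a DNA sequence (upper-case IUPAC codes; S, W, N map to themselves).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD'
}

# Trim primers from both ends of both reads. Defined here but not called.
# $1: imported demultiplexed artifact, $2: output artifact (hypothetical names).
trim_primers() {
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$1" \
    --p-front-f "$ITS3" \
    --p-adapter-f "$(revcomp "$ITS4")" \
    --p-front-r "$ITS4" \
    --p-adapter-r "$(revcomp "$ITS3")" \
    --o-trimmed-sequences "$2"
}

# Turn a FASTA file into a one-column ID list that QIIME 2 accepts as metadata:
# a "feature-id" header, then the text after ">" up to the first space, tab or "|".
extract_ids() {
  awk 'BEGIN { print "feature-id" }
       /^>/  { id = substr($0, 2); sub(/[ \t|].*$/, "", id); print id }' "$1"
}

# Possible usage (file names and ITSx options are assumptions):
#   trim_primers demux.qza trimmed.qza
#   qiime tools export --input-path rep-seqs.qza --output-path exported
#   ITSx -i exported/dna-sequences.fasta -o itsx_out -t F --preserve T
#   extract_ids itsx_out.ITS2.fasta > fungal_ids.tsv
#   qiime feature-table filter-features \
#     --i-table table.qza \
#     --m-metadata-file fungal_ids.tsv \
#     --o-filtered-table table-fungal.qza
```

On the forward read the 5’ primer is ITS3 and the 3’ read-through primer is the reverse complement of ITS4; the reverse read is the mirror image, which is why each primer appears once as given and once reverse-complemented.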
I do lose about two-thirds of my reads through dada2. Others have suggested merging paired ends using PEAR prior to denoising in dada2; however, it is uncertain whether this affects error handling or any other features in dada2. It has also been suggested that the maxEE filtering parameter can be relaxed, which may help to retain some of my longer reads where the end of the reverse read is likely to be of lower quality. I will try this in addition to pre-merging, but was wondering if anyone has additional thoughts/experiences beyond what is contained in the linked post.
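As a sketch of what relaxing maxEE might look like, the wrapper below passes separate expected-error thresholds for the forward and reverse reads. The flag names are those used by recent q2-dada2 releases as far as I know (check `qiime dada2 denoise-paired --help` for your version), the file names are hypothetical, and the example values are illustrative rather than recommendations.

```shell
# Run DADA2 on primer-trimmed paired reads with adjustable maxEE thresholds.
# $1: primer-trimmed demultiplexed artifact (hypothetical name)
# $2: maxEE for forward reads, $3: maxEE for reverse reads
denoise_relaxed() {
  # A truncation length of 0 disables truncation, which is a common choice for
  # ITS data because amplicon length varies widely.
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "$1" \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --p-max-ee-f "$2" \
    --p-max-ee-r "$3" \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats denoising-stats.qza
}

# Example call (illustrative values only):
#   denoise_relaxed trimmed.qza 2 5
```

The denoising stats artifact reports how many reads survive filtering, denoising, merging and chimera removal, so comparing it between runs should show whether the loss is happening at the quality filter or at the merging step.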
I am using ITSx mainly as a tool to eliminate non-fungal sequences from my data set (although it has the additional function of trimming 5.8S head and LSU tail from the representative sequences). I have seen that
qiime quality-control exclude-seqs may also be useful for excluding non-target DNA. However, in the case of fungal data, I am hesitant to use this approach because there may be novel fungi in my system that are not represented in the UNITE database, and I want to avoid excluding them. Maybe the use of ITSx vs. exclusion by similarity to remove non-target DNA is a larger question here?