Hello,
I'm working with fungal ITS sequencing for the first time, and I notice that there is no general consensus on which workflow to use for short-read data such as Illumina sequencing. There seem to be open questions especially about denoising and clustering (e.g. a recent post about ASV vs OTU for fungal ITS).
In 2018 an ITS tutorial was posted by the QIIME team, but as clustering or denoising challenges are not discussed there, I think it should be updated.
My current workflow
In summary, I currently use the following workflow:
- Import raw read data
- Use itsxpress to extract the ITS region from the reads
- Use dada2 to denoise (without truncation) and merge the reads
- Cluster at 97% identity using vsearch
- Use the pre-trained UNITE QIIME release of @colinbrislawn for taxonomic identification
Below you can find the exact code that I use
QIIME2 workflow code
#extract the ITS region from the raw reads, no clustering
qiime itsxpress trim-pair-output-unmerged \
--i-per-sample-sequences raw_reads.qza \
--p-region ITS2 \
--o-trimmed itstrimmed_reads.qza
#denoise using dada2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs itstrimmed_reads.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--o-representative-sequences rep-seqs-dada2.qza \
--o-table table-dada2.qza \
--o-denoising-stats stats-dada2.qza
#cluster the reads on 97%
qiime vsearch cluster-features-de-novo \
--i-sequences rep-seqs-dada2.qza \
--i-table table-dada2.qza \
--p-perc-identity 0.97 \
--o-clustered-table table-dada2-0_97_clust.qza \
--o-clustered-sequences rep-seqs-dada2-0_97_clust.qza
#add taxonomy to the clustered feature sequences
qiime feature-classifier classify-sklearn \
--i-classifier UNITE_pretrained_colinbrislawn/unite_ver10_99_all_04.04.2024-Q2-2024.2.qza \
--i-reads rep-seqs-dada2-0_97_clust.qza \
--o-classification taxonomy.qza
I'd like to go over each of the steps and hope to receive input/your thoughts about them. I use bullets to indicate my questions/points of discussion.
ITS read extraction
As mentioned by Tedersoo et al., 2022, it seems important to extract only the ITS region from the reads, thereby removing the flanking regions.
I currently extract the ITS region from the raw reads, but e.g. Nguyen et al., 2024 perform this step after dada2 denoising, although an earlier pipeline from this group does the trimming before dada2 denoising.
The reason I trim before dada2 is that itsxpress only has the options trim-pair for use with Deblur and trim-pair-output-unmerged for use with dada2. It is not possible to input reads that have been merged and denoised with dada2 into itsxpress. Therefore, it is necessary to do ITS extraction before dada2.
- ITS extraction before denoising appears to work well, as I obtain a higher proportion of filtered/merged reads from dada2. Do you agree that this makes sense?
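One way to put a number on that comparison is to export the DADA2 denoising statistics and compute the overall share of input reads that survive merging, once for each variant of the workflow. The sketch below assumes the stats artifact was exported with qiime tools export, which writes a stats.tsv, and that column 2 holds the input read count and column 6 the merged read count, as in the denoise-paired output I am aware of. The function name overall_merged_pct and the directory name exported-stats are hypothetical; please check the column positions against the header of your own file.

```shell
# Export first (output directory name is an assumption):
#   qiime tools export --input-path stats-dada2.qza --output-path exported-stats

# overall_merged_pct FILE
# Print the percentage of all input reads that were merged, summed over samples.
# Assumes a tab-separated file with column 2 = input and column 6 = merged.
overall_merged_pct() {
  awk -F'\t' '
    NR == 1 || /^#/ { next }              # skip the header and the "#q2:types" line
    { input += $2; merged += $6 }         # accumulate read counts over all samples
    END { if (input > 0) printf "%.1f\n", 100 * merged / input }
  ' "$1"
}

# Usage:
#   overall_merged_pct exported-stats/stats.tsv
```

Running this on the stats from a run with ITS extraction before dada2 and on a run without it gives two percentages that can be compared directly.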
Denoising
Tedersoo et al., 2022 state that Deblur cannot be used for denoising ITS reads due to the variability in read length. dada2 can be used to remove low-quality reads, chimeras, etc. and to merge the reads, but it is possibly wise to cluster thereafter (see next section).
- An alternative option would be to use various standalone programs to filter on quality, merge reads, and detect chimeras. Would this be preferred over dada2 for fungal ITS reads?
Clustering: ASV/ESV vs OTU
In a great earlier clustering discussion, the general conclusion was to use ASV/cluster at 100% in the context of 16S/18S sequences. This is also what I always do for 16S and 18S sequences. However, we are now talking about ITS.
Because the ITS region is so variable, clustering is nowadays often applied at 97%, as seen in recent papers, e.g.:
- https://enviromicro-journals.onlinelibrary.wiley.com/doi/10.1111/1758-2229.12776
- Weather in two climatic regions shapes the diversity and drives the structure of fungal endophytic community of bilberry (Vaccinium myrtillus L.) fruit - PMC
and also Tedersoo et al., 2022 state
the ESV approaches are certainly useful for separating as many species/haplotypes as possible based on conserved genes, but their utility for ITS and protein-coding genes is unclear (Antich et al., 2021). They may outperform traditional OTU clustering approaches in distinguishing very closely related species of Ascomycota with haploid genomes. However, an ESV approach severely biased species richness estimates of metazoans based on the cytochrome oxidase 1 (CO1) gene (Antich et al., 2021; Brandt et al., 2021), and it is expected to perform poorly for fungal groups with dikaryotic (Basidiomycota), diploid (most unicellular groups) or polyploid (Glomeromycota) genomes that commonly exhibit two or multiple different rRNA gene and ITS copies per genome or even within haploid nuclei
and based on their analyses, they also concluded:
the ESV approaches recover lower proportions of non-Dikarya and nonfungal taxa compared with traditional approaches;
Thus, I think the general consensus among mycologists is still to cluster for this region, probably at 97%. However, not everyone agrees with this; see e.g. ASV vs OTU for fungal ITS.
- I now perform clustering with vsearch after dada2 denoising, and this makes sense, right? In the end, dada2 produces filtered/merged exact reads.
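To see how strongly the 97% clustering collapses the dada2 output, the number of representative sequences before and after clustering can be compared. This is a minimal sketch, assuming both artifacts were exported with qiime tools export, which writes a dna-sequences.fasta for representative sequences. The helper name count_seqs and the output directory names are hypothetical.

```shell
# Export first (output directory names are assumptions):
#   qiime tools export --input-path rep-seqs-dada2.qza --output-path asv-seqs
#   qiime tools export --input-path rep-seqs-dada2-0_97_clust.qza --output-path otu-seqs

# count_seqs FASTA
# Print the number of sequences in a FASTA file by counting header lines.
count_seqs() {
  grep -c '^>' "$1"
}

# Usage:
#   echo "ASVs: $(count_seqs asv-seqs/dna-sequences.fasta)"
#   echo "OTUs: $(count_seqs otu-seqs/dna-sequences.fasta)"
```

A large drop from ASVs to OTUs would suggest that many exact variants fall within 97% identity of one another, which is the within-species or within-genome ITS variation discussed above.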
Taxonomic identification
The UNITE database seems to be the gold standard for ITS identification, and @colinbrislawn has kindly provided pre-trained UNITE QIIME2 releases. However, there are several UNITE releases (97%, 99%, dynamic clustering) and I'm not completely sure which to use.
- If you cluster your reads at 97%, would it also be wise to use the 97% clustered UNITE database, or would it also be OK to use the 99% clustered database? On the pre-trained QIIME UNITE databases page of @colinbrislawn it is stated that the use of the 97% database is not recommended.
- My initial thought is that it is perfectly fine to use the 99% clustered database with reads clustered at 97%, as it is about comparing sequence alignments and not about clustering raw reads together. What do you think?
I know there is some overlap with the earlier topic ASV vs OTU for fungal ITS, but I hope this post provides additional insights and sparks a discussion regarding the optimal workflow for ITS read processing.
Thanks!
PS: I'm on vacation next week, so I'll respond when I'm back. Please do not close the topic in the meantime.