fungal ITS analysis with QIIME2: current recommended workflow?

Rob_DNA · July 12, 2024, 9:10am

Hello,

I'm working with fungal ITS sequencing for the first time, and I notice that there is no general consensus on which workflow to use for short-read sequencing such as Illumina. Denoising and clustering in particular seem to raise questions (e.g. a recent post about ASV vs OTU for fungal ITS).

In 2018 an ITS tutorial was posted by the QIIME team, but as clustering or denoising challenges are not discussed there, I think it should be updated.

My current workflow
In summary, I currently use the following workflow:

Import raw read data
use itsxpress to extract the ITS region from the reads
use dada2 to denoise (without truncation) and merge the reads
cluster on 97% identity using vsearch
use the pre-trained UNITE QIIME release of @colinbrislawn for taxonomic identification

Below you can find the exact code that I use

QIIME2 workflow code

#extract the ITS region from the raw reads, no clustering at this step
qiime itsxpress trim-pair-output-unmerged \
--i-per-sample-sequences raw_reads.qza \
--p-region ITS2 \
--o-trimmed itstrimmed_reads.qza

#denoise using dada2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs itstrimmed_reads.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--o-representative-sequences rep-seqs-dada2.qza \
--o-table table-dada2.qza \
--o-denoising-stats stats-dada2.qza


#cluster the denoised sequences at 97% identity
qiime vsearch cluster-features-de-novo \
--i-sequences rep-seqs-dada2.qza \
--i-table table-dada2.qza  \
--p-perc-identity 0.97 \
--o-clustered-table table-dada2-0_97_clust.qza \
--o-clustered-sequences rep-seqs-dada2-0_97_clust.qza

#add taxonomy to the clustered feature sequences
qiime feature-classifier classify-sklearn \
  --i-classifier UNITE_pretrained_colinbrislawn/unite_ver10_99_all_04.04.2024-Q2-2024.2.qza \
  --i-reads rep-seqs-dada2-0_97_clust.qza \
  --o-classification taxonomy.qza

I'd like to go over each of the steps and hope to receive input/your thoughts about them. I use bullets to indicate my questions/points of discussion.

ITS read extraction
As mentioned by Tedersoo et al., 2022 it seems important to extract only the ITS region from the reads, thereby removing the flanking regions.

I currently extract ITS regions from the raw reads, but e.g. Nguyen et al., 2024 perform this step after dada2 denoising, although an earlier pipeline from this group does the trimming before dada2 denoising.

The reason I trim before dada2 is that itsxpress only has the options trim-pair for use with Deblur and trim-pair-output-unmerged for use with dada2. It is not possible to input reads that are merged and denoised with dada2 into itsxpress. Therefore it is necessary to do ITS extraction before dada2.

ITS extraction before denoising appears to work well, as I obtain a higher proportion of filtered/merged reads from dada2. Do you agree that this makes sense?
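To put a number on "a higher proportion of filtered/merged reads", the DADA2 stats can be exported (e.g. `qiime tools export --input-path stats-dada2.qza --output-path exported-stats`) and pooled over all samples. The sketch below is only an illustration: the function name `dada2_retention` is hypothetical, and it assumes the exported file is tab-separated with a header row containing the columns `input`, `filtered`, `merged` and `non-chimeric`, followed by a `#q2:types` row.

```shell
# Pooled read retention per DADA2 step, as a percentage of input reads.
# Assumption: tab-separated stats file whose header contains the columns
# "input", "filtered", "merged" and "non-chimeric".
dada2_retention() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map names to columns
    /^#/    { next }                                         # skip the "#q2:types" row
    {
      input    += $(col["input"])
      filtered += $(col["filtered"])
      merged   += $(col["merged"])
      nonchim  += $(col["non-chimeric"])
    }
    END {
      if (input == 0) { print "no reads found"; exit 1 }
      printf "filtered\t%.1f%%\n",     100 * filtered / input
      printf "merged\t%.1f%%\n",       100 * merged   / input
      printf "non-chimeric\t%.1f%%\n", 100 * nonchim  / input
    }' "$1"
}

# Example call (path is an assumption):
# dada2_retention exported-stats/stats.tsv
```

Running this once on a run with ITS extraction before dada2 and once on a run without it would give two directly comparable sets of percentages.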

Denoising
Tedersoo et al., 2022 state that Deblur cannot be used for denoising ITS reads due to the variability in read length. dada2 can be used to remove low-quality reads, chimeras, etc. and merge the reads, but it is possibly wise to cluster thereafter (see next section).

An alternative option would be to use various standalone programs to filter on quality, merge reads, and detect chimeras. Would this be preferred over dada2 for fungal ITS reads?

Clustering: ASV/ESV vs OTU
In a great earlier clustering discussion, the general conclusion was to use ASV/cluster at 100% in the context of 16S/18S sequences. This is also what I always do for 16S and 18S sequences. However, we are now talking about ITS.

Due to its variability, clustering is currently often applied at 97%, as seen in recent papers, e.g.:

and also Tedersoo et al., 2022 state

the ESV approaches are certainly useful for separating as many species/haplotypes as possible based on conserved genes, but their utility for ITS and protein-coding genes is unclear (Antich et al., 2021). They may outperform traditional OTU clustering approaches in distinguishing very closely related species of Ascomycota with haploid genomes. However, an ESV approach severely biased species richness estimates of metazoans based on the cytochrome oxidase 1 (CO1) gene (Antich et al., 2021; Brandt et al., 2021), and it is expected to perform poorly for fungal groups with dikaryotic (Basidiomycota), diploid (most unicellular groups) or polyploid (Glomeromycota) genomes that commonly exhibit two or multiple different rRNA gene and ITS copies per genome or even within haploid nuclei

and based on their analyses, they also concluded:

the ESV approaches recover lower proportions of non-Dikarya and nonfungal taxa compared with traditional approaches;

Thus, I think the general consensus among mycologists is still to cluster for this region, probably at 97%. However, not everyone agrees with this, see e.g. ASV vs OTU for fungal ITS.

I now perform clustering with vsearch after dada2 denoising, and this makes sense, right? In the end, dada2 produces filtered/merged exact sequences.
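A quick way to see how much the 97% clustering step collapses the dada2 output is to export both sets of representative sequences (e.g. with `qiime tools export`, which writes a FASTA file for sequence artifacts) and count the records. This is only a rough sketch: the function names `count_seqs` and `clustering_reduction` and the example paths are hypothetical.

```shell
# Count FASTA records (header lines starting with ">").
count_seqs() {
  grep -c '^>' "$1"
}

# Compare the number of ASVs (before clustering) with the number of OTUs
# (after clustering) and report the average number of ASVs per OTU.
clustering_reduction() {
  asvs=$(count_seqs "$1")
  otus=$(count_seqs "$2")
  awk -v a="$asvs" -v o="$otus" 'BEGIN {
    if (a == 0 || o == 0) { print "empty FASTA file"; exit 1 }
    printf "ASVs: %d\nOTUs: %d\nASVs per OTU: %.2f\n", a, o, a / o
  }'
}

# Example call (paths are assumptions):
# clustering_reduction exported-asv/dna-sequences.fasta exported-otu/dna-sequences.fasta
```

A ratio close to 1 would suggest that clustering changes little for that dataset, while a high ratio means many ASVs are being folded into each OTU.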

Taxonomic identification
The UNITE database seems to be the gold standard for ITS identification and @colinbrislawn has kindly provided pre-trained UNITE QIIME2 releases. However, there are several UNITE releases (97%, 99%, dynamic clustering) and I'm not completely sure which to use.

If you cluster your reads at 97%, would it also be wise to use the 97% cluster UNITE database or would it also be OK to use the 99% clustered database? On the pre-trained QIIME unite databases page of @colinbrislawn it is stated that the use of the 97% database is not recommended.
My initial thought is that it is perfectly fine to use the 99% clustered database with 97% OTUs, as classification is about comparing sequences against a reference and not about clustering raw reads together. What do you think?
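One practical way to compare the 97% and 99% classifiers would be to classify the same sequences with each, export both results (e.g. `qiime tools export --input-path taxonomy.qza --output-path exported-taxonomy`) and check how many features receive a species-level label. The sketch below is an illustration under assumptions: the function name `species_level_fraction` is hypothetical, the file is assumed to be tab-separated with a header row and the taxon string in the second column, and labels are assumed to use UNITE-style rank prefixes (`s__` for species), with `s__unidentified` counted as "no species".

```shell
# Fraction of features with a species-level label in a taxonomy table.
# Assumptions: tab-separated file, header on line 1, taxon string in column 2,
# rank prefixes such as "g__" and "s__" separated by ";".
species_level_fraction() {
  awk -F'\t' '
    NR == 1 || /^#/ { next }              # skip header and any directive row
    {
      total++
      n = split($2, ranks, ";")
      for (i = 1; i <= n; i++) {
        r = ranks[i]
        gsub(/^ +| +$/, "", r)            # tolerate "; " as separator
        if (r ~ /^s__/ && r != "s__" && r != "s__unidentified") {
          species++
          break
        }
      }
    }
    END {
      if (total == 0) { print "no features found"; exit 1 }
      printf "%d of %d features (%.1f%%) have a species label\n",
             species, total, 100 * species / total
    }' "$1"
}

# Example call (path is an assumption):
# species_level_fraction exported-taxonomy/taxonomy.tsv
```

This only measures how deep the assignments go, not whether they are correct, so it complements rather than replaces a benchmark against known sequences.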

I know there is some overlap with the earlier topic ASV vs OTU for fungal ITS but I hope this post provides additional insights and sparks a discussion regarding the optimal workflow for ITS read processing.

Thanks!

PS: I'm on vacation next week, so I'll respond when I'm back. Please do not close the topic in the meantime.

salias · July 12, 2024, 10:42am

Hello @Rob_DNA ,

There is a lot to unpack here. I'm not an expert, but my first steps with metataxonomy are precisely with ITS sequences. So I will share my thoughts on the points on which I feel "confident" enough to discuss them.

Agreed. Since ITS sequences vary notably in length (they are indel-rich regions) I use ITSxpress for two purposes:

The original purpose described in the tutorial:

My other purpose: dynamic quality filtering. I don't want to truncate sequences to a fixed length because of the nature of ITS sequences. So I take advantage of the fact that ITSxpress is trimming sequence sections not belonging to ITS, which turn out to be sections with low quality. Consequently, DADA2 works much better without me having to specify further truncation or trimming parameters. I don't know if this is the best way to do things, but at least it's the way that works best for me.
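To see how strongly the ITS lengths actually vary in a given dataset (and hence why a fixed truncation length is awkward), the representative sequences can be exported to FASTA and summarised. This is a minimal sketch: the function name `fasta_length_summary` and the example path are hypothetical, and multi-line FASTA records are handled by summing the lines of each record.

```shell
# Report number of sequences plus min, max and mean length for a FASTA file.
fasta_length_summary() {
  awk '
    /^>/ {                                # new record: store the previous length
      if (n > 0) len[n] = cur
      n++
      cur = 0
      next
    }
    { gsub(/[ \t\r]/, ""); cur += length($0) }   # sequence line (may be wrapped)
    END {
      if (n == 0) { print "no sequences found"; exit 1 }
      len[n] = cur                        # store the last record
      min = max = len[1]
      for (i = 1; i <= n; i++) {
        sum += len[i]
        if (len[i] < min) min = len[i]
        if (len[i] > max) max = len[i]
      }
      printf "n=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
    }' "$1"
}

# Example call (path is an assumption):
# fasta_length_summary exported-rep-seqs/dna-sequences.fasta
```

A wide gap between `min` and `max` is what one would expect for an indel-rich region like ITS, in contrast to the near-constant lengths typical of 16S V4.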

I am an advocate of the idea that for ITS sequences we should use ASVs without further clustering. On this note, I really like Colin's answer in the post you cite.

Although I also use Colin's pre-trained classifiers, I'm afraid I cannot help here since I assign taxonomy to unclustered ASVs. However, just in case you want to know, I'm currently using 99%, all eukaryotes, version without "s"¹ database.

--

¹ Even after reading the release descriptions on the UNITE webpage, I'm not sure what the difference is between the versions with and without "s".

Rob_DNA · July 12, 2024, 11:27am

Hi @salias,

thank you for the contribution!

Exactly, that is very convenient. That is also the reason that I do not truncate with dada2.

I indeed read the answer of Colin and can follow the line of reasoning. If I interpret correctly what he means, it boils down to: ASV are just the sequences you find, and you can identify these.

However, a lot of fungal experts do the clustering, and e.g. Tedersoo et al. 2022 also show a significantly lower fungal richness when using ASVs as compared to clustering. This actually surprises me: I'd expect a higher richness using ASVs (=100% clustering), as you get more separate sequences.

Yeah, it can be quite tricky to understand what it exactly means and, moreover, when you should choose which. Perhaps somebody can elaborate on this? I see that @colinbrislawn also says

Includes global and 97% singletons. (I'm not sure what that means)

colinbrislawn · July 12, 2024, 2:39pm

Thank you for continuing this discussion. I'm glad folks are thinking about this deeply as we have been since 2019.

I didn't make the 97% databases for a while, as I don't think these should be used for taxonomy.
But folks asked for them

Let's add this: much of amplicon analysis is based on bacteria, Illumina, and the 16S gene.
Like, four tutorials on this page use the 16S V4 region.
Assumptions lead to mistakes.

Variable-length genes break all sorts of assumptions. @Robert_Edgar wrote that up too:
https://drive5.com/usearch/manual/global_trimming_and_abundance.html

Nicholas_Bokulich · July 12, 2024, 2:52pm

We have compared performance of these clustering thresholds here:

Notably, 97% is worse for classification accuracy than 99%, so what does that imply for OTU clustering after denoising? The dynamic clustering used by UNITE is somewhere between these, reflecting the fact that taxonomic naming of species is neither of these, which gets back to the points made by @colinbrislawn and others on that other general discussion thread, that species hypotheses, OTU clustering, and denoising approaches should not all be conflated. If you wish to perform OTU clustering to use OTUs as a feature for measurement, that is fine, and this probably makes most sense for more accurate alpha diversity measurements. If you wish to evaluate sub-species level variation and are more interested in beta diversity, then OTU clustering will reduce your sensitivity.

Rob_DNA · July 22, 2024, 12:00pm

I agree @Nicholas_Bokulich, that's why I made separate sections of "denoising" and "clustering".

Good question and I understand your point.

This is an interesting point. When writing a paper, one could also use both clustered OTUs and ASVs/ESVs for separate analyses. As long as you explain why you do this, it could be a valid option.

Rob_DNA · July 22, 2024, 12:07pm

This is indeed a bit striking, and I think it would be a good addition if the QIIME2 team also added 'verified' tutorials on other markers like ITS. (There is one from 2018 on the QIIME2 forum, which is greatly appreciated, but I feel it could be updated.)
