Fungal ITS analysis tutorial

Fungal ITS analysis, mock communities, and more fun

NOTE: This tutorial was written in a QIIME 2 2018.11 environment. It is not guaranteed to work with earlier or later versions of QIIME 2. This tutorial was compiled as a working exercise for a QIIME 2 workshop in December 2018, and does not represent the only possible fungal ITS workflow with QIIME 2, or even a benchmarked protocol recommendation. See other tutorials, e.g., the q2-itsxpress tutorial for other fungal ITS analysis options in QIIME 2.

This tutorial details different modalities of analysis for fungal internal transcribed spacer (ITS) data with QIIME 2. Some of these concepts may generalize to other non-16S rRNA marker genes, such as:

  1. issues with sequence length heterogeneity
  2. fitting your own classifier using a custom database (must be in a QIIME 2-compatible format!)

Additionally, this tutorial will give a nice overview of using QIIME 2 and mock communities (e.g., from mockrobiota) to benchmark analysis pipelines. These concepts are useful for any type of marker gene, be it ITS, 16S rRNA, or anything else under the sun. :sunshine:

Let's begin.

mkdir fungi-tutorial
cd fungi-tutorial

Fit a classifier for the UNITE database

wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip

We need to scrub lowercase characters from the sequences to avoid errors downstream. This is because this version of the UNITE database contains some lowercase characters — this is not something that needs to be done to all databases prior to importing to QIIME 2. Some sequences also have pesky blank spaces at the ends of the lines, so we will remove blank spaces from these lines (credit to @angrybee for spotting this and adding the sed command below to fix this):

awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver7_99_01.12.2017_dev.fasta | tr -d ' ' > developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta
qiime tools import \
 --type FeatureData[Sequence] \
 --input-path developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta \
 --output-path unite-ver7-99-seqs-01.12.2017.qza
qiime tools import \
 --type FeatureData[Taxonomy] \
 --input-path developer/sh_taxonomy_qiime_ver7_99_01.12.2017_dev.txt \
 --output-path unite-ver7-99-tax-01.12.2017.qza \
 --input-format HeaderlessTSVTaxonomyFormat

Finally, let's just train the classifier. This will take some time! So we can just download one that I've made for you below :wink:

qiime feature-classifier fit-classifier-naive-bayes \
 --i-reference-reads unite-ver7-99-seqs-01.12.2017.qza \
 --i-reference-taxonomy unite-ver7-99-tax-01.12.2017.qza \
 --o-classifier unite-ver7-99-classifier-01.12.2017.qza

As promised, you can just download this instead:

wget https://s3-us-west-2.amazonaws.com/qiime2-data/community-contributions-data-resources/2018.11-unite-classifier/unite-ver7-99-classifier-01.12.2017.qza

Download an ITS mock community from mockrobiota

We will use a fungal ITS mock community published by Taylor et al., 2016. First download the sequence data and metadata from mockrobiota:

wget -O "mock-25-sample-metadata.tsv" https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-25/sample-metadata.tsv
wget https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-25/mock-forward-read.fastq.gz
wget https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-25/mock-reverse-read.fastq.gz

Create a fastq manifest file for importing these data into QIIME 2, fastqmanifest.csv:

echo "sample-id,absolute-filepath,direction" > fastqmanifest.csv
echo "Mock.1,$PWD/mock-forward-read.fastq.gz,forward" >> fastqmanifest.csv
echo "Mock.1,$PWD/mock-reverse-read.fastq.gz,reverse" >> fastqmanifest.csv
qiime tools import \
 --type SampleData[PairedEndSequencesWithQuality] \
 --input-path fastqmanifest.csv \
 --output-path demux.qza \
 --input-format PairedEndFastqManifestPhred33
qiime demux summarize \
 --i-data demux.qza \
 --o-visualization demux.qzv

To trim or not to trim

One issue with ITS (and other marker genes with vast length variability) is readthrough, which occurs when read lengths are longer than the amplicon itself! The polymerase will read through the amplicon, the primer, the barcode, and on into the adapter sequence. This is non-biological DNA that will cause major issues downstream, e.g., with sequence classification. So we want to trim primers from either end of the sequence to eliminate read-through issues. Enter cutadapt. Note that we trim the forward primer and the reverse complement of the reverse primer from the forward reads (the forward primers have already been trimmed in the raw reads, but we will demonstrate forward + reverse trimming here since attempting to trim the forward read will not hurt). We trim the reverse primer and reverse complement of the forward primer from the reverse reads.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AYTTAAGCATATCAATAAGCGGAGGCT \
  --p-front-f AACTTTYRRCAAYGGATCWCT \
  --p-adapter-r AGWGATCCRTTGYYRAAAGTT \
  --p-front-r AGCCTCCGCTTATTGATATGCTTAART \
  --o-trimmed-sequences demux-trimmed.qza
qiime demux summarize \
 --i-data demux-trimmed.qza \
 --o-visualization demux-trimmed.qzv

Compare the demux-trimmed.qzv summary to the demux.qzv summary that we made before trimming. It doesn't look like the sequences were trimmed, does it? According to Taylor et al. 2016 (the source of the primers and mock community used here), the predicted amplicon length range of the 5.8S-Fun + ITS4-Fun primer set is 267-511 bp (see Figure 1). So with the read lengths we are using here (250 nt) we should not see any read-through.

Denoising

Even though we have paired-end reads, the read quality in the mock-25 dataset is not good enough to support merging the expected amplicon length (See the "interactive quality plots" tab in demux.qzv to visualize how quality drops off in both forward and reverse reads.) We can use qiime dada2 denoise-paired to confirm that we lose the majority of our reads at the merging stage (I have already confirmed this). The best solution in this case is to proceed with denoise-single (using only the forward reads).

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left 13 \
  --p-trunc-len 160 \
  --o-representative-sequences dada2-single-end-rep-seqs.qza \
  --o-table dada2-single-end-table.qza \
  --o-denoising-stats dada2-single-end-stats.qza
qiime metadata tabulate \
  --m-input-file dada2-single-end-stats.qza \
  --o-visualization dada2-single-end-stats.qzv

In your own data you should follow this process to determine whether you have enough sequence length to proceed with pairend-end sequences, or if you must proceed with single-end reads due to insufficient read overlap (as we have done above):

  1. Determine the expected amplicon length for your primers
  2. Check the amplicon length and assess the quality of your reads (use the demux.qzv or demux-trimmed.qzv summaries)
  3. Are your (trimmed) reads long enough to overlap? dada2 expects a minimum overlap of 12 nt, so your reads are long enough if: length(forward read) + length(reverse read) ≥ 12 nt.
  4. If your answer to #3 is "definitely not!", proceed with single-end reads as we have shown here. If your answer to #3 is "maybe?", give paired-end denoising a shot but pay very close attention to how many reads are being lost after merging (check out the stats file that is output by dada2).

Taxonomic classification

Now we will use the taxonomy classifier that we trained above to predict the taxonomic assignment for each of our denoised sequence variants.

qiime feature-classifier classify-sklearn \
  --i-classifier unite-ver7-99-classifier-01.12.2017.qza \
  --i-reads dada2-single-end-rep-seqs.qza \
  --o-classification taxonomy-single-end.qza
qiime taxa barplot \
  --i-table dada2-single-end-table.qza \
  --i-taxonomy taxonomy-single-end.qza \
  --m-metadata-file mock-25-sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

taxa-bar-plots.qzv (328.4 KB)

Other analyses

Now that you have a feature table and taxonomy classifications, you can typically proceed with analysis of fungal ITS (and other marker-gene data) in the same way that you would 16S rRNA gene data. With one major exception. Although this marker provides more accurate species identification than other fungal marker genes, alignments of ITS sequences are not useful for informing evolutionary distances among distantly related species (but can yield good alignments for closely related sequences). So you have two options for diversity and other potentially phylogenetic analyses:

  1. do not use phylogenetic methods with your ITS data. Instead, use non-phylogenetic methods diversity analyses, such as Jaccard (qualitative) or Bray Curtis (quantitative) for estimating beta diversity.
  2. Use q2-ghost-tree to infer phylogeny via grafting ITS sequences. This will enable use of phylogenetic methods, including UniFrac.

Many other tutorials (e.g., the overview for beginners) outline some of the possible downstream analyses possible with any feature table (in theory). So we will not revisit these options here. Let's talk about something else instead.

Evaluate accuracy

NOTE: If you are following this tutorial to analyze your own data (of real samples), stop here. The following section is specific for this tutorial and other mock communities. This is not a meaningful analysis of real samples.

We just analyzed a mock community, meaning that we know the expected composition of that sample's taxonomic abundances and sequences. The expected composition would be precisely what our inputs were, converted to relative abundances. Taxonomic names are mapped to the nearest match in our reference sequence database of choice (the same one we used for taxonomic classification) — something that mockrobiota already does for us. So we can use mock communities like this to evaluate he accuracy of different bioinformatics methods or pipelines. Let's compare the methods we used to see how well they did!

First we will grab the expected composition of this mock community:

wget https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-25/unite/7-1/99-otus/expected-taxonomy.tsv

That file was annotated with an older version of the UNITE database, so let's fix some of the annotations (to match the version we are using) before proceeding:

sed 's/s__Tylospora_asterophora/s__unidentified/' expected-taxonomy.tsv | sed 's/p__Zygomycota;c__Mortierellomycotina_cls_Incertae_sedis/p__Mortierellomycota;c__Mortierellomycetes/' | sed 's/c__Chytridiomycetes/c__Spizellomycetes/' | sed 's/g__Coprinopsis/g__Coprinopsis;s__unidentified/' | sed 's/g__Tricholoma/g__Tricholoma;s__Tricholoma_vaccinum/'> expected-taxonomy-mod.tsv

Convert to biom and import

biom convert \
  -i expected-taxonomy-mod.tsv \
  -o expected-taxonomy.biom \
  --table-type="OTU table" \
  --to-json
qiime tools import \
 --type FeatureTable[RelativeFrequency] \
 --input-path expected-taxonomy.biom \
 --input-format BIOMV100Format \
 --output-path expected-taxonomy.qza

Collapse our features on taxonomy and convert our feature table of observations to relative frequency:

qiime taxa collapse \
  --i-table dada2-single-end-table.qza \
  --i-taxonomy taxonomy-single-end.qza \
  --p-level 7 \
  --o-collapsed-table dada2-single-end-table-collapsed.qza

qiime feature-table relative-frequency \
  --i-table dada2-single-end-table-collapsed.qza \
  --o-relative-frequency-table dada2-single-end-table-relative.qza

Evaluate accuracy:

qiime quality-control evaluate-composition \
  --i-expected-features expected-taxonomy.qza \
  --i-observed-features dada2-single-end-table-relative.qza \
  --o-visualization dada2-single-end-evaluation.qzv

dada2-single-end-evaluation.qzv (357.4 KB)

What does it all mean? You can read more about TAR and TDR scores here. Other metrics should be more familiar to you.

What this is showing: We do pretty well! At least as far as TDR is concerned. TDR is high at species level, indicating a very low number of false negatives (see list at bottom). TAR increases with sequencing depth (as does # of observed species/expected), indicating an increasing number of false positives. What are we missing? Which species are being misclassified and why?

BONUS: we can use this evaluation to compare multiple methods. How does dada2 single end with cutadapt trimming compare to no trimming or other methods?

23 Likes
ITS fungal data processing
Mycobiome tools
Mycobiome_ITS_Remove primers & barcode sequence
Denoise using dada2 for paired end
Application of qiime2 in undergraduate education
itsxpress does not extract sequences where ITSx does
ITS data processing
Non-chimeric reference dataset selection for uchime-ref (reference-based chimera detection) for fungal ITS sequences
Cutadapt paired end read through
Phylogeny result intrepretation
Ghost tree filtering?
Which primer should I use in fungal ITS2 region amplicon sequencing
Fungal ITS sequence from Illumina returns many unclassified reads
How to incorporate mock community into my analysis
Dada2 analysis queries
Fungal ITS UNITE Naive Baysian Classifier problem
Feature Classifier version error
Training-feature-classifier 2019.10: Error
Full UNITE+INSD dataset for Fungi
feature counts by sample? and how to do correct rarefy
Dada2 error with grinder simulated sequence errors
Mock Communities for Assessing Quality Control
Compare taxa summaries in Qiime2
DADA2 trunc ITS2
Meaning of proportion output of q2 quality-control evaluate-composition
16SITS23S Analysis
Plugin error from quality-control: min() arg is an empty sequence
Data loss when trimming ITS
ITS classification - nothing beyond phylum level
fragment insertion sepp error
changing --p-max-ee-r to obtain more filtered sequence in DADA2
ITS cutadapt trimming of primer and reverse complement of the reverse primer
Suggestions for ASV table with ITS (fungi)
ITS data analysis : How to determine which dada2 analysis is the best
ITS2 primers trimming without the primers sequences
ITS data analysis : How to determine which dada2 analysis is the best
is it ok to do taxonomic classification by local alignment search tool (BLAST from NCBI) for OTUs obtained from QIIME2
quality scores boxplot bad quality
Mock community assessment and quality control?
ITS - fungal analysis
Error assigning taxonomy with UNITE
Newest ITS tutorial
ITS2 analysis SOP with QIIME 2?
Using different primer sets to amplify two different hypervariable regions on one illumina run
Cut-adapt for ITS primers w degenerative bases
Invalid Value: ' ' is not a writable directory --Help!
DADA2 with the ITS2
Problems with q2-quality-control using Mockrobiota
Problems with q2-quality-control using Mockrobiota
Classification of ITS question
SeqID for Classifier Contain both numerical and alphabetical characters
Elevated Unassigned sequences with ITS data
Mock Communities for Assessing Quality Control
Evaluating and controlling data quality with q2-quality-control
About importing files of qc_mock_observed and qc_mock_expected
ITS cutadapt trimming of primer and reverse complement of the reverse primer
Best approach for analyzing Sanger sequences from ITS2 region
"Same" database, different outputs
IUPAC character problem while importing UNITE dataset
cutadapt / trim-paired / option "front" and "adapter"
qiime feature-classifier UNITE Qiime-2019.7
UNITE error in Developer DB
Creating NCBI database and classifiers for 16S V3-V4 and ITS analyses

Since 2019.7 there has to be done more transforming with the latest unite DB. There are blanks at the end of the lines which are a problem.

I use sed -e '/^>/!s/\(.*\)/\U\1/;s/[[:blank:]]*$//' {input} > {output} for that.

4 Likes

Thanks @angrybee! Could you please clarify: is this a newer release of UNITE that has that issue? Please let us know the version. Thanks!

I use version 8 from 2.2.19 and the developer files.

Maybe a side effect from

[q2-types] @Oddant1 Made validation of FASTA files more robust.

There was a problem importing …/…/data/classifiers_trained_2019.7_190731/qiime_ver8_99_02.02.2019_dev_uppercase.fasta:
…/…/data/classifiers_trained_2019.7_190731/qiime_ver8_99_02.02.2019_dev_uppercase.fasta is not a(n) DNAFASTAFormat file:
Invalid characters on line 10274 (does not match IUPAC characters for a DNA sequence).

1 Like

You are right! Looks like the upgraded FASTA validator is doing its job!

I have edited the post above to include your fix, and credit you.

I think you can skip awk. sed does the uppercase and removes the blanks at line end for every line not starting with >

The sed command is not replacing all lowercases in the sequences. I am happy to drop the awk if you send a fixed sed command, but it’s not a big deal — the chained command takes seconds.

2 Likes

A post was split to a new topic: Cutadapt trimming in ITS tutorial

2 posts were split to a new topic: UNITE error in Developer DB

Hi,

I’m having trouble with the above sed command. It is adding a “U” to the beginning of each of my sequences, which throws an error when I try to create the classifier.

I’m new to sed so any help would be appreciated!

Here is the sed command I tried running, the only difference should be my file names:
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' unite_db_files/sh_refs_qiime_ver8_97_02.02.2019.fasta | sed -e '/^>/!s/\(.*\)/\U\1/;s/[[:blank:]]*$//' > unite_db_files/sh_refs_qiime_ver8_97_02.02.2019_uppercase.fasta

I played around with the command and if I remove /^>/!s/\(.*\)/\U\1/ (which is the portion adding the U) from the command it doesn’t add any U’s to my sequences and appears to retain the separation between the header and sequence portions of the file. The classifier step also runs without error.

Have I broken something I’m not aware of and my taxonomy classifications will be off or is it ok to use the command with the alteration I made?

Thanks,
Samantha

Thanks for reporting @saatkinson!

I am guessing this is breaking because of the new UNITE release (this was written based on an older UNITE release).

I think your edits are fine… I have tested using this command and it fixes the lowercase and blank space issues for this release:

awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver8_99_02.02.2019_dev.fasta | tr -d ' ' > developer/sh_refs_qiime_ver8_99_02.02.2019_dev_uppercase.fasta

I have updated the tutorial above to replace that command.

3 Likes

A post was split to a new topic: Feature Classifier version error

An off-topic reply has been split into a new topic: importing paired-end fungal ITS reads

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: Newest ITS tutorial

Please keep replies on-topic in the future.

Thanks for your really helpful tutorial. One issue I had running it with my own data was doing cutadapt in one step as it didn't trim the adapters correctly. Running it in two steps seems to work though. I think there have been a few other reports of this on the forum.

In case you're interested in the original issue and solutions: ITS cutadapt trimming of primer and reverse complement of the reverse primer and cutadapt / trim-paired / option "front" and "adapter" and Cut-adapt trim paired - different results when primers separate vs linked.

2 Likes

2 off-topic replies have been split into a new topic: analyzing paired-end fungal ITS sequence data with QIIME 2

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: Link BIOM table with taxonomy

Please keep replies on-topic in the future.