Fungal ITS analysis tutorial

(Nicholas Bokulich) #1

Fungal ITS analysis, mock communities, and more fun

NOTE: This tutorial was written in a QIIME 2 2018.11 environment. It is not guaranteed to work with earlier or later versions of QIIME 2. This tutorial was compiled as a working exercise for a QIIME 2 workshop in December 2018, and does not represent the only possible fungal ITS workflow with QIIME 2, or even a benchmarked protocol recommendation. See other tutorials, e.g., the q2-itsxpress tutorial for other fungal ITS analysis options in QIIME 2.

This tutorial details different modalities of analysis for fungal internal transcribed spacer (ITS) data with QIIME 2. Some of these concepts may generalize to other non-16S rRNA marker genes, such as:

  1. issues with sequence length heterogeneity
  2. fitting your own classifier using a custom database (must be in a QIIME 2-compatible format!)

Additionally, this tutorial will give a nice overview of using QIIME 2 and mock communities (e.g., from mockrobiota) to benchmark analysis pipelines. These concepts are useful for any type of marker gene, be it ITS, 16S rRNA, or anything else under the sun. :sunshine:

Let’s begin.

mkdir fungi-tutorial
cd fungi-tutorial

Fit a classifier for the UNITE database


We need to scrub lowercase characters from the sequences to avoid errors downstream. This is because this version of the UNITE database contains some lowercase characters — this is not something that needs to be done to all databases prior to importing to QIIME 2.

awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver7_99_01.12.2017_dev.fasta > developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta
qiime tools import \
 --type FeatureData[Sequence] \
 --input-path developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta \
 --output-path unite-ver7-99-seqs-01.12.2017.qza
qiime tools import \
 --type FeatureData[Taxonomy] \
 --input-path developer/sh_taxonomy_qiime_ver7_99_01.12.2017_dev.txt \
 --output-path unite-ver7-99-tax-01.12.2017.qza \
 --input-format HeaderlessTSVTaxonomyFormat

Finally, let’s just train the classifier. This will take some time! So we can just download one that I’ve made for you below :wink:

qiime feature-classifier fit-classifier-naive-bayes \
 --i-reference-reads unite-ver7-99-seqs-01.12.2017.qza \
 --i-reference-taxonomy unite-ver7-99-tax-01.12.2017.qza \
 --o-classifier unite-ver7-99-classifier-01.12.2017.qza

As promised, you can just download this instead:


Download an ITS mock community from mockrobiota

wget -O "mock-25-sample-metadata.tsv"

Create a fastq manifest file for importing these data into QIIME 2, fastqmanifest.csv:

echo "sample-id,absolute-filepath,direction" > fastqmanifest.csv
echo "Mock.1,$PWD/mock-forward-read.fastq.gz,forward" >> fastqmanifest.csv
echo "Mock.1,$PWD/mock-reverse-read.fastq.gz,reverse" >> fastqmanifest.csv
qiime tools import \
 --type SampleData[PairedEndSequencesWithQuality] \
 --input-path fastqmanifest.csv \
 --output-path demux.qza \
 --input-format PairedEndFastqManifestPhred33
qiime demux summarize \
 --i-data demux.qza \
 --o-visualization demux.qzv

To trim or not to trim

One issue with ITS (and other marker genes with vast length variability) is readthrough, which occurs when read lengths are longer than the amplicon itself! The polymerase will read through the amplicon, the primer, the barcode, and on into the adapter sequence. This is non-biological DNA that will cause major issues downstream, e.g., with sequence classification. So we want to trim primers from either end of the sequence to eliminate read-through issues. Enter cutadapt. Note that we trim the forward primer (just in case it is still on the front of our sequences) and the reverse complement of the reverse primer from the forward reads. We trim the reverse primer and reverse complement of the forward primer from the reverse reads.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --o-trimmed-sequences demux-trimmed.qza
qiime demux summarize \
 --i-data demux-trimmed.qza \
 --o-visualization demux-trimmed.qzv


qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left 13 \
  --p-trunc-len 160 \
  --o-representative-sequences dada2-single-end-rep-seqs.qza \
  --o-table dada2-single-end-table.qza \
  --o-denoising-stats dada2-single-end-stats.qza
qiime metadata tabulate \
  --m-input-file dada2-single-end-stats.qza \
  --o-visualization dada2-single-end-stats.qzv

Taxonomic classification

qiime feature-classifier classify-sklearn \
  --i-classifier unite-ver7-99-classifier-01.12.2017.qza \
  --i-reads dada2-single-end-rep-seqs.qza \
  --o-classification taxonomy-single-end.qza
qiime taxa barplot \
  --i-table dada2-single-end-table.qza \
  --i-taxonomy taxonomy-single-end.qza \
  --m-metadata-file mock-25-sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

taxa-bar-plots.qzv (328.4 KB)

Other analyses

Now that you have a feature table and taxonomy classifications, you can typically proceed with analysis of fungal ITS (and other marker-gene data) in the same way that you would 16S rRNA gene data. With one major exception. Although this marker provides more accurate species identification than other fungal marker genes, alignments of ITS sequences are not useful for informing evolutionary distances among distantly related species (but can yield good alignments for closely related sequences). So you have two options for diversity and other potentially phylogenetic analyses:

  1. do not use phylogenetic methods with your ITS data. Instead, use non-phylogenetic methods diversity analyses, such as Jaccard (qualitative) or Bray Curtis (quantitative) for estimating beta diversity.
  2. Use q2-ghost-tree to infer phylogeny via grafting ITS sequences. This will enable use of phylogenetic methods, including UniFrac.

Many other tutorials (e.g., the overview for beginners) outline some of the possible downstream analyses possible with any feature table (in theory). So we will not revisit these options here. Let’s talk about something else instead.

Evaluate accuracy

NOTE: If you are following this tutorial to analyze your own data (of real samples), stop here. The following section is specific for this tutorial and other mock communities. This is not a meaningful analysis of real samples.

We just analyzed a mock community, meaning that we know the expected composition of that sample’s taxonomic abundances and sequences. The expected composition would be precisely what our inputs were, converted to relative abundances. Taxonomic names are mapped to the nearest match in our reference sequence database of choice (the same one we used for taxonomic classification) — something that mockrobiota already does for us. So we can use mock communities like this to evaluate he accuracy of different bioinformatics methods or pipelines. Let’s compare the methods we used to see how well they did!

First we will grab the expected composition of this mock community:


That file was annotated with an older version of the UNITE database, so let’s fix some of the annotations (to match the version we are using) before proceeding:

sed 's/s__Tylospora_asterophora/s__unidentified/' expected-taxonomy.tsv | sed 's/p__Zygomycota;c__Mortierellomycotina_cls_Incertae_sedis/p__Mortierellomycota;c__Mortierellomycetes/' | sed 's/c__Chytridiomycetes/c__Spizellomycetes/' | sed 's/g__Coprinopsis/g__Coprinopsis;s__unidentified/' | sed 's/g__Tricholoma/g__Tricholoma;s__Tricholoma_vaccinum/'> expected-taxonomy-mod.tsv

Convert to biom and import

biom convert \
  -i expected-taxonomy-mod.tsv \
  -o expected-taxonomy.biom \
  --table-type="OTU table" \
qiime tools import \
 --type FeatureTable[RelativeFrequency] \
 --input-path expected-taxonomy.biom \
 --input-format BIOMV100Format \
 --output-path expected-taxonomy.qza

Collapse our features on taxonomy and convert our feature table of observations to relative frequency:

qiime taxa collapse \
  --i-table dada2-single-end-table.qza \
  --i-taxonomy taxonomy-single-end.qza \
  --p-level 7 \
  --o-collapsed-table dada2-single-end-table-collapsed.qza

qiime feature-table relative-frequency \
  --i-table dada2-single-end-table-collapsed.qza \
  --o-relative-frequency-table dada2-single-end-table-relative.qza

Evaluate accuracy:

qiime quality-control evaluate-composition \
  --i-expected-features expected-taxonomy.qza \
  --i-observed-features dada2-single-end-table-relative.qza \
  --o-visualization dada2-single-end-evaluation.qzv

dada2-single-end-evaluation.qzv (357.4 KB)

What does it all mean? You can read more about TAR and TDR scores here. Other metrics should be more familiar to you.

What this is showing: We do pretty well! At least as far as TDR is concerned. TDR is high at species level, indicating a very low number of false negatives (see list at bottom). TAR increases with sequencing depth (as does # of observed species/expected), indicating an increasing number of false positives. What are we missing? Which species are being misclassified and why?

BONUS: we can use this evaluation to compare multiple methods. How does dada2 single end with cutadapt trimming compare to no trimming or other methods?

Mycobiome tools
Denoise using dada2 for paired end
Mycobiome_ITS_Remove primers & barcode sequence
Application of qiime2 in undergraduate education