Fungal ITS analysis tutorial

Nicholas_Bokulich · December 9, 2018, 2:56pm

Fungal ITS analysis, mock communities, and more fun

NOTE: This tutorial was written in a QIIME 2 2018.11 environment. It is not guaranteed to work with earlier or later versions of QIIME 2. This tutorial was compiled as a working exercise for a QIIME 2 workshop in December 2018, and does not represent the only possible fungal ITS workflow with QIIME 2, or even a benchmarked protocol recommendation. See other tutorials, e.g., the q2-itsxpress tutorial for other fungal ITS analysis options in QIIME 2.

This tutorial details different modalities of analysis for fungal internal transcribed spacer (ITS) data with QIIME 2. Some of these concepts may generalize to other non-16S rRNA marker genes, such as:

issues with sequence length heterogeneity
fitting your own classifier using a custom database (must be in a QIIME 2-compatible format!)

Additionally, this tutorial will give a nice overview of using QIIME 2 and mock communities (e.g., from mockrobiota) to benchmark analysis pipelines. These concepts are useful for any type of marker gene, be it ITS, 16S rRNA, or anything else under the sun. :sunshine:

Let's begin.

mkdir fungi-tutorial
cd fungi-tutorial

Fit a classifier for the UNITE database

wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip

unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip

We need to scrub lowercase characters from the sequences to avoid errors downstream. This is because this version of the UNITE database contains some lowercase characters — this is not something that needs to be done to all databases prior to importing to QIIME 2. Some sequences also have pesky blank spaces at the ends of the lines, so we will remove blank spaces from these lines (credit to @angrybee for spotting this and adding the sed command below to fix this):

awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver7_99_01.12.2017_dev.fasta | tr -d ' ' > developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta

qiime tools import \
 --type FeatureData[Sequence] \
 --input-path developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta \
 --output-path unite-ver7-99-seqs-01.12.2017.qza

qiime tools import \
 --type FeatureData[Taxonomy] \
 --input-path developer/sh_taxonomy_qiime_ver7_99_01.12.2017_dev.txt \
 --output-path unite-ver7-99-tax-01.12.2017.qza \
 --input-format HeaderlessTSVTaxonomyFormat

Finally, let's just train the classifier. This will take some time! So we can just download one that I've made for you below

qiime feature-classifier fit-classifier-naive-bayes \
 --i-reference-reads unite-ver7-99-seqs-01.12.2017.qza \
 --i-reference-taxonomy unite-ver7-99-tax-01.12.2017.qza \
 --o-classifier unite-ver7-99-classifier-01.12.2017.qza

As promised, you can just download this instead:

wget https://s3-us-west-2.amazonaws.com/qiime2-data/community-contributions-data-resources/2018.11-unite-classifier/unite-ver7-99-classifier-01.12.2017.qza

Download an ITS mock community from mockrobiota

We will use a fungal ITS mock community published by Taylor et al., 2016. First download the sequence data and metadata from mockrobiota:

wget -O "mock-25-sample-metadata.tsv" https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-25/sample-metadata.tsv
wget https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-25/mock-forward-read.fastq.gz
wget https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-25/mock-reverse-read.fastq.gz

Create a fastq manifest file for importing these data into QIIME 2, fastqmanifest.csv:

echo "sample-id,absolute-filepath,direction" > fastqmanifest.csv
echo "Mock.1,$PWD/mock-forward-read.fastq.gz,forward" >> fastqmanifest.csv
echo "Mock.1,$PWD/mock-reverse-read.fastq.gz,reverse" >> fastqmanifest.csv

qiime tools import \
 --type SampleData[PairedEndSequencesWithQuality] \
 --input-path fastqmanifest.csv \
 --output-path demux.qza \
 --input-format PairedEndFastqManifestPhred33

qiime demux summarize \
 --i-data demux.qza \
 --o-visualization demux.qzv

To trim or not to trim

One issue with ITS (and other marker genes with vast length variability) is readthrough, which occurs when read lengths are longer than the amplicon itself! The polymerase will read through the amplicon, the primer, the barcode, and on into the adapter sequence. This is non-biological DNA that will cause major issues downstream, e.g., with sequence classification. So we want to trim primers from either end of the sequence to eliminate read-through issues. Enter cutadapt. Note that we trim the forward primer and the reverse complement of the reverse primer from the forward reads (the forward primers have already been trimmed in the raw reads, but we will demonstrate forward + reverse trimming here since attempting to trim the forward read will not hurt). We trim the reverse primer and reverse complement of the forward primer from the reverse reads.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AYTTAAGCATATCAATAAGCGGAGGCT \
  --p-front-f AACTTTYRRCAAYGGATCWCT \
  --p-adapter-r AGWGATCCRTTGYYRAAAGTT \
  --p-front-r AGCCTCCGCTTATTGATATGCTTAART \
  --o-trimmed-sequences demux-trimmed.qza

qiime demux summarize \
 --i-data demux-trimmed.qza \
 --o-visualization demux-trimmed.qzv

Compare the demux-trimmed.qzv summary to the demux.qzv summary that we made before trimming. It doesn't look like the sequences were trimmed, does it? According to Taylor et al. 2016 (the source of the primers and mock community used here), the predicted amplicon length range of the 5.8S-Fun + ITS4-Fun primer set is 267-511 bp (see Figure 1). So with the read lengths we are using here (250 nt) we should not see any read-through.

Denoising

Even though we have paired-end reads, the read quality in the mock-25 dataset is not good enough to support merging the expected amplicon length (See the "interactive quality plots" tab in demux.qzv to visualize how quality drops off in both forward and reverse reads.) We can use qiime dada2 denoise-paired to confirm that we lose the majority of our reads at the merging stage (I have already confirmed this). The best solution in this case is to proceed with denoise-single (using only the forward reads).

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left 13 \
  --p-trunc-len 160 \
  --o-representative-sequences dada2-single-end-rep-seqs.qza \
  --o-table dada2-single-end-table.qza \
  --o-denoising-stats dada2-single-end-stats.qza

qiime metadata tabulate \
  --m-input-file dada2-single-end-stats.qza \
  --o-visualization dada2-single-end-stats.qzv

In your own data you should follow this process to determine whether you have enough sequence length to proceed with pairend-end sequences, or if you must proceed with single-end reads due to insufficient read overlap (as we have done above):

Determine the expected amplicon length for your primers
Check the amplicon length and assess the quality of your reads (use the demux.qzv or demux-trimmed.qzv summaries)
Are your (trimmed) reads long enough to overlap? dada2 expects a minimum overlap of 12 nt, so your reads are long enough if: length(forward read) + length(reverse read) ≥ 12 nt.
If your answer to #3 is "definitely not!", proceed with single-end reads as we have shown here. If your answer to #3 is "maybe?", give paired-end denoising a shot but pay very close attention to how many reads are being lost after merging (check out the stats file that is output by dada2).

Taxonomic classification

Now we will use the taxonomy classifier that we trained above to predict the taxonomic assignment for each of our denoised sequence variants.

qiime feature-classifier classify-sklearn \
  --i-classifier unite-ver7-99-classifier-01.12.2017.qza \
  --i-reads dada2-single-end-rep-seqs.qza \
  --o-classification taxonomy-single-end.qza

qiime taxa barplot \
  --i-table dada2-single-end-table.qza \
  --i-taxonomy taxonomy-single-end.qza \
  --m-metadata-file mock-25-sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

taxa-bar-plots.qzv (328.4 KB)

Other analyses

Now that you have a feature table and taxonomy classifications, you can typically proceed with analysis of fungal ITS (and other marker-gene data) in the same way that you would 16S rRNA gene data. With one major exception. Although this marker provides more accurate species identification than other fungal marker genes, alignments of ITS sequences are not useful for informing evolutionary distances among distantly related species (but can yield good alignments for closely related sequences). So you have two options for diversity and other potentially phylogenetic analyses:

do not use phylogenetic methods with your ITS data. Instead, use non-phylogenetic methods diversity analyses, such as Jaccard (qualitative) or Bray Curtis (quantitative) for estimating beta diversity.
Use q2-ghost-tree to infer phylogeny via grafting ITS sequences. This will enable use of phylogenetic methods, including UniFrac. Edit 2024: ghosttree is no longer maintained, so is no longer a solution for estimating phylogenies from ITS sequences

Many other tutorials (e.g., the overview for beginners) outline some of the possible downstream analyses possible with any feature table (in theory). So we will not revisit these options here. Let's talk about something else instead.

Evaluate accuracy

NOTE: If you are following this tutorial to analyze your own data (of real samples), stop here. The following section is specific for this tutorial and other mock communities. This is not a meaningful analysis of real samples.

We just analyzed a mock community, meaning that we know the expected composition of that sample's taxonomic abundances and sequences. The expected composition would be precisely what our inputs were, converted to relative abundances. Taxonomic names are mapped to the nearest match in our reference sequence database of choice (the same one we used for taxonomic classification) — something that mockrobiota already does for us. So we can use mock communities like this to evaluate he accuracy of different bioinformatics methods or pipelines. Let's compare the methods we used to see how well they did!

First we will grab the expected composition of this mock community:

wget https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-25/unite/7-1/99-otus/expected-taxonomy.tsv

That file was annotated with an older version of the UNITE database, so let's fix some of the annotations (to match the version we are using) before proceeding:

sed 's/s__Tylospora_asterophora/s__unidentified/' expected-taxonomy.tsv | sed 's/p__Zygomycota;c__Mortierellomycotina_cls_Incertae_sedis/p__Mortierellomycota;c__Mortierellomycetes/' | sed 's/c__Chytridiomycetes/c__Spizellomycetes/' | sed 's/g__Coprinopsis/g__Coprinopsis;s__unidentified/' | sed 's/g__Tricholoma/g__Tricholoma;s__Tricholoma_vaccinum/'> expected-taxonomy-mod.tsv

Convert to biom and import

biom convert \
  -i expected-taxonomy-mod.tsv \
  -o expected-taxonomy.biom \
  --table-type="OTU table" \
  --to-json
qiime tools import \
 --type FeatureTable[RelativeFrequency] \
 --input-path expected-taxonomy.biom \
 --input-format BIOMV100Format \
 --output-path expected-taxonomy.qza

Collapse our features on taxonomy and convert our feature table of observations to relative frequency:

qiime taxa collapse \
  --i-table dada2-single-end-table.qza \
  --i-taxonomy taxonomy-single-end.qza \
  --p-level 7 \
  --o-collapsed-table dada2-single-end-table-collapsed.qza

qiime feature-table relative-frequency \
  --i-table dada2-single-end-table-collapsed.qza \
  --o-relative-frequency-table dada2-single-end-table-relative.qza

Evaluate accuracy:

qiime quality-control evaluate-composition \
  --i-expected-features expected-taxonomy.qza \
  --i-observed-features dada2-single-end-table-relative.qza \
  --o-visualization dada2-single-end-evaluation.qzv

dada2-single-end-evaluation.qzv (357.4 KB)

What does it all mean? You can read more about TAR and TDR scores here. Other metrics should be more familiar to you.

What this is showing: We do pretty well! At least as far as TDR is concerned. TDR is high at species level, indicating a very low number of false negatives (see list at bottom). TAR increases with sequencing depth (as does # of observed species/expected), indicating an increasing number of false positives. What are we missing? Which species are being misclassified and why?

BONUS: we can use this evaluation to compare multiple methods. How does dada2 single end with cutadapt trimming compare to no trimming or other methods?

angrybee · July 31, 2019, 11:52am

Since 2019.7 there has to be done more transforming with the latest unite DB. There are blanks at the end of the lines which are a problem.

I use sed -e '/^>/!s/$.*$/\U\1/;s/[[:blank:]]*$//' {input} > {output} for that.

Nicholas_Bokulich · July 31, 2019, 11:53am

Thanks @angrybee! Could you please clarify: is this a newer release of UNITE that has that issue? Please let us know the version. Thanks!

angrybee · July 31, 2019, 12:21pm

I use version 8 from 2.2.19 and the developer files.

Maybe a side effect from

[q2-types] @Oddant1 Made validation of FASTA files more robust.

There was a problem importing ../../data/classifiers_trained_2019.7_190731/qiime_ver8_99_02.02.2019_dev_uppercase.fasta:
../../data/classifiers_trained_2019.7_190731/qiime_ver8_99_02.02.2019_dev_uppercase.fasta is not a(n) DNAFASTAFormat file:
Invalid characters on line 10274 (does not match IUPAC characters for a DNA sequence).

Nicholas_Bokulich · July 31, 2019, 12:40pm

You are right! Looks like the upgraded FASTA validator is doing its job!

I have edited the post above to include your fix, and credit you.

angrybee · July 31, 2019, 2:16pm

I think you can skip awk. sed does the uppercase and removes the blanks at line end for every line not starting with >

Nicholas_Bokulich · July 31, 2019, 2:22pm

The sed command is not replacing all lowercases in the sequences. I am happy to drop the awk if you send a fixed sed command, but it's not a big deal — the chained command takes seconds.

jwdebelius · August 28, 2019, 7:43am

A post was split to a new topic: Cutadapt trimming in ITS tutorial

jwdebelius · September 30, 2019, 1:48pm

2 posts were split to a new topic: UNITE error in Developer DB

saatkinson · January 16, 2020, 5:00pm

Hi,

I'm having trouble with the above sed command. It is adding a "U" to the beginning of each of my sequences, which throws an error when I try to create the classifier.

I'm new to sed so any help would be appreciated!

Here is the sed command I tried running, the only difference should be my file names:
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' unite_db_files/sh_refs_qiime_ver8_97_02.02.2019.fasta | sed -e '/^>/!s/$.*$/\U\1/;s/[[:blank:]]*$//' > unite_db_files/sh_refs_qiime_ver8_97_02.02.2019_uppercase.fasta

I played around with the command and if I remove /^>/!s/$.*$/\U\1/ (which is the portion adding the U) from the command it doesn't add any U's to my sequences and appears to retain the separation between the header and sequence portions of the file. The classifier step also runs without error.

Have I broken something I'm not aware of and my taxonomy classifications will be off or is it ok to use the command with the alteration I made?

Thanks,
Samantha

Nicholas_Bokulich · January 16, 2020, 6:31pm

Thanks for reporting @saatkinson!

I am guessing this is breaking because of the new UNITE release (this was written based on an older UNITE release).

I think your edits are fine... I have tested using this command and it fixes the lowercase and blank space issues for this release:

awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver8_99_02.02.2019_dev.fasta | tr -d ' ' > developer/sh_refs_qiime_ver8_99_02.02.2019_dev_uppercase.fasta

I have updated the tutorial above to replace that command.

jwdebelius · January 24, 2020, 2:48pm

A post was split to a new topic: Feature Classifier version error

Nicholas_Bokulich · November 18, 2021, 6:12am

An off-topic reply has been split into a new topic: importing paired-end fungal ITS reads