Fungal ITS analysis, mock communities, and more fun
NOTE: This tutorial was written in a QIIME 2 2018.11 environment. It is not guaranteed to work with earlier or later versions of QIIME 2. This tutorial was compiled as a working exercise for a QIIME 2 workshop in December 2018, and does not represent the only possible fungal ITS workflow with QIIME 2, or even a benchmarked protocol recommendation. See other tutorials, e.g., the q2-itsxpress tutorial for other fungal ITS analysis options in QIIME 2.
This tutorial details different modalities of analysis for fungal internal transcribed spacer (ITS) data with QIIME 2. Some of these concepts may generalize to other non-16S rRNA marker genes, such as:
- issues with sequence length heterogeneity
- fitting your own classifier using a custom database (must be in a QIIME 2-compatible format!)
Additionally, this tutorial will give a nice overview of using QIIME 2 and mock communities (e.g., from mockrobiota) to benchmark analysis pipelines. These concepts are useful for any type of marker gene, be it ITS, 16S rRNA, or anything else under the sun. :sunshine:
Let's begin.
mkdir fungi-tutorial
cd fungi-tutorial
Fit a classifier for the UNITE database
wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
We need to scrub lowercase characters from the sequences to avoid errors downstream. This is because this version of the UNITE database contains some lowercase characters — this is not something that needs to be done to all databases prior to importing to QIIME 2. Some sequences also have pesky blank spaces at the ends of the lines, so we will remove blank spaces from these lines (credit to @angrybee for spotting this and adding the sed
command below to fix this):
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' developer/sh_refs_qiime_ver7_99_01.12.2017_dev.fasta | tr -d ' ' > developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta
qiime tools import \
--type FeatureData[Sequence] \
--input-path developer/sh_refs_qiime_ver7_99_01.12.2017_dev_uppercase.fasta \
--output-path unite-ver7-99-seqs-01.12.2017.qza
qiime tools import \
--type FeatureData[Taxonomy] \
--input-path developer/sh_taxonomy_qiime_ver7_99_01.12.2017_dev.txt \
--output-path unite-ver7-99-tax-01.12.2017.qza \
--input-format HeaderlessTSVTaxonomyFormat
Finally, let's just train the classifier. This will take some time! So we can just download one that I've made for you below
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads unite-ver7-99-seqs-01.12.2017.qza \
--i-reference-taxonomy unite-ver7-99-tax-01.12.2017.qza \
--o-classifier unite-ver7-99-classifier-01.12.2017.qza
As promised, you can just download this instead:
wget https://s3-us-west-2.amazonaws.com/qiime2-data/community-contributions-data-resources/2018.11-unite-classifier/unite-ver7-99-classifier-01.12.2017.qza
Download an ITS mock community from mockrobiota
We will use a fungal ITS mock community published by Taylor et al., 2016. First download the sequence data and metadata from mockrobiota:
wget -O "mock-25-sample-metadata.tsv" https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-25/sample-metadata.tsv
wget https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-25/mock-forward-read.fastq.gz
wget https://s3-us-west-2.amazonaws.com/mockrobiota/latest/mock-25/mock-reverse-read.fastq.gz
Create a fastq manifest file for importing these data into QIIME 2, fastqmanifest.csv
:
echo "sample-id,absolute-filepath,direction" > fastqmanifest.csv
echo "Mock.1,$PWD/mock-forward-read.fastq.gz,forward" >> fastqmanifest.csv
echo "Mock.1,$PWD/mock-reverse-read.fastq.gz,reverse" >> fastqmanifest.csv
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-path fastqmanifest.csv \
--output-path demux.qza \
--input-format PairedEndFastqManifestPhred33
qiime demux summarize \
--i-data demux.qza \
--o-visualization demux.qzv
To trim or not to trim
One issue with ITS (and other marker genes with vast length variability) is readthrough, which occurs when read lengths are longer than the amplicon itself! The polymerase will read through the amplicon, the primer, the barcode, and on into the adapter sequence. This is non-biological DNA that will cause major issues downstream, e.g., with sequence classification. So we want to trim primers from either end of the sequence to eliminate read-through issues. Enter cutadapt. Note that we trim the forward primer and the reverse complement of the reverse primer from the forward reads (the forward primers have already been trimmed in the raw reads, but we will demonstrate forward + reverse trimming here since attempting to trim the forward read will not hurt). We trim the reverse primer and reverse complement of the forward primer from the reverse reads.
qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux.qza \
--p-adapter-f AYTTAAGCATATCAATAAGCGGAGGCT \
--p-front-f AACTTTYRRCAAYGGATCWCT \
--p-adapter-r AGWGATCCRTTGYYRAAAGTT \
--p-front-r AGCCTCCGCTTATTGATATGCTTAART \
--o-trimmed-sequences demux-trimmed.qza
qiime demux summarize \
--i-data demux-trimmed.qza \
--o-visualization demux-trimmed.qzv
Compare the demux-trimmed.qzv
summary to the demux.qzv
summary that we made before trimming. It doesn't look like the sequences were trimmed, does it? According to Taylor et al. 2016 (the source of the primers and mock community used here), the predicted amplicon length range of the 5.8S-Fun + ITS4-Fun primer set is 267-511 bp (see Figure 1). So with the read lengths we are using here (250 nt) we should not see any read-through.
Denoising
Even though we have paired-end reads, the read quality in the mock-25 dataset is not good enough to support merging the expected amplicon length (See the "interactive quality plots" tab in demux.qzv
to visualize how quality drops off in both forward and reverse reads.) We can use qiime dada2 denoise-paired
to confirm that we lose the majority of our reads at the merging stage (I have already confirmed this). The best solution in this case is to proceed with denoise-single
(using only the forward reads).
qiime dada2 denoise-single \
--i-demultiplexed-seqs demux-trimmed.qza \
--p-trim-left 13 \
--p-trunc-len 160 \
--o-representative-sequences dada2-single-end-rep-seqs.qza \
--o-table dada2-single-end-table.qza \
--o-denoising-stats dada2-single-end-stats.qza
qiime metadata tabulate \
--m-input-file dada2-single-end-stats.qza \
--o-visualization dada2-single-end-stats.qzv
In your own data you should follow this process to determine whether you have enough sequence length to proceed with pairend-end sequences, or if you must proceed with single-end reads due to insufficient read overlap (as we have done above):
- Determine the expected amplicon length for your primers
- Check the amplicon length and assess the quality of your reads (use the
demux.qzv
ordemux-trimmed.qzv
summaries) - Are your (trimmed) reads long enough to overlap? dada2 expects a minimum overlap of 12 nt, so your reads are long enough if: length(forward read) + length(reverse read) ≥ 12 nt.
- If your answer to #3 is "definitely not!", proceed with single-end reads as we have shown here. If your answer to #3 is "maybe?", give paired-end denoising a shot but pay very close attention to how many reads are being lost after merging (check out the stats file that is output by dada2).
Taxonomic classification
Now we will use the taxonomy classifier that we trained above to predict the taxonomic assignment for each of our denoised sequence variants.
qiime feature-classifier classify-sklearn \
--i-classifier unite-ver7-99-classifier-01.12.2017.qza \
--i-reads dada2-single-end-rep-seqs.qza \
--o-classification taxonomy-single-end.qza
qiime taxa barplot \
--i-table dada2-single-end-table.qza \
--i-taxonomy taxonomy-single-end.qza \
--m-metadata-file mock-25-sample-metadata.tsv \
--o-visualization taxa-bar-plots.qzv
taxa-bar-plots.qzv (328.4 KB)
Other analyses
Now that you have a feature table and taxonomy classifications, you can typically proceed with analysis of fungal ITS (and other marker-gene data) in the same way that you would 16S rRNA gene data. With one major exception. Although this marker provides more accurate species identification than other fungal marker genes, alignments of ITS sequences are not useful for informing evolutionary distances among distantly related species (but can yield good alignments for closely related sequences). So you have two options for diversity and other potentially phylogenetic analyses:
- do not use phylogenetic methods with your ITS data. Instead, use non-phylogenetic methods diversity analyses, such as Jaccard (qualitative) or Bray Curtis (quantitative) for estimating beta diversity.
- Use q2-ghost-tree to infer phylogeny via grafting ITS sequences. This will enable use of phylogenetic methods, including UniFrac. Edit 2024: ghosttree is no longer maintained, so is no longer a solution for estimating phylogenies from ITS sequences
Many other tutorials (e.g., the overview for beginners) outline some of the possible downstream analyses possible with any feature table (in theory). So we will not revisit these options here. Let's talk about something else instead.
Evaluate accuracy
NOTE: If you are following this tutorial to analyze your own data (of real samples), stop here. The following section is specific for this tutorial and other mock communities. This is not a meaningful analysis of real samples.
We just analyzed a mock community, meaning that we know the expected composition of that sample's taxonomic abundances and sequences. The expected composition would be precisely what our inputs were, converted to relative abundances. Taxonomic names are mapped to the nearest match in our reference sequence database of choice (the same one we used for taxonomic classification) — something that mockrobiota already does for us. So we can use mock communities like this to evaluate he accuracy of different bioinformatics methods or pipelines. Let's compare the methods we used to see how well they did!
First we will grab the expected composition of this mock community:
wget https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-25/unite/7-1/99-otus/expected-taxonomy.tsv
That file was annotated with an older version of the UNITE database, so let's fix some of the annotations (to match the version we are using) before proceeding:
sed 's/s__Tylospora_asterophora/s__unidentified/' expected-taxonomy.tsv | sed 's/p__Zygomycota;c__Mortierellomycotina_cls_Incertae_sedis/p__Mortierellomycota;c__Mortierellomycetes/' | sed 's/c__Chytridiomycetes/c__Spizellomycetes/' | sed 's/g__Coprinopsis/g__Coprinopsis;s__unidentified/' | sed 's/g__Tricholoma/g__Tricholoma;s__Tricholoma_vaccinum/'> expected-taxonomy-mod.tsv
Convert to biom and import
biom convert \
-i expected-taxonomy-mod.tsv \
-o expected-taxonomy.biom \
--table-type="OTU table" \
--to-json
qiime tools import \
--type FeatureTable[RelativeFrequency] \
--input-path expected-taxonomy.biom \
--input-format BIOMV100Format \
--output-path expected-taxonomy.qza
Collapse our features on taxonomy and convert our feature table of observations to relative frequency:
qiime taxa collapse \
--i-table dada2-single-end-table.qza \
--i-taxonomy taxonomy-single-end.qza \
--p-level 7 \
--o-collapsed-table dada2-single-end-table-collapsed.qza
qiime feature-table relative-frequency \
--i-table dada2-single-end-table-collapsed.qza \
--o-relative-frequency-table dada2-single-end-table-relative.qza
Evaluate accuracy:
qiime quality-control evaluate-composition \
--i-expected-features expected-taxonomy.qza \
--i-observed-features dada2-single-end-table-relative.qza \
--o-visualization dada2-single-end-evaluation.qzv
dada2-single-end-evaluation.qzv (357.4 KB)
What does it all mean? You can read more about TAR and TDR scores here. Other metrics should be more familiar to you.
What this is showing: We do pretty well! At least as far as TDR is concerned. TDR is high at species level, indicating a very low number of false negatives (see list at bottom). TAR increases with sequencing depth (as does # of observed species/expected), indicating an increasing number of false positives. What are we missing? Which species are being misclassified and why?
BONUS: we can use this evaluation to compare multiple methods. How does dada2 single end with cutadapt trimming compare to no trimming or other methods?