Q2-ITSxpress: a QIIME 2 plugin to trim ITS sequences
Adam R. Rivers - USDA Agricultural Research Service
Background
The internally transcribed spacer (ITS) region is a widely used phylogenetic marker for fungi and other taxa. Previous work by Nilsson et al. (2009) showed that removing the conserved regions around the ITS results in more accurate taxonomic classification. An existing program, ITSx, can trim FASTA sequences by matching HMM profiles to the ends of the flanking conserved genes. ITSxpress extends this technique to trim the FASTQ files needed by the newer exact sequence variant methods used in QIIME 2: Dada2 and Deblur. ITSxpress processes QIIME artifacts of the type SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality].
The plugin:
- Merges reads (if paired-end) using BBMerge
- Temporarily clusters highly similar sequences that are common in amplicon data using VSEARCH
- Identifies the ITS start and stop sites using Hmmsearch on the representative sequences
- Trims each original, merged sequence with quality scores, returning the merged or unmerged sequences with quality scores in a .qza file
ITSxpress speeds up the trimming of reads by a factor of 14-23 times on a 4-core computer by temporarily clustering highly similar sequences that are common in amplicon data and utilizing optimized parameters for Hmmsearch. For more information see the paper.
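The benefit of clustering can be seen with a toy example. The sketch below is not what ITSxpress runs (the plugin uses VSEARCH); it only counts exact duplicate sequences in a FASTQ stream with awk, to illustrate how few representative sequences may need an HMM search. The function name count_unique_seqs and the four-line FASTQ record layout are assumptions made for illustration.

```shell
# Count total and exactly-unique sequences in an uncompressed,
# four-line-per-record FASTQ stream read from standard input.
# This is plain dereplication, a stand-in for VSEARCH clustering.
count_unique_seqs() {
    awk 'NR % 4 == 2 { total++; if (!($0 in seen)) { seen[$0] = 1; nuniq++ } }
         END { printf "total=%d unique=%d\n", total, nuniq }'
}

# Toy input: three reads, two of which are identical.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' | count_unique_seqs
```

On a real file the same function could be fed with something like `gzip -dc sample1_r1.fq.gz | count_unique_seqs`.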
ITSxpress is also available as a stand-alone software package from GitHub, PyPI and Bioconda.
Installation
The instructions assume that you installed QIIME 2 natively using Conda and are using ITSxpress version 1.7.0.
Activate the QIIME 2 Conda environment.
source activate qiime2-2018.8
Install ITSxpress using Bioconda and Q2-itsxpress using pip. Be sure to install ITSxpress and Q2-itsxpress in the QIIME 2 environment, meaning you ran the step above first.
conda install -c bioconda itsxpress
pip install q2-itsxpress
In your QIIME 2 environment, refresh the plugins.
qiime dev refresh-cache
Check to see if the ITSxpress plugin is installed. After running this command you should see a basic help menu.
qiime itsxpress
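If the help menu does not appear, it may help to confirm that the programs the plugin relies on are visible in the active environment. The following is a minimal sketch; the function name report_tools is hypothetical, and the executable names (bbmerge.sh, vsearch, hmmsearch, itsxpress) are assumptions based on how these packages are usually installed.

```shell
# Report whether each named program can be found on PATH.
report_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "not found: $tool"
        fi
    done
}

# Executable names are assumed, not taken from the ITSxpress documentation.
report_tools qiime itsxpress bbmerge.sh vsearch hmmsearch
```

Any "not found" line suggests the corresponding package was installed outside the QIIME 2 environment or not at all.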
Tutorial
Note: this tutorial was updated for ITSxpress 1.7.0 on 08/13/2018.
Recommendations about how to trim reads for use by Dada2 have changed.
This tutorial walks the user through the first portion of a typical ITS workflow:
- Trimming the ITS region with ITSxpress
- Calling sequence variants with Dada2 or Deblur
- Training the QIIME 2 classifier
- Classifying the sequences taxonomically
For this tutorial we will be starting with two paired-end samples that have already been demultiplexed into forward and reverse FASTQ files. A manifest file, which lists the samples, files and read orientation, is also used. The example manifest uses the $PWD variable to complete the path for your computer. If you have issues you can replace it with the direct path.
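For orientation, a manifest in the PairedEndFastqManifestPhred33 format is a comma-separated file with a header row and one line per FASTQ file. The sketch below writes an illustrative copy with the direct paths already filled in, since the shell expands $PWD while writing the file. The file name manifest_example.txt and the sample IDs sample1 and sample2 are assumptions chosen for illustration; the tutorial's own manifest.txt may differ in detail.

```shell
# Write an illustrative manifest with absolute paths filled in by the shell.
# manifest_example.txt is a hypothetical name, used so that the downloaded
# manifest.txt is not overwritten. Sample IDs are assumed.
cat > manifest_example.txt <<EOF
sample-id,absolute-filepath,direction
sample1,$PWD/sample1_r1.fq.gz,forward
sample1,$PWD/sample1_r2.fq.gz,reverse
sample2,$PWD/sample2_r1.fq.gz,forward
sample2,$PWD/sample2_r2.fq.gz,reverse
EOF

# Show the result; every path should start with the current directory.
cat manifest_example.txt
```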
Example data
We will be using data from two soil samples which have had their ITS1 region amplified with fungal primers. They have been subsampled to 10,000 read pairs for faster processing.
- sample1_r1.fq.gz and sample1_r2.fq.gz
- sample2_r1.fq.gz and sample2_r2.fq.gz
- A manifest file: manifest.txt
- A mapping file: mapping.txt
If you have the command line program wget you can download the data with these commands:
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample1_r1.fq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample1_r2.fq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample2_r1.fq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample2_r2.fq.gz
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/manifest.txt
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/mapping.txt
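Once the downloads finish, a quick check that all six files arrived in one directory can save a failed import later. This is a small sketch; the function name check_tutorial_files is hypothetical.

```shell
# Report which of the tutorial's input files are missing or empty in a
# directory. Returns 0 when all six files are present, 1 otherwise.
check_tutorial_files() {
    dir="${1:-.}"
    missing=0
    for f in sample1_r1.fq.gz sample1_r2.fq.gz \
             sample2_r1.fq.gz sample2_r2.fq.gz \
             manifest.txt mapping.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the current directory before running the import step.
if check_tutorial_files .; then
    echo "all tutorial files present"
fi
```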
Import the sequence data
Make sure all the data files are in the same directory, then import the demultiplexed data into QIIME.
NOTE: If you have multiplexed data in a format like EMPPairedEndSequences you will need to demultiplex it first using the demux plugin. For a paired-end example see this tutorial.
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-format PairedEndFastqManifestPhred33 \
--input-path manifest.txt \
--output-path sequences.qza
Run time: 4 seconds
We can see the quality of the data by running the summarize command.
qiime demux summarize \
--i-data sequences.qza \
--o-visualization sequences.qzv
Run time: 4 seconds
Trimming ITS samples with Q2-ITSxpress for Dada2
The ITSxpress command trim-pair-output-unmerged takes paired-end QIIME artifacts of type SampleData[PairedEndSequencesWithQuality] for trimming. It merges the reads, temporarily clusters them, then looks for the ends of the ITS region with Hmmsearch. HMM models are available for 18 different clades. itsxpress trim-pair-output-unmerged returns the unmerged, trimmed sequences.
qiime itsxpress trim-pair-output-unmerged \
--i-per-sample-sequences sequences.qza \
--p-region ITS1 \
--p-taxa F \
--o-trimmed trimmed.qza
qiime itsxpress trim-pair-output-unmerged \
--i-per-sample-sequences sequences.qza \
--p-region ITS1 \
--p-taxa F \
--p-cluster-id 1.0 \
--p-threads 2 \
--o-trimmed trimmed_exact.qza
Run time: 2 minutes 45 seconds
Use Dada2 to identify sequence variants
The trimmed sequences can be fed directly into Dada2 using the denoise-paired command. Since BBMerge handled the merging and quality issues there is no need to trim or truncate the reads further. In this tutorial we have set the truncation length to 0 because the data quality was good. Be sure to examine the sequences.qzv file before deciding to hard trim your reads.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs trimmed.qza \
--p-trunc-len-r 0 \
--p-trunc-len-f 0 \
--output-dir dada2out
qiime dada2 denoise-single \
--i-demultiplexed-seqs db_trimmed.qza \
--p-trunc-len-f 0 \
--output-dir dada2wrongout
Run time: 1 minute
Output:
- dada2out/denoising_stats.qza
- dada2out/representative_sequences.qza
- dada2out/table.qza
Summarize the data for visual inspection:
qiime feature-table summarize \
--i-table dada2out/table.qza \
--o-visualization tableviz.qzv
Run time: 4 seconds
Deblur is an alternative option for read correction. This tutorial uses Dada2 because Deblur requires uniform-length reads, specified by the --p-trim-length flag, and ITS regions vary considerably in length. In tests across a range of trim lengths, Deblur yielded fewer sequence variants.
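To see how much read length varies in your own data, the lengths in a FASTQ file of trimmed reads can be tabulated. The awk sketch below is one way to do that; the function name fastq_length_table is hypothetical, and it assumes uncompressed input with four lines per record. A wide spread of lengths means any single trim length would either discard the shorter reads or cut the longer ones.

```shell
# Print "length count" pairs, sorted by length, for a four-line-per-record
# FASTQ stream on standard input.
fastq_length_table() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' | sort -n
}

# Toy input: one read of length 6 and two reads of length 3.
printf '@a\nACGTAC\n+\nIIIIII\n@b\nACG\n+\nIII\n@c\nTTT\n+\nIII\n' | fastq_length_table
```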
Download reference data from UNITE for fungal classification
First download the newest UNITE database for QIIME and unzip the file.
wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
Import the latest UNITE data into QIIME 2:
Import the UNITE sequences for the smaller dataset selected with dynamic thresholds determined by fungal experts.
There has been discussion about whether trimming the database matters for classification. The QIIME team found that trimming the UNITE database does not result in better classification when untrimmed reads are used, and recommended using the untrimmed developer database. Since we are using the trimmed ITS region, this tutorial recommends using the trimmed database, but the two choices have not yet been systematically compared.
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path sh_refs_qiime_ver7_dynamic_01.12.2017.fasta \
--output-path unite.qza
Run time: 7 seconds
Import the associated UNITE taxonomy file.
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path sh_taxonomy_qiime_ver7_dynamic_01.12.2017.txt \
--output-path unite-taxonomy.qza
Run time: 4 seconds
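Before training a classifier it can be worth confirming that every reference sequence has a taxonomy entry. The sketch below assumes the taxonomy file is a headerless, tab-separated file with the sequence ID in the first column and that FASTA IDs end at the first whitespace; the function name ids_missing_taxonomy and the toy file contents are hypothetical.

```shell
# List sequence IDs that appear in a FASTA file but not in a headerless
# two-column (ID<TAB>taxonomy) file. Prints nothing when all IDs match.
ids_missing_taxonomy() {
    fasta="$1"
    taxonomy="$2"
    awk -F '\t' 'FILENAME == ARGV[1] { tax[$1] = 1; next }
                 /^>/ { id = substr($1, 2); sub(/[ \t].*/, "", id);
                        if (!(id in tax)) print id }' "$taxonomy" "$fasta"
}

# Toy demonstration with made-up IDs: SH2 has no taxonomy line.
demo_dir="$(mktemp -d)"
printf '>SH1\nACGT\n>SH2\nTTGA\n' > "$demo_dir/refs.fasta"
printf 'SH1\tk__Fungi\n' > "$demo_dir/tax.txt"
ids_missing_taxonomy "$demo_dir/refs.fasta" "$demo_dir/tax.txt"
rm -rf "$demo_dir"
```

With the UNITE files the call would take the FASTA file as the first argument and the taxonomy file as the second.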
Train the QIIME classifier
QIIME 2 provides its own naive Bayes classifier, similar to RDP, built on the Python package scikit-learn. Before using it, the classifier must be trained on the data you just imported.
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads unite.qza \
--i-reference-taxonomy unite-taxonomy.qza \
--o-classifier classifier.qza
Run time: 5 minutes
Classify the sequence variants
Once the classifier is trained, the sequence variants can be classified.
qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads dada2out/representative_sequences.qza \
--o-classification taxonomy.qza
Run time: 1.5 minutes
Summarize the results
Summarize the results for visualization in the QIIME 2 viewer.
qiime metadata tabulate \
--m-input-file taxonomy.qza \
--o-visualization taxonomy.qzv
Run time: 4 seconds
Create an interactive bar plot figure
qiime taxa barplot \
--i-table dada2out/table.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file mapping.txt \
--o-visualization taxa-bar-plots.qzv
Run time: 4 seconds
This tutorial provides the basic process for analyzing ITS sequences. The data is now in a form where it can be analyzed further using many of the other methods provided by QIIME 2.
Citation information for ITSxpress
- Rivers AR, Weber KC, Gardner TG et al. ITSxpress: Software to rapidly trim internally transcribed spacer sequences with quality scores for marker gene analysis [version 1; referees: awaiting peer review]. F1000Research 2018, 7:1418. doi: 10.12688/f1000research.15704.1
- ITSxpress software: DOI:10.5281/zenodo.1304348
- ITSxpress QIIME 2 plugin: DOI:10.5281/zenodo.1317578