Q2-ITSxpress: A tutorial on a QIIME 2 plugin to trim ITS sequences

Q2-ITSxpress: a QIIME 2 plugin to trim ITS sequences

Adam R. Rivers - USDA Agricultural Research Service

Background

The internally transcribed spacer (ITS) region a widely used phylogenetic marker for fungi and other taxa. Previous work by Nilsson et al. (2009) showed that removing the conserved regions around the ITS results in more accurate taxonomic classification. An existing program, ITSx, can trim FASTA sequences by matching HMM profiles to the ends of the flanking conserved genes. ITSxpress is designed to extend this technique to trim the FASTQ files needed for the newer exact sequence variant methods used by in QIIME 2: Dada2 and Deblur. ITSxpress processes QIIME artifacts of the type SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality].

The plugin:

  1. Merges reads (if paired-end) using BBMerge
  2. Temporarily clusters highly similar sequences that are common in amplicon data using VSEARCH
  3. Identifies the ITS start and stop sites using Hmmsearch on the representative sequences
  4. Trims each original, merged sequence with quality scores, returning the merged or unmerged sequences with quality scores in a .qza file

ITSxpress speeds up the trimming of reads by a factor of 14-23 times on a 4-core computer by temporarily clustering highly similar sequences that are common in amplicon data and utilizing optimized parameters for Hmmsearch. For more information see the paper.

ITSxpress is also available as a stand-alone software package from Github, PyPi and Bioconda.

Installation

The instructions assume that you installed QIIME 2 natively using Conda and are using ITSxpress version 1.7.0.

Activate the QIIME 2 Conda environment.

source activate qiime2-2018.8

Install ITSxpress using Bioconda and Q2-itsxpress using pip. Be sure to install ITSxpress and Q2-itsxpress in the QIIME 2 environment, meaning you ran the step above first.

conda install -c bioconda itsxpress
pip install q2-itsxpress

In your QIIME2 environment, refresh the plugins.

qiime dev refresh-cache

Check to see if the ITSxpress plugin is installed. After running this command you should see a basic help menu.

qiime itsxpress

Tutorial

Note: this tutorial was updated for ITSxpress 1.7.0 on 08/13/2018.
Recommendations about how to trim reads for use by Dada2 have changed.

This tutorial walks the user through the first portion of a typical ITS workflow:

  1. Trimming the ITS region with ITSxpress
  2. Calling sequence variants with Dada2 or Deblur
  3. Training the QIIME 2 classifier
  4. Classifying the sequences taxonomically

For this tutorial we will be starting with two paired-end samples than have already been demultiplexed into froward and reverse FASTQ files. A manifest file which lists the samples, files and read orientation is also used. The example manifest uses the $PWD variable to complete the path for your computer. If you have issues you can replace it with the direct path.

Example data

We will be using data from two soil samples which have have their ITS1 region amplified with fungal primers. They have been subsampled to 10,000 read pairs for faster processing.

If you have the command line program wget you can download the data with these commands

wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample1_r1.fq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample1_r2.fq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample2_r1.fq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample2_r2.fq.gz
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/manifest.txt
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/mapping.txt

Import the sequence data

Make sure all the data files are in the same directory, then import the data into QIIME.

This step in the tutorial imports demultiplexed data into QIIME.

NOTE: If you have multiplexed data in a format like EMPPairedEndSequences you will need to demultiplex it first using the demux plugin. For a paired-end example see example see this tutorial.

qiime tools import \
  --type SampleData[PairedEndSequencesWithQuality] \
  --input-format PairedEndFastqManifestPhred33\
  --input-path manifest.txt \
  --output-path sequences.qza

Run time: 4 seconds

We can see the quality of the data by running the summarize command.

qiime demux summarize \
  --i-data sequences.qza \
  --o-visualization sequences.qzv

Run time: 4 seconds

Trimming ITS samples with Q2-ITSxpress for Dada2

ITSxpress trim-pair-output-unmerged takes paired-end QIIME artifacts
SampleData[PairedEndSequencesWithQuality] for
trimming. It merges the reads, temporally clusters the reads, then looks for
the ends of the ITS region with Hmmsearch. HMM models are available for 18
different clades. itsxpress trim-pair-output-unmerged returns the unmerged, trimmed sequences.

qiime itsxpress trim-pair-output-unmerged\
  --i-per-sample-sequences sequences.qza \
  --p-region ITS1 \
  --p-taxa F \
  --o-trimmed trimmed.qza
qiime itsxpress trim-pair-output-unmerged\
  --i-per-sample-sequences sequences.qza \
  --p-region ITS1 \
  --p-taxa F \
  --p-cluster-id 1.0 \
  --p-threads 2 \
  --o-trimmed trimmed_exact.qza

Run time: 2 minutes 45 seconds

Use Dada2 to identify sequence variants

The trimmed sequences can be fed directly into Dada2 using the denoise-paired
command. Since BBmerge handled the merging and quality issues there is no need to trim or truncate the reads further. In this tutorial we have set a truncation length \ to 0 because the data quality was good. Be sure to examine the sequences.qzv file before deciding to hard trim your reads.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed.qza \
  --p-trunc-len-r 0 \
  --p-trunc-len-f 0 \
  --output-dir dada2out
qiime dada2 denoise-single \
  --i-demultiplexed-seqs db_trimmed.qza \
  --p-trunc-len-f 0 \
  --output-dir dada2wrongout

Run time: 1 minute

  • Output:

    1. dada2out/denoising_stats.qza View | Download
    2. dada2out/representative_sequences.qza View | Download
    3. dada2out/table.qza View | Download

    Summarize the data for visual inspection:

    qiime feature-table summarize \
      --i-table dada2out/table.qza \
      --o-visualization tableviz.qzv
    

    Run time: 4 seconds

Deblur is an alternative option for read correction. This tutorial uses Dada2 Because deblur requires uniform length reads, specified by the --p-trim-length flags, and ITS regions vary considerably in length. Tests across a range of trim lengths using Deblur yielded fewer sequence variants.

Download reference data from UNITE for fungal classification

First download the newest UNITE database for QIIME and unzip the file.

wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip

Import the latest UNITE data into QIIME 2:

Import the UNITE sequences for the smaller dataset selected with dynamic thresholds determined by fungal experts.

There has been discussion about whether trimming the database matters for classification. The QIIME team found that trimming the UNITE database does not result in better classification when untrimmed reads are used and recommended using the untrimmed developer database. Since we are using the trimmed ITS region, this tutorial recommends using the trimmed database but this has not yet been systematically compared.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path sh_refs_qiime_ver7_dynamic_01.12.2017.fasta \
  --output-path unite.qza

Run time 7 seconds

Import the associated UNITE taxonomy file.

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path sh_taxonomy_qiime_ver7_dynamic_01.12.2017.txt \
  --output-path unite-taxonomy.qza

Run time 4 seconds

Train the QIIME classifier

QIIME provides its own naive Bayes classifier similar to RDP from the python package SciKit Learn. Before using it the classifier must be trained using the data you just imported.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads unite.qza \
  --i-reference-taxonomy unite-taxonomy.qza \
  --o-classifier classifier.qza

Run time: 5 minutes

Classify the sequence variants

Once the classifier is trained sequences can be classified.

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads dada2out/representative_sequences.qza \
  --o-classification taxonomy.qza

Run time: 1.5 minutes

Summarize the results

Summarize the results for visualization in the QIIME 2 viewer.

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Run time: 4 seconds

Create an interactive bar plot figure

qiime taxa barplot \
  --i-table dada2out/table.qza  \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file mapping.txt \
  --o-visualization taxa-bar-plots.qzv

Run time: 4 seconds

This tutorial provides the basic process for analyzing ITS sequences. The data is now in a form where it can be analyzed further using many of the other methods provided by QIIME 2.

Citation information for ITSxpress

16 Likes

Thank you for the excellent tutorial! When trying to install ITxpress within my Q2 environment I get this error:

(qiime2-2018.6) mpg-bullington-mba-2:~ lbullington$ conda install q2-itsxpress

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

- q2-itsxpress

Current channels:

- bioconda/osx-64

- bioconda/noarch

- main/osx-64

- main/noarch

- Anaconda packages for MacOSX x86_64 (64-bit)

- Anaconda packages (noarch)

- r/osx-64

- r/noarch

- Anaconda extras for MacOSX x86_64 (64-bit)

- Anaconda extras (noarch)

To search for alternate channels that may provide the conda package you're

looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

Any suggestions?

Looks like bioconda didn’t get added correctly. Try:

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Thanks for your time on this. :slight_smile: I have uninstalled and re-installed qiime2, and I ran the commands you suggested but still get a similar error:

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

- q2-itsxpress
Current channels:

- conda-forge/osx-64

- conda-forge/noarch

- bioconda/osx-64

- bioconda/noarch

- main/osx-64

- main/noarch

- Anaconda packages for MacOSX x86_64 (64-bit)

- Anaconda packages (noarch)

- r/osx-64

- r/noarch

- Anaconda extras for MacOSX x86_64 (64-bit)

- Anaconda extras (noarch)

Sorry, I made a mistake in the install instructions but I just updated the tutorial to correct the issue. The correct installation commands are:

conda config --add channels bioconda
conda install itsxpress
pip install q2-itsxpress
1 Like

This install works fine for me now. The plugin however, doesn’t appear to be compatible with ‘EMPPairedEndSequences’ type files as produced by the following command:

qiime tools import
–type EMPPairedEndSequences
–input-path raw-seqs
–output-path paired-end-sequences.qza

Where I get this error:

Plugin error from itsxpress:

Argument to parameter ‘per_sample_sequences’ is not a subtype of SampleData[PairedEndSequencesWithQuality].

Debug info has been saved to /var/folders/x4/jr46ngl90px4376_dbktnc6jmk70ht/T/qiime2-q2cli-err-sv581_bv.log

Is this something that this plugin will accept in the future? I am happy to help you troubleshoot things as much as I am able, as I think this is a valuable tool for QIIME (and me :)).

Cheers,

–Lorinda

@Lorinda --- I am not involved with the development of this plugin, but it looks like you are trying to use multiplexed sequences where demultiplexed sequences are expected.

Use demux emp-paired to demux, then provide that is input to itsexpress.

Again, I am not involved with the development of this plugin, but I would suggest that maybe that kind of functionality doesn't make sense in the context of this plugin. I think the general design idea here in QIIME 2 is to compose modular building blocks of functionality. Each plugin/method should do one thing, and one thing well.

Ah I see, I was just incorrectly adopting this to my fit my protocol. It seems to work using the demultiplexed file. Thanks!

1 Like

Thanks @thermokarst, I’ll add an explanatory note on the tutorial explaining that demultiplexed sequences are needed.

2 Likes

@Adam_Rivers, thanks a lot for your work on q2-itsxpress and for this tutorial! I worked through it and I have a few suggestions for improvements. There are only two changes that I think might be critical to make as soon as possible (noted as critical below) - the rest you should just think of as suggestions that you can take or leave.

  1. Would it be possible to provide wget or curl commands for tutorial file downloads? That will help us to automate testing of this tutorial in the future, which will help you to ensure that the commands in the tutorial never become out of date (e.g., because of changing interfaces or broken file download links).

  2. I recommend replacing or expanding your text “paired-end sequences with quality” or “sequences with quality” with literal blocks containing the names of these types (i.e., SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality]) as that may be more clear for users.

  3. I recommend replacing your conda config and conda install command with conda install -c bioconda itsxpress. This will then only impact the current installation, and doesn't impact what channels conda will look for at other times (e.g., when the user is using conda to install something unrelated).

  4. It looks like you don't have citations officially associated with your plugin:

    (qiime2-2018.6) 13:25:23 temp$ qiime itsxpress --citations
    No citations found.
    (qiime2-2018.6) 13:25:38 temp$ qiime itsxpress trim-pair --citations
    No citations found.
    

    See here and here for an example of how to add a citation to your plugin. Without this information, users will have a harder time knowing how to cite your work.

  5. I've attached a manifest file (manifest.txt (170 Bytes)) that you could include in your tutorial that won't require the user to make edits before importing if the manifest.txt is in the same directory as the fastq files.

  6. This one is really probably just personal preference, but I recommend making all of the commands in your tutorial single-threaded only (i.e., drop --p-threads 2 and any similar parameters from commands that use them). Since that parameter is a little bit dependent on the environment where the command is running, and since it's not required, I usually just opt to not specify options for parallel processing in tutorials.

  7. critical The merged sequences can be fed directly into Dada2 using the denoise-single command. I don't think this is correct. While this will technically execute I think you're violating assumptions that DADA2 is making about the error profiles, so DADA2 itself won't work as expected. The output semantic type for the trim-pair method should instead be SampleData[JoinedSequencesWithQuality]. These data can be passed to methods which accept this input such as qiime quality-filter q-score-joined or qiime deblur denoise-other, but my understanding is that it is technically incorrect to pass pre-joined data to DADA2. (If you could trim but not join the reads, and output a SampleData[PairedEndSequencesWithQuality], than that should be ok to pass to qiime dada2 denoise-paired.)

  8. critical SampleData[JoinedSequencesWithQuality] should not be an allowed input type to trim-single. This is a bit annoying, and something we're going to fix soon, but because you have to specify the output type as SampleData[SequencesWithQuality], the information that these reads have been joined would be lost in the process of executing trim-single. In the future, methods will support mapping of input types onto output types so this information could be preserved. In the meantime, I recommend just not allowing the user to pass pre-joined sequences to the itsxpress methods, since they can join either during or after running itsxpress.

  9. Looks like you left time in one of your commands: time qiime dada2 denoise-single.

Thanks again for this contribution @Adam_Rivers! We're very excited to have this a part of the QIIME 2 ecosystem! Please let me know if you have any questions about this.

6 Likes

Hello Adam_Rivers
when I used your data ,I got a mistake .
when I use “deblur denoise-16S” to get “table.qza ;rep-seqs.qza” but the two files are empty. but it seems not because of the data

I'm not the Deblur developer, but It looks like qiime deblur deblur-16S is designed for 16S sequences not ITS sequences. You can follow the tutorial and use Dada2 or use qiime deblur denoise-other and provide it an appropriate ITS positive filter file.

The help for this command says.

Usage: qiime deblur denoise-16S [OPTIONS]

Perform sequence quality control for Illumina data using the Deblur
workflow with a 16S reference as a positive filter. Only forward reads are
supported at this time. The specific reference used is the 88% OTUs from
Greengenes 13_8. This mode of operation should only be used when data were
generated from a 16S amplicon protocol on an Illumina platform. The
reference is only used to assess whether each sequence is likely to be 16S
by a local alignment using SortMeRNA with a permissive e-value; the
reference is not used to characterize the sequences.

1 Like

Thanks for reviewing @gregcaporaso . I'll work through the comments. I did have questions about the output data types in point 7.

I was looking at q2_dada2's input types and it looks like:

I don't see anything about the type SampleData[JoinedSequencesWithQuality] being accepted by dada2 is it a subtype or something?

If I did change types, can I just change my data output type from SampleData[SequencesWithQuality] to SampleData[JoinedSequencesWithQuality] without any additional changes?

I think that Dada2 has the ability to learn error rates from single ended data as well as paired end data now, but I don't know the method. @benjjneb can you provide guidance on this? How are error profiles calculated in denoise_single ? Will merged scores interfere with the error rate estimation procedures?

Merging by BBMerge does change the quality scores since most positions are verified by two reads. The score change is shown for one ITS1 sample here:

Changing to a paied end output requires a major rewrite of ITSxpress so I'd like to explore other options first.

I'm not quire sure I understand point 8, but I can remove that input type. Do I then need to add a third command for SampleData[JoinedSequencesWithQuality] that outputs the same type?

Adam

In short, DADA2 does not recommend processing pre-merged reads, because the different regions of pre-merged reads (Forward-only, overlapping, reverse-only) have different relationships between the assigned quality scores and the error rates, which can lead to false-positive ASV inference. You can see more discussion of this here: Fungal ITS pre-processing suggestion ¡ Issue #327 ¡ benjjneb/dada2 ¡ GitHub

This is a bit annoying, and we would like to support pre-merged reads, but when we evaluated this possibility again recently we were not able to attain the same accuracy on such reads as we could in our recommended merge-later workflow.

Oh, that’s a bummer. I saw that Dada2 added single-end support so I went forward with using merging in ITSxpress based on my bad assumption that merged reads could be used equally well.

One solution to the issue could be to use unsupervised HMM training to estimate an emission and transition matrix for the merged and unmerged regions based on the pattern of quality scores. Then the Hmm could be applied to segment the reads and learn three different error rates. It’s not trivial though.

How are error rates learned for unpaired sequences since they cannot be merged? Are similar reads clustered then compared?

I wanted to follow up with a question about Deblur for ITS. @wasade and @gregcaporaso, in general, what are your thoughts on the appropriateness of using merged data in Deblur? How does Deblur handle merged data and does merging impact the performance of the Deblur denoise-other algorithm? Also what is an appropriate positive filter file for ITS regions using denoise-other?

Does ITSxpress work on single-end reads? If so, why not use it to trim ITS in forward/reverse separately, and then denoise with dada2?

deblur can handle pre-merged reads — actually, paired-end reads must be joined prior to passing to q2-deblur.

This is just to perform a rough positive filter. I've used the UNITE sequences clustered at 97% (mostly because pre-clustered seqs existed at that level) but you could probably go lower. For 16S I think the greengenes 88% OTUs are used.

1 Like