Q2-ITSxpress: A tutorial on a QIIME 2 plugin to trim ITS sequences

ITSxpress: a QIIME 2 plugin to trim ITS sequences

Adam R. Rivers - USDA Agricultural Research Service

Sveinn V. Einarsson - Dep. Microbiology and Cell Science, U. Florida & USDA Agricultural Research Service

Background

The internally transcribed spacer (ITS) region a widely used phylogenetic marker for fungi and other taxa. Previous work by Nilsson et al. (2009) showed that removing the conserved regions around the ITS results in more accurate taxonomic classification. An existing program, ITSx, can trim FASTA sequences by matching HMM profiles to the ends of the flanking conserved genes. ITSxpress is designed to extend this technique to trim the FASTQ files needed for the newer exact sequence variant methods used by in QIIME 2: Dada2 and Deblur. ITSxpress processes QIIME artifacts of the type SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality].

The plugin:

  1. Merges reads (if paired-end) using VSEARCH
  2. Temporarily clusters highly similar sequences that are common in amplicon data also using VSEARCH
  3. Identifies the ITS start and stop sites using Hmmsearch on the representative sequences
  4. Trims each original sequence with quality scores, returning the single or paired-end reads with quality scores in a .qza file

ITSxpress speeds up the trimming of reads by a factor of 14-23 times on a 4-core computer by temporarily clustering highly similar sequences that are common in amplicon data and utilizing optimized parameters for Hmmsearch. For more information see the paper.

ITSxpress is installed as a standalone package with the Qiime2 ITSxpress plugin, IF Qiime2 is available. Package can be installed fromBioconda and Github.

PIP (PyPI) is no longer maintained for ITSxpress>=v2.0.0.

Installation

The instructions assume that you installed QIIME 2 natively using Mamba (or Conda) and are using ITSxpress version 2.0.0 or greater. Please see specific operating system instructions for your computer.

Activate the QIIME 2 Conda environment.

mamba activate qiime2-2024.2

Install ITSxpress using Bioconda. Be sure to install ITSxpress in the QIIME 2 environment, meaning you ran the steps above first. ITSxpress no longer has a seperate Qiime plugin, it is included with the standalone version IF you already have Qiime2 installed in your environment.

mamba install -c bioconda itsxpress

If you have Apple Silicon (M1,M2, etc.) you'll have to do the following to use Rosetta translation:

CONDA_SUBDIR=osx-64 mamba create -n itsxpressenv -c bioconda -c conda-forge itsxpress
mamba activate itsxpressenv
conda  config --env --set subdir osx-64

In your QIIME2 environment, refresh the plugins.

qiime dev refresh-cache

Check to see if the ITSxpress plugin is installed. After running this command you should see a basic help menu.

qiime itsxpress

Note: this tutorial was updated for ITSxpress 2.1.0 on 4/15/2024.

This tutorial walks the user through the first portion of a typical ITS workflow:

  1. Trimming the ITS region with ITSxpress
  2. Calling sequence variants with Dada2 or Deblur
  3. Training the QIIME 2 classifier
  4. Classifying the sequences taxonomically

For this tutorial we will be starting with two paired-end samples than have already been demultiplexed into forward and reverse FASTQ files. A manifest file which lists the samples, files and read orientation is also used. The example manifest uses the $PWD variable to complete the path for your computer. If you have issues you can replace it with the direct path.

Below the following tutorial you can find an experimental walkthrough of Pacbio ITS analysis.

Example data

We will be using data from two soil samples which have have their ITS2 region amplified with fungal primers. They have been subsampled to 10,000 read pairs for faster processing.

If you have the command line program wget you can download the data with these commands

wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample1_r1.fastq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample1_r2.fastq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample2_r1.fastq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/sample2_r2.fastq.gz
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/manifest.txt
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/mapping.txt

Import the sequence data

Make sure all the data files are in the same directory, then import the data into QIIME.

This step in the tutorial imports demultiplexed data into QIIME.

NOTE: If you have multiplexed data in a format like EMPPairedEndSequences you will need to demultiplex it first using the demux plugin. For a paired-end example see example see this tutorial.

qiime tools import \
  --type SampleData[PairedEndSequencesWithQuality] \
  --input-format PairedEndFastqManifestPhred33\
  --input-path manifest.txt \
  --output-path sequences.qza

Run time: 5 seconds

We can see the quality of the data by running the summarize command.

qiime demux summarize \
  --i-data sequences.qza \
  --o-visualization sequences.qzv

Run time: 9 seconds

Trimming ITS samples with ITSxpress for Dada2

ITSxpress trim-pair-output-unmerged takes paired-end QIIME artifacts
SampleData[PairedEndSequencesWithQuality] for
trimming. It merges the reads, temporally clusters the reads, then looks for
the ends of the ITS region with Hmmsearch. HMM models are available for 18
different clades. itsxpress trim-pair-output-unmerged returns the unmerged, trimmed sequences. itsxpress trim-pair-output-merged returns merged, trimmed sequences. You can adjust the --p-cluster-id value, which is the percent identity for clustering reads range [0.995-1.0], set to 1 for exact de-replication.

qiime itsxpress trim-pair-output-unmerged\
  --i-per-sample-sequences sequences.qza \
  --p-region ITS2 \
  --p-taxa F \
  --p-cluster-id 1.0 \
  --p-threads 16 \
  --o-trimmed trimmed_exact.qza

Run time: 23 seconds

The following command clusters 99.5% sequence similarity. We recommend clustering to 100% similarity, as the speed benefit is negligible.

qiime itsxpress trim-pair-output-unmerged\
  --i-per-sample-sequences sequences.qza \
  --p-region ITS2 \
  --p-taxa F \
  --p-cluster-id 0.995 \
  --p-threads 16 \
  --o-trimmed trimmed.qza

Run time: 15 seconds

Use Dada2 to identify sequence variants

The trimmed sequences can be fed directly into Dada2 using the denoise-paired
command. Since Vsearch handled the merging and quality issues there is no need to trim or truncate the reads further. In this tutorial we have set a truncation length \ to 0 because the data quality was good. Be sure to examine the sequences.qzv file before deciding to hard trim your reads.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed_exact.qza \
  --p-trunc-len-r 0 \
  --p-trunc-len-f 0 \
  --output-dir dada2out

Run time: 10 seconds

  • Output:

    1. dada2out/denoising_stats.qza View | Download
    2. dada2out/representative_sequences.qza View | Download
    3. dada2out/table.qza View | Download

    Summarize the data for visual inspection:

    qiime feature-table summarize \
      --i-table dada2out/table.qza \
      --o-visualization tableviz.qzv
    

    Run time: 5 seconds

Deblur is an alternative option for read correction. This tutorial uses Dada2 Because deblur requires uniform length reads, specified by the --p-trim-length flags, and ITS regions vary considerably in length. Tests across a range of trim lengths using Deblur yielded fewer sequence variants.

Download reference data from UNITE for fungal classification

First download the newest UNITE database for QIIME and unzip the file.

wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip

Import the latest UNITE data into QIIME 2:

Import the UNITE sequences for the smaller dataset selected with dynamic thresholds determined by fungal experts.

There has been discussion about whether trimming the database matters for classification. The QIIME team found that trimming the UNITE database does not result in better classification when untrimmed reads are used and recommended using the untrimmed developer database. Since we are using the trimmed ITS region, this tutorial recommends using the trimmed database but this has not yet been systematically compared.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path sh_refs_qiime_ver9_dynamic_29.11.2022.fasta \
  --output-path unite.qza

Run time 10 seconds

Import the associated UNITE taxonomy file.

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path sh_taxonomy_qiime_ver9_dynamic_29.11.2022.txt \
--output-path unite-taxonomy.qza

Run time 5 seconds

Train the QIIME classifier

QIIME provides its own naive Bayes classifier similar to RDP from the python package SciKit Learn. Before using it the classifier must be trained using the data you just imported.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads unite.qza \
  --i-reference-taxonomy unite-taxonomy.qza \
  --o-classifier classifier.qza

Run time: 19 minutes

Note: If you come accross an error saying "[Errno 28] No space left on device". You may need to empty your qiime2 temp directory. See discussion here: Temp directory space issue

Classify the sequence variants

Once the classifier is trained sequences can be classified.

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads dada2out/representative_sequences.qza \
  --o-classification taxonomy.qza

Run time: 49 seconds

Summarize the results

Summarize the results for visualization in the QIIME 2 viewer.

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Run time: 4 seconds

Create an interactive bar plot figure

qiime taxa barplot \
  --i-table dada2out/table.qza  \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file mapping.txt \
  --o-visualization taxa-bar-plots.qzv

Run time: 4 seconds

This tutorial provides the basic process for analyzing ITS sequences. The data is now in a form where it can be analyzed further using many of the other methods provided by QIIME 2.

Updates to ITSxpress v2.x.x from v1.x.x

ITSxpress v2.0.0 is a major update to the ITSxpress package.
Release Highlights

  • pip (PyPI) is no longer maintained for ITSxpress>=v2.0.0

    • Qiime2 plugin version of ITSxpress is now part of the standalone package of ITSxpress
    • Qiime2 needs to be installed first
      - During ITSxpress installation a check is performed to see if Qiime2 is installed. The Qiime2 ITSxpress plugin is installed IF Qiime2 is installed.
    • Seperate Qiime2 plugin version (PyPI) of ITSxpress is no longer maintained after q2-itsxpress v1.8.1 and should not be installed unless
    • Package can be installed from Github, and Bioconda
  • Removed BBmap dependency

    • BBmap scripts are no longer used in the pipeline, including:
      • reformat.sh (interleaved files no longer supported)
      • bbmerge.sh (merging of paired-end reads now done with Vsearch --fastq_mergepairs)
        • merging of paired-end reads is different between Vsearch and BBmerge, so results may differ
  • Updated dereplication step for newer versions of Vsearch

    • Dereplication step now done using Vsearch --fastx_uniques (derep_fulllength command no longer supports fastq files)
  • Pacbio sequences are now supported if fastq file scores are in Illumina format

Bug Fixes

  • Fixed bug where the q2-itsxpress plugin was not handling single-end reads correctly, and was looking for a reverse read file
  • Fixed a bug that could cause crashes when an intermediate file was empty

Previous release highlights since tutorial was written:

  • 1.8.1 is the final version that uses BBmap scripts. This version is still available on the EOL-1.8.1 branch of the ITSxpress repository
  • Fixed version of dependencies for version 1.8.1, to maintain compatibility with Qiime2 2022.8
  • Updated pip install config file to pyproject.toml
  • Updated readme usage section to reference compatible Qiime2 version
  • Added read count output to log file
  • Added support for primer sets in the reverse orientation

BannerForTutorial

Below is a tutorial for Pacbio data. However, due to the variable nature of PACBIO read scoring and file types this is not maintained. This may be revisted in the future.

You can find the paper this test sample is from here: [Runnel et al. 2022] (https://doi.org/10.1111/1755-0998.13663)

Example Pacbio data

We will be using data from amplifying fungal soil samples which have have the entire ITS1/5.8S/ITS2 region amplified with fungal primers.

If you have the command line program wget you can download the data with these commands

wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/pacbio/pacbio_samp1.fastq.gz
wget https://github.com/USDA-ARS-GBRU/itsxpress-tutorial/raw/master/data/pacbio/pacbio_samp1_reformat.fastq.gz
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/pacbio/manifest_pacbio.txt
wget https://raw.githubusercontent.com/USDA-ARS-GBRU/itsxpress-tutorial/master/data/pacbio/mapping_pacbio.txt

Import the sequence data

Make sure all the data files are in the same directory, then import the data into QIIME.

This step in the tutorial imports demultiplexed data into QIIME.

NOTE: If you have multiplexed data in a format like EMPPairedEndSequences you will need to demultiplex it first using the demux plugin. For a paired-end example see example see this tutorial.

NOTE: The fastq file from Pacbio has had the Pacbio scoring reformated to Illumina scoring convention, using bbmap reformat.sh:

reformat.sh in=./pacbio/pacbio_samp1.fastq.gz out=pacbio_samp1_reformat.fastq.gz mincalledquality=2 maxcalledquality=41 qin=33
qiime tools import \
--input-path ./pacbio/manifest_pacbio.txt \
--input-format SingleEndFastqManifestPhred33 \
--type SampleData[SequencesWithQuality] \
--output-path ./pacbio/sequences_pacbio.qza

Run time: 5 seconds

We can see the quality of the data by running the summarize command.

qiime demux summarize   \
--i-data ./pacbio/sequences_pacbio.qza   \
--o-visualization ./pacbio/sequences_pacbio.qzv

Run time: 9 seconds

Trimming ITS samples with ITSxpress for Dada2

Since Pacbio long range sequencing allows for amplification of the entire ITS1/5.8S/ITS2 region, you can change the --p-region command below between ITS1, ITS2, and ALL to trim to the corresponding region and see the difference in taxonomic output.

qiime itsxpress trim-single \
--i-per-sample-sequences ./pacbio/sequences_pacbio.qza \
--p-region ITS2 \
--p-taxa F \
--p-cluster-id 1.0 \
--p-threads 16 \
--o-trimmed ./pacbio/trimmed_pacbio.qza

Run time: 15 seconds

Use Dada2 to identify sequence variants

The trimmed sequences can be fed directly into Dada2 using the denoise-single
command. Be sure to examine the sequences_pacbio.qzv file before deciding to hard trim your reads.

qiime dada2 denoise-single \
--i-demultiplexed-seqs ./pacbio/trimmed_pacbio.qza \
--p-trunc-len 0 \
--p-n-threads 16 \
--p-max-ee 20 \
--output-dir ./pacbio/dada2out

Run time: 10 seconds

  • Output:

    1. dada2out/denoising_stats.qza View | Download
    2. dada2out/representative_sequences.qza View | Download
    3. dada2out/table.qza View | Download

    Summarize the data for visual inspection:

    qiime feature-table summarize \
      --i-table ./pacbio/dada2out/table.qza \
      --o-visualization ./pacbio/tableviz.qzv
    

    Run time: 5 seconds

Deblur is an alternative option for read correction. This tutorial uses Dada2 Because deblur requires uniform length reads, specified by the --p-trim-length flags, and ITS regions vary considerably in length. Tests across a range of trim lengths using Deblur yielded fewer sequence variants.

The following section is exactly the same as the above tutorial using paired-end example data. Feel free to use classifier developed above.

Download reference data from UNITE for fungal classification

First download the newest UNITE database for QIIME and unzip the file.

wget https://files.plutof.ut.ee/doi/0A/0B/0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip
unzip 0A0B25526F599E87A1E8D7C612D23AF7205F0239978CBD9C491767A0C1D237CC.zip

Import the latest UNITE data into QIIME 2:

Import the UNITE sequences for the smaller dataset selected with dynamic thresholds determined by fungal experts.

There has been discussion about whether trimming the database matters for classification. The QIIME team found that trimming the UNITE database does not result in better classification when untrimmed reads are used and recommended using the untrimmed developer database. Since we are using the trimmed ITS region, this tutorial recommends using the trimmed database but this has not yet been systematically compared.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path sh_refs_qiime_ver9_dynamic_29.11.2022.fasta \
  --output-path ./pacbio/unite.qza

Run time 10 seconds

Import the associated UNITE taxonomy file.

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path ./pacbio/sh_taxonomy_qiime_ver9_dynamic_29.11.2022.txt \
--output-path ./pacbio/unite-taxonomy.qza

Run time 5 seconds

Train the QIIME classifier

QIIME provides its own naive Bayes classifier similar to RDP from the python package SciKit Learn. Before using it the classifier must be trained using the data you just imported.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ./pacbio/unite.qza \
  --i-reference-taxonomy ./pacbio/unite-taxonomy.qza \
  --o-classifier ./pacbio/classifier.qza

Run time: 19 minutes

Note: If you come accross an error saying "[Errno 28] No space left on device". You may need to empty your qiime2 temp directory. See discussion here: Temp directory space issue

Classify the sequence variants

Once the classifier is trained sequences can be classified.

qiime feature-classifier classify-sklearn \
  --i-classifier ./pacbio/classifier.qza \
  --i-reads ./pacbio/dada2out/representative_sequences.qza \
  --o-classification ./pacbio/taxonomy.qza

Run time: 49 seconds

Summarize the results

Summarize the results for visualization in the QIIME 2 viewer.

qiime metadata tabulate \
  --m-input-file ./pacbio/taxonomy.qza \
  --o-visualization ./pacbio/taxonomy.qzv

Run time: 4 seconds

Create an interactive bar plot figure

qiime taxa barplot \
  --i-table ./pacbio/dada2out/table.qza  \
  --i-taxonomy ./pacbio/taxonomy.qza \
  --m-metadata-file ./pacbio/mapping_pacbio.txt \
  --o-visualization ./pacbio/taxa-bar-plots.qzv

Run time: 4 seconds

The benefit of long read Pacbio analysis is the ability to see differences in taxonomy when trimming to different regions. The example above trimmed to the ITS1 region in the ITSxpress trim-single command, but you can do the same with the ITS2 and ALL regions and compare the taxa-bar-plots:

ITS2

ALL

You'll see that including more conserved regions with the ALL trim command in ITSxpress, some of the finer scale variability is lost. This tutorial provides the basic process for analyzing ITS sequences. The data is now in a form where it can be analyzed further using many of the other methods provided by QIIME 2.

Citation information for ITSxpress

  • Rivers AR, Weber KC, Gardner TG et al. ITSxpress: Software to rapidly trim internally transcribed spacer sequences with quality scores for marker gene analysis [version 1; referees: awaiting peer review]. F1000Research 2018, 7:1418
    doi: 10.12688/f1000research.15704.1

  • ITSxpress software: DOI:10.5281/zenodo.1304348

17 Likes

Thank you for the excellent tutorial! When trying to install ITxpress within my Q2 environment I get this error:

(qiime2-2018.6) mpg-bullington-mba-2:~ lbullington$ conda install q2-itsxpress

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

- q2-itsxpress

Current channels:

- bioconda/osx-64

- bioconda/noarch

- main/osx-64

- main/noarch

- Anaconda packages for MacOSX x86_64 (64-bit)

- Anaconda packages (noarch)

- r/osx-64

- r/noarch

- Anaconda extras for MacOSX x86_64 (64-bit)

- Anaconda extras (noarch)

To search for alternate channels that may provide the conda package you're

looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

Any suggestions?

Looks like bioconda didn’t get added correctly. Try:

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Thanks for your time on this. :slight_smile: I have uninstalled and re-installed qiime2, and I ran the commands you suggested but still get a similar error:

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

- q2-itsxpress
Current channels:

- conda-forge/osx-64

- conda-forge/noarch

- bioconda/osx-64

- bioconda/noarch

- main/osx-64

- main/noarch

- Anaconda packages for MacOSX x86_64 (64-bit)

- Anaconda packages (noarch)

- r/osx-64

- r/noarch

- Anaconda extras for MacOSX x86_64 (64-bit)

- Anaconda extras (noarch)

Sorry, I made a mistake in the install instructions but I just updated the tutorial to correct the issue. The correct installation commands are:

conda config --add channels bioconda
conda install itsxpress
pip install q2-itsxpress
1 Like

This install works fine for me now. The plugin however, doesn’t appear to be compatible with ‘EMPPairedEndSequences’ type files as produced by the following command:

qiime tools import
–type EMPPairedEndSequences
–input-path raw-seqs
–output-path paired-end-sequences.qza

Where I get this error:

Plugin error from itsxpress:

Argument to parameter ‘per_sample_sequences’ is not a subtype of SampleData[PairedEndSequencesWithQuality].

Debug info has been saved to /var/folders/x4/jr46ngl90px4376_dbktnc6jmk70ht/T/qiime2-q2cli-err-sv581_bv.log

Is this something that this plugin will accept in the future? I am happy to help you troubleshoot things as much as I am able, as I think this is a valuable tool for QIIME (and me :)).

Cheers,

–Lorinda

@Lorinda --- I am not involved with the development of this plugin, but it looks like you are trying to use multiplexed sequences where demultiplexed sequences are expected.

Use demux emp-paired to demux, then provide that is input to itsexpress.

Again, I am not involved with the development of this plugin, but I would suggest that maybe that kind of functionality doesn't make sense in the context of this plugin. I think the general design idea here in QIIME 2 is to compose modular building blocks of functionality. Each plugin/method should do one thing, and one thing well.

Ah I see, I was just incorrectly adopting this to my fit my protocol. It seems to work using the demultiplexed file. Thanks!

1 Like

Thanks @thermokarst, I’ll add an explanatory note on the tutorial explaining that demultiplexed sequences are needed.

2 Likes

@Adam_Rivers, thanks a lot for your work on q2-itsxpress and for this tutorial! I worked through it and I have a few suggestions for improvements. There are only two changes that I think might be critical to make as soon as possible (noted as critical below) - the rest you should just think of as suggestions that you can take or leave.

  1. Would it be possible to provide wget or curl commands for tutorial file downloads? That will help us to automate testing of this tutorial in the future, which will help you to ensure that the commands in the tutorial never become out of date (e.g., because of changing interfaces or broken file download links).

  2. I recommend replacing or expanding your text “paired-end sequences with quality” or “sequences with quality” with literal blocks containing the names of these types (i.e., SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality]) as that may be more clear for users.

  3. I recommend replacing your conda config and conda install command with conda install -c bioconda itsxpress. This will then only impact the current installation, and doesn't impact what channels conda will look for at other times (e.g., when the user is using conda to install something unrelated).

  4. It looks like you don't have citations officially associated with your plugin:

    (qiime2-2018.6) 13:25:23 temp$ qiime itsxpress --citations
    No citations found.
    (qiime2-2018.6) 13:25:38 temp$ qiime itsxpress trim-pair --citations
    No citations found.
    

    See here and here for an example of how to add a citation to your plugin. Without this information, users will have a harder time knowing how to cite your work.

  5. I've attached a manifest file (manifest.txt (170 Bytes)) that you could include in your tutorial that won't require the user to make edits before importing if the manifest.txt is in the same directory as the fastq files.

  6. This one is really probably just personal preference, but I recommend making all of the commands in your tutorial single-threaded only (i.e., drop --p-threads 2 and any similar parameters from commands that use them). Since that parameter is a little bit dependent on the environment where the command is running, and since it's not required, I usually just opt to not specify options for parallel processing in tutorials.

  7. critical The merged sequences can be fed directly into Dada2 using the denoise-single command. I don't think this is correct. While this will technically execute I think you're violating assumptions that DADA2 is making about the error profiles, so DADA2 itself won't work as expected. The output semantic type for the trim-pair method should instead be SampleData[JoinedSequencesWithQuality]. These data can be passed to methods which accept this input such as qiime quality-filter q-score-joined or qiime deblur denoise-other, but my understanding is that it is technically incorrect to pass pre-joined data to DADA2. (If you could trim but not join the reads, and output a SampleData[PairedEndSequencesWithQuality], than that should be ok to pass to qiime dada2 denoise-paired.)

  8. critical SampleData[JoinedSequencesWithQuality] should not be an allowed input type to trim-single. This is a bit annoying, and something we're going to fix soon, but because you have to specify the output type as SampleData[SequencesWithQuality], the information that these reads have been joined would be lost in the process of executing trim-single. In the future, methods will support mapping of input types onto output types so this information could be preserved. In the meantime, I recommend just not allowing the user to pass pre-joined sequences to the itsxpress methods, since they can join either during or after running itsxpress.

  9. Looks like you left time in one of your commands: time qiime dada2 denoise-single.

Thanks again for this contribution @Adam_Rivers! We're very excited to have this a part of the QIIME 2 ecosystem! Please let me know if you have any questions about this.

6 Likes

Hello Adam_Rivers
when I used your data ,I got a mistake .
when I use “deblur denoise-16S” to get “table.qza ;rep-seqs.qza” but the two files are empty. but it seems not because of the data

I'm not the Deblur developer, but It looks like qiime deblur deblur-16S is designed for 16S sequences not ITS sequences. You can follow the tutorial and use Dada2 or use qiime deblur denoise-other and provide it an appropriate ITS positive filter file.

The help for this command says.

Usage: qiime deblur denoise-16S [OPTIONS]

Perform sequence quality control for Illumina data using the Deblur
workflow with a 16S reference as a positive filter. Only forward reads are
supported at this time. The specific reference used is the 88% OTUs from
Greengenes 13_8. This mode of operation should only be used when data were
generated from a 16S amplicon protocol on an Illumina platform. The
reference is only used to assess whether each sequence is likely to be 16S
by a local alignment using SortMeRNA with a permissive e-value; the
reference is not used to characterize the sequences.

1 Like

Thanks for reviewing @gregcaporaso . I'll work through the comments. I did have questions about the output data types in point 7.

I was looking at q2_dada2's input types and it looks like:

I don't see anything about the type SampleData[JoinedSequencesWithQuality] being accepted by dada2 is it a subtype or something?

If I did change types, can I just change my data output type from SampleData[SequencesWithQuality] to SampleData[JoinedSequencesWithQuality] without any additional changes?

I think that Dada2 has the ability to learn error rates from single ended data as well as paired end data now, but I don't know the method. @benjjneb can you provide guidance on this? How are error profiles calculated in denoise_single ? Will merged scores interfere with the error rate estimation procedures?

Merging by BBMerge does change the quality scores since most positions are verified by two reads. The score change is shown for one ITS1 sample here:

Changing to a paied end output requires a major rewrite of ITSxpress so I'd like to explore other options first.

I'm not quire sure I understand point 8, but I can remove that input type. Do I then need to add a third command for SampleData[JoinedSequencesWithQuality] that outputs the same type?

Adam

In short, DADA2 does not recommend processing pre-merged reads, because the different regions of pre-merged reads (Forward-only, overlapping, reverse-only) have different relationships between the assigned quality scores and the error rates, which can lead to false-positive ASV inference. You can see more discussion of this here: Fungal ITS pre-processing suggestion · Issue #327 · benjjneb/dada2 · GitHub

This is a bit annoying, and we would like to support pre-merged reads, but when we evaluated this possibility again recently we were not able to attain the same accuracy on such reads as we could in our recommended merge-later workflow.

Oh, that’s a bummer. I saw that Dada2 added single-end support so I went forward with using merging in ITSxpress based on my bad assumption that merged reads could be used equally well.

One solution to the issue could be to use unsupervised HMM training to estimate an emission and transition matrix for the merged and unmerged regions based on the pattern of quality scores. Then the Hmm could be applied to segment the reads and learn three different error rates. It’s not trivial though.

How are error rates learned for unpaired sequences since they cannot be merged? Are similar reads clustered then compared?

I wanted to follow up with a question about Deblur for ITS. @wasade and @gregcaporaso, in general, what are your thoughts on the appropriateness of using merged data in Deblur? How does Deblur handle merged data and does merging impact the performance of the Deblur denoise-other algorithm? Also what is an appropriate positive filter file for ITS regions using denoise-other?

Does ITSxpress work on single-end reads? If so, why not use it to trim ITS in forward/reverse separately, and then denoise with dada2?

deblur can handle pre-merged reads — actually, paired-end reads must be joined prior to passing to q2-deblur.

This is just to perform a rough positive filter. I've used the UNITE sequences clustered at 97% (mostly because pre-clustered seqs existed at that level) but you could probably go lower. For 16S I think the greengenes 88% OTUs are used.

1 Like