Clustering sequences into OTUs using q2-vsearch

This community tutorial has been migrated to our official documentation. Please refer to that tutorial instead.


De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2.

Clustering sequences or features into OTUs with vsearch is currently possible from two starting points: demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact), or dereplicated, quality-controlled data in the form of a feature table and representative sequences (i.e., FeatureTable[Frequency] and FeatureData[Sequence] artifacts, which could be generated using the qiime dada2 denoise-* or qiime deblur denoise-* commands). The first option is currently performed in two steps (but will likely be accessible through a single command in the future for convenience). The second option is performed in one step.

QIIME 1 users: demultiplexed, quality-filtered sequence data is synonymous with the seqs.fna file, generated by the QIIME 1 split_libraries*.py commands.

After working through this tutorial, you will know how to run both de novo and closed-reference clustering. This will be illustrated beginning with a QIIME 1 seqs.fna file that will be read into a SampleData[Sequences] artifact. If you already have FeatureTable[Frequency] and FeatureData[Sequence] artifacts that you’d like to cluster, you can skip ahead to the Clustering of FeatureTable[Frequency] and FeatureData[Sequence] section of this tutorial.

Downloading data used in this tutorial

seqs.fna
reference sequences for closed-reference OTU clustering

Dereplicating a SampleData[Sequences] artifact

If you are beginning your analysis with demultiplexed, quality-controlled sequences, such as those in a QIIME 1 seqs.fna file, your first step is to import that data into a QIIME 2 artifact. The semantic type used here is SampleData[Sequences], indicating that the data represents collections of sequences associated with one or more samples.

qiime tools import \
  --input-path seqs.fna \
  --output-path seqs.qza \
  --type SampleData[Sequences]
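
Before importing, it can help to confirm that the FASTA headers follow the QIIME 1 demultiplexed convention, since a file in another layout may be rejected at import. The sketch below assumes headers of the form >SAMPLEID_SEQNUM (optionally followed by a description) and takes the sample ID to be everything before the last underscore. count_reads_per_sample is a hypothetical helper written for this illustration, not a QIIME 2 command, and the input is made-up data rather than the tutorial's seqs.fna.

```shell
# Hypothetical helper: count reads per sample in a QIIME 1-style FASTA stream.
# Assumption: headers look like ">SAMPLEID_SEQNUM optional description".
count_reads_per_sample() {
  awk '/^>/ { id = substr($1, 2)          # drop the leading ">"
              sub(/_[^_]*$/, "", id)      # drop "_SEQNUM" (last underscore onward)
              n[id]++ }
       END  { for (id in n) print id "\t" n[id] }' | LC_ALL=C sort
}

# Made-up example input: two reads from S1, one from S2.
printf '>S1_0 some description\nACGT\n>S1_1\nACGT\n>S2_2\nTTGA\n' \
  | count_reads_per_sample
```

On your own data you would run something like count_reads_per_sample < seqs.fna and check that the sample IDs in the first column are the ones you expect.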

After importing data, you can dereplicate it with the dereplicate-sequences command.

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is the feature table indicating the number of times each amplicon sequence variant (ASV) is observed in each of your samples. The FeatureData[Sequence] contains the mapping of each feature identifier to the sequence variant that defines that feature. These files are analogous to those generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no denoising, chimera removal, or other quality control has been applied in the dereplication process. (In this example, the only quality control of these data is what was applied outside of QIIME 2, before the import step.)
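
To make the idea of dereplication concrete, here is a conceptual sketch in plain shell. It is not how vsearch or q2-vsearch is implemented. It assumes each record is a header line followed by a single sequence line, reuses the >SAMPLEID_SEQNUM header assumption, and uses the sequence itself as the feature identifier. dereplicate_sketch is a hypothetical name, and the input is made-up data.

```shell
# Conceptual sketch only: collapse identical sequences and count them per sample.
# Assumptions: one sequence line per record; headers are ">SAMPLEID_SEQNUM ...".
dereplicate_sketch() {
  awk '/^>/ { sample = substr($1, 2)
              sub(/_[^_]*$/, "", sample)   # keep the sample ID only
              next }
       NF   { count[$0 "\t" sample]++ }    # key: sequence + sample
       END  { for (k in count) print k "\t" count[k] }' | LC_ALL=C sort
}

# Made-up example: ACGT appears twice in S1 and once in S2; TTGA once in S2.
# Output columns: sequence, sample, count.
printf '>S1_0\nACGT\n>S1_1\nACGT\n>S2_2\nACGT\n>S2_3\nTTGA\n' \
  | dereplicate_sketch
```

Each distinct sequence plays the role of one row in the feature table, and the per-sample counts are the table's entries. No sequences are merged unless they are exactly identical, which is why no quality control is implied by this step.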

Clustering of FeatureTable[Frequency] and FeatureData[Sequence]

OTU clustering in QIIME 2 is currently applied to a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. These artifacts can come from a variety of analysis pipelines, including qiime vsearch dereplicate-sequences (illustrated above), qiime dada2 denoise-*, qiime deblur denoise-*, or one of the clustering processes illustrated below (for example, to recluster data at a lower percent identity).

The sequences in the FeatureData[Sequence] artifact are clustered against one another (in de novo clustering) or a reference database (in closed-reference clustering), and then features in the FeatureTable are collapsed, resulting in new features that are clusters of the input features.
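
The collapse step can be pictured as summing the counts of all input features assigned to the same cluster. The sketch below is illustrative only: the two tab-separated file layouts (a feature-to-cluster map and a long-format table) are assumptions made for this example and are not formats that QIIME 2 reads or writes, and collapse_table is a hypothetical helper.

```shell
# Conceptual sketch of collapsing a feature table by cluster membership.
# Assumed (hypothetical) inputs, both tab-separated:
#   $1 map file:   feature_id <TAB> cluster_id
#   $2 table file: feature_id <TAB> sample_id <TAB> count
collapse_table() {
  awk -F'\t' 'NR == FNR       { cluster[$1] = $2; next }            # read the map
              ($1 in cluster) { total[cluster[$1] "\t" $2] += $3 }  # sum per cluster/sample
              END { for (k in total) print k "\t" total[k] }' "$1" "$2" | LC_ALL=C sort
}

# Made-up example: f1 and f2 fall into otuA, f3 into otuB.
tmp=$(mktemp -d)
printf 'f1\totuA\nf2\totuA\nf3\totuB\n' > "$tmp/map.tsv"
printf 'f1\tS1\t5\nf2\tS1\t3\nf2\tS2\t4\nf3\tS2\t1\n' > "$tmp/table.tsv"
collapse_table "$tmp/map.tsv" "$tmp/table.tsv"   # columns: cluster, sample, count
rm -r "$tmp"
```

In this sketch, features that are missing from the map are dropped from the collapsed table.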

De novo clustering

De novo clustering of a feature table can be performed as follows. In this example, clustering is performed at 99% identity to create 99% OTUs.

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza \
  --p-perc-identity 0.99

The outputs from this process are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact will contain the centroid sequence defining each OTU cluster.
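
As a rough guide to what a percent identity threshold means, the arithmetic below estimates how many mismatches a sequence may have relative to its centroid and still join the cluster. It is a simplification: it assumes an ungapped, full-length alignment, whereas vsearch computes identity from a pairwise alignment that may include gaps. max_mismatches is a hypothetical helper, and the read length of 250 is an arbitrary example value.

```shell
# Simplified arithmetic, assuming an ungapped, full-length alignment.
# usage: max_mismatches LENGTH PERCENT   (PERCENT as an integer, e.g. 99)
max_mismatches() {
  # Integer division rounds down, so identity stays at or above the threshold.
  echo $(( $1 * (100 - $2) / 100 ))
}

max_mismatches 250 99   # prints 2  (248/250 = 0.992)
max_mismatches 250 85   # prints 37 (213/250 = 0.852)
```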

Closed-reference clustering

Closed-reference clustering of a feature table can be performed as follows. In this example, clustering is performed at 85% identity against the Greengenes 13_8 85% OTUs reference database. The reference database is provided as a FeatureData[Sequence] artifact.

Note: Closed-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform closed-reference clustering.

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-cr-85.qza \
  --o-unmatched-sequences unmatched.qza

The outputs from cluster-features-closed-reference are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. In this case, the FeatureData[Sequence] artifact does not contain the sequences defining the features in the FeatureTable. Instead, it contains the feature ids and sequences that did not match the reference database at 85% identity. In closed-reference OTU picking, the reference sequences provided as input should be used as the sequences defining the features in the FeatureTable.
