Clustering sequences into OTUs using q2-vsearch


(Greg Caporaso) #1

This community tutorial has been migrated to our official documentation. Please refer to that tutorial instead.


De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2.

OTU clustering with vsearch can start from either of two inputs: demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact), or dereplicated, quality-controlled data in a feature table and feature representative sequences (i.e., FeatureTable[Frequency] and FeatureData[Sequence] artifacts, which could be generated using the qiime dada2 denoise-* or qiime deblur denoise-* commands). The first option currently takes two steps, dereplication followed by clustering (though it will likely become available as a single command in the future for convenience). The second option takes one step.

QIIME 1 users: demultiplexed, quality-filtered sequence data corresponds to the seqs.fna file generated by the QIIME 1 split_libraries*.py commands.

After working through this tutorial, you will know how to run both de novo and closed-reference clustering. This will be illustrated beginning with a QIIME 1 seqs.fna file that will be read into a SampleData[Sequences] artifact. If you already have FeatureTable[Frequency] and FeatureData[Sequence] artifacts that you’d like to cluster, you can skip ahead to the Clustering of FeatureTable[Frequency] and FeatureData[Sequence] section of this tutorial.

Downloading data used in this tutorial

seqs.fna
reference sequences for closed-reference OTU clustering

Dereplicating a SampleData[Sequences] artifact

If you are beginning your analysis with demultiplexed, quality-controlled sequences, such as those in a QIIME 1 seqs.fna file, your first step is to import that data into a QIIME 2 artifact. The semantic type used here is SampleData[Sequences], indicating that the data represents collections of sequences associated with one or more samples.

qiime tools import \
  --input-path seqs.fna \
  --output-path seqs.qza \
  --type 'SampleData[Sequences]'
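
If the import fails or the sample counts look off, it can help to check the FASTA headers first. The sketch below is not part of QIIME 2. It assumes each header follows the QIIME 1 convention of ">SampleID_SeqNumber" (with an optional description after a space), and the function name count_seqs_per_sample is made up for this illustration.

```shell
# Count sequences per sample in a QIIME 1-style FASTA file.
# Assumption: each header looks like ">SampleID_SeqNumber optional description".
count_seqs_per_sample() {
  # $1: path to the FASTA file
  awk '/^>/ {
         id = substr($1, 2)        # drop the leading ">"
         sub(/_[^_]*$/, "", id)    # strip the trailing _SeqNumber
         counts[id]++
       }
       END { for (s in counts) print s "\t" counts[s] }' "$1" | LC_ALL=C sort
}
```

Running count_seqs_per_sample seqs.fna prints one line per sample with its sequence count, which should line up with the samples you expect to see after import.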

After importing data, you can dereplicate it with the dereplicate-sequences command.

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is the feature table indicating the number of times each amplicon sequence variant (ASV) is observed in each of your samples. The FeatureData[Sequence] contains the mapping of each feature identifier to the sequence variant that defines that feature. These files are analogous to those generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no denoising, chimera removal, or other quality control has been applied in the dereplication process. (In this example, the only quality control of these data is what was applied outside of QIIME 2, before the import step.)
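
To make the idea of dereplication concrete, here is a toy version written with standard shell tools. It is only a conceptual sketch, not how q2-vsearch is implemented: it assumes single-line sequences and ">SampleID_SeqNumber" headers, and it uses the sequence itself as the feature identifier. The function name dereplicate_fasta is hypothetical.

```shell
# Toy dereplication: count how often each distinct sequence occurs per sample.
# Assumptions: one line per sequence, headers of the form ">SampleID_SeqNumber".
dereplicate_fasta() {
  # $1: path to the FASTA file
  awk '/^>/ {
         sample = substr($1, 2)        # drop the leading ">"
         sub(/_[^_]*$/, "", sample)    # keep only the sample ID
         next
       }
       NF { counts[$1 "\t" sample]++ } # key: sequence + sample
       END { for (k in counts) print k "\t" counts[k] }' "$1" | LC_ALL=C sort
}
```

Each output line has the form sequence, sample, count. Taken together the lines are a long-format view of a feature table, and the distinct sequences in the first column play the role of the representative sequences.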

Clustering of FeatureTable[Frequency] and FeatureData[Sequence]

OTU clustering in QIIME 2 is currently applied to a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. These artifacts can come from a variety of analysis pipelines, including qiime vsearch dereplicate-sequences (illustrated above), qiime dada2 denoise-*, qiime deblur denoise-*, or one of the clustering processes illustrated below (for example, to recluster data at a lower percent identity).

The sequences in the FeatureData[Sequence] artifact are clustered against one another (in de novo clustering) or a reference database (in closed-reference clustering), and then features in the FeatureTable are collapsed, resulting in new features that are clusters of the input features.
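
The collapsing step can be pictured as summing the counts of all input features assigned to the same cluster. The sketch below illustrates that with two tab-separated text files. The file layouts and the function name collapse_features are assumptions made for this example, not QIIME 2 formats. Features missing from the map are dropped here, which is meant to resemble features that fail to match a reference.

```shell
# Toy collapse of a feature table into clusters.
# $1: two-column map          (feature_id <tab> cluster_id)
# $2: long-format table       (feature_id <tab> sample_id <tab> count)
collapse_features() {
  awk -F '\t' '
    NR == FNR { cluster[$1] = $2; next }                  # first file: load the map
    ($1 in cluster) { sums[cluster[$1] "\t" $2] += $3 }   # second file: add up counts
    END { for (k in sums) print k "\t" sums[k] }
  ' "$1" "$2" | LC_ALL=C sort
}
```

Called as collapse_features map.tsv table.tsv, it prints one line per cluster and sample with the summed count.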

De novo clustering

De novo clustering of a feature table can be performed as follows. In this example, clustering is performed at 99% identity to create 99% OTUs.

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza \
  --p-perc-identity 0.99

The outputs from this process are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact will contain the centroid sequence defining each OTU cluster.

Closed-reference clustering

Closed-reference clustering of a feature table can be performed as follows. In this example, clustering is performed at 85% identity against the Greengenes 13_8 85% OTUs reference database. The reference database is provided as a FeatureData[Sequence] artifact.

Note: Closed-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform closed-reference clustering.

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-cr-85.qza \
  --o-unmatched-sequences unmatched.qza

The outputs from cluster-features-closed-reference are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact in this case does not contain the sequences defining the features in the FeatureTable. Instead, it holds the feature IDs and sequences that didn’t match the reference database at 85% identity. In closed-reference clustering, the reference sequences provided as input are the sequences that define the features in the FeatureTable.
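
One way to inspect these outputs is to generate visualizations of the clustered table and of the unmatched sequences. The sketch below wraps two existing commands, qiime feature-table summarize and qiime feature-table tabulate-seqs, in a small helper. The helper name and the QIIME_CMD variable are hypothetical and exist only so the commands can be previewed without running them. The file names follow the example above.

```shell
# Summarize the closed-reference outputs from the example above.
# QIIME_CMD is a hypothetical override for dry runs (e.g. QIIME_CMD=echo);
# when it is unset, the real qiime executable is used.
summarize_closed_reference() {
  # Per-sample and per-feature counts of the clustered table.
  "${QIIME_CMD:-qiime}" feature-table summarize \
    --i-table table-cr-85.qza \
    --o-visualization table-cr-85.qzv
  # Sequences that did not match the reference at 85% identity.
  "${QIIME_CMD:-qiime}" feature-table tabulate-seqs \
    --i-data unmatched.qza \
    --o-visualization unmatched.qzv
}
```

The resulting .qzv files can be opened with qiime tools view. Comparing the total frequency in table-cr-85.qzv with that of the input table gives a rough sense of how much data was discarded as unmatched.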

