Clustering sequences into OTUs using q2-vsearch

gregcaporaso · September 29, 2017, 2:01pm

This community tutorial has been migrated to our official documentation. Please refer to that tutorial instead.


De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2.

Clustering sequences or features into OTUs with vsearch is currently possible from either of two starting points: demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact), or dereplicated, quality-controlled data in a feature table and feature representative sequences (i.e., the FeatureTable[Frequency] and FeatureData[Sequence] artifacts, which could be generated using the qiime dada2 denoise-* or qiime deblur denoise-* commands). The first option is currently performed in two steps, dereplication followed by clustering (but will likely be accessible through a single command in the future for convenience). The second option is performed in one step.

QIIME 1 users: demultiplexed, quality-filtered sequence data corresponds to the seqs.fna file generated by the QIIME 1 split_libraries*.py commands.
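To make the seqs.fna layout concrete, here is a minimal sketch that counts reads per sample in a toy file. It assumes the usual QIIME 1 header convention, where the first header field is the sample ID followed by an underscore and a running number. The file names and reads are made up for illustration and are not the tutorial data.

```shell
# Toy stand-in for a QIIME 1 seqs.fna file (hypothetical reads, not real data).
cat > toy-seqs.fna <<'EOF'
>S1_0 readA orig_bc=AAAA new_bc=AAAA bc_diffs=0
ACGTACGT
>S1_1 readB orig_bc=AAAA new_bc=AAAA bc_diffs=0
ACGTACGT
>S2_0 readC orig_bc=CCCC new_bc=CCCC bc_diffs=0
ACGTACGT
>S2_1 readD orig_bc=CCCC new_bc=CCCC bc_diffs=0
TTGACCAA
EOF

# The sample ID is the first header field with the leading ">" and the
# trailing "_<number>" removed. Count header lines per sample ID.
awk '/^>/ { id = substr($1, 2); sub(/_[0-9]+$/, "", id); n[id]++ }
     END  { for (s in n) print s "\t" n[s] }' toy-seqs.fna \
  | LC_ALL=C sort > reads-per-sample.tsv

cat reads-per-sample.tsv
```

A quick count like this can be a useful sanity check that sample IDs parse as expected before importing the file.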

After working through this tutorial, you will know how to run both de novo and closed-reference clustering. This will be illustrated beginning with a QIIME 1 seqs.fna file that will be read into a SampleData[Sequences] artifact. If you already have FeatureTable[Frequency] and FeatureData[Sequence] artifacts that you'd like to cluster, you can skip ahead to the Clustering of FeatureTable[Frequency] and FeatureData[Sequence] section of this tutorial.

Downloading data used in this tutorial

seqs.fna
reference sequences for closed-reference OTU clustering

Dereplicating a `SampleData[Sequences]` artifact

If you are beginning your analysis with demultiplexed, quality-controlled sequences, such as those in a QIIME 1 seqs.fna file, your first step is to import that data into a QIIME 2 artifact. The semantic type used here is SampleData[Sequences], indicating that the data represents collections of sequences associated with one or more samples.

qiime tools import \
  --input-path seqs.fna \
  --output-path seqs.qza \
  --type 'SampleData[Sequences]'

After importing data, you can dereplicate it with the dereplicate-sequences command.

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is the feature table indicating the number of times each amplicon sequence variant (ASV) is observed in each of your samples. The FeatureData[Sequence] contains the mapping of each feature identifier to the sequence variant that defines that feature. These files are analogous to those generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no denoising, chimera removal, or other quality control has been applied in the dereplication process. (In this example, the only quality control of these data is what was applied outside of QIIME 2, before the import step.)
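As a rough illustration of what dereplication produces, the sketch below counts how often each distinct sequence occurs in each sample of a toy file. This is not how vsearch implements dereplication; it assumes one sequence line per record, uses the raw sequence as the feature key rather than a feature identifier, and all file names and reads are hypothetical.

```shell
# Toy demultiplexed reads (hypothetical); headers follow the "<sample>_<n>" pattern.
cat > toy-seqs.fna <<'EOF'
>S1_0
ACGTACGT
>S1_1
ACGTACGT
>S2_0
ACGTACGT
>S2_1
TTGACCAA
EOF

# Header lines set the current sample; each sequence line increments the
# count for its (sequence, sample) pair. The result is a long-format
# feature table: sequence, sample, count.
awk '/^>/ { sample = substr($1, 2); sub(/_[0-9]+$/, "", sample); next }
          { count[$0 "\t" sample]++ }
     END  { for (k in count) print k "\t" count[k] }' toy-seqs.fna \
  | LC_ALL=C sort > toy-derep.tsv

cat toy-derep.tsv
```

Each distinct sequence plays the role of a feature, and the counts play the role of the FeatureTable[Frequency] entries. No reads are merged unless they are identical, which is why no quality control is implied by this step.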

Clustering of `FeatureTable[Frequency]` and `FeatureData[Sequence]`

OTU clustering in QIIME 2 is currently applied to a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. These artifacts can come from a variety of analysis pipelines, including qiime vsearch dereplicate-sequences (illustrated above), qiime dada2 denoise-*, qiime deblur denoise-*, or one of the clustering processes illustrated below (for example, to recluster data at a lower percent identity).

The sequences in the FeatureData[Sequence] artifact are clustered against one another (in de novo clustering) or a reference database (in closed-reference clustering), and then features in the FeatureTable are collapsed, resulting in new features that are clusters of the input features.
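The collapsing step can be sketched with a toy example. Given a long-format feature table and a feature-to-cluster assignment, counts for features in the same cluster are summed per sample. The feature names, cluster names, and files here are hypothetical, and the cluster assignment is simply given rather than computed by vsearch.

```shell
# Long-format toy feature table: feature, sample, count (hypothetical values).
printf 'f1\tS1\t5\nf2\tS1\t3\nf3\tS1\t1\nf1\tS2\t2\nf3\tS2\t4\n' > toy-table.tsv

# Hypothetical cluster assignment: f1 and f2 fall into otuA, f3 into otuB.
printf 'f1\totuA\nf2\totuA\nf3\totuB\n' > toy-clusters.tsv

# First file: load the feature -> cluster map.
# Second file: add each count to its (cluster, sample) total.
awk -F'\t' 'NR == FNR { cluster[$1] = $2; next }
                      { total[cluster[$1] "\t" $2] += $3 }
            END       { for (k in total) print k "\t" total[k] }' \
  toy-clusters.tsv toy-table.tsv | LC_ALL=C sort > toy-collapsed.tsv

cat toy-collapsed.tsv
```

In this toy case, otuA in sample S1 receives the summed counts of f1 and f2, so the total number of observations per sample is preserved while the number of features drops from three to two.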

De novo clustering

De novo clustering of a feature table can be performed as follows. In this example, clustering is performed at 99% identity to create 99% OTUs.

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza \
  --p-perc-identity 0.99

The outputs from this process are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact will contain the centroid sequence defining each OTU cluster.

Closed-reference clustering

Closed-reference clustering of a feature table can be performed as follows. In this example, clustering is performed at 85% identity against the Greengenes 13_8 85% OTUs reference database. The reference database is provided as a FeatureData[Sequence] artifact.

Note: Closed-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don't have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform closed-reference clustering.

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-cr-85.qza \
  --o-unmatched-sequences unmatched.qza

The outputs from cluster-features-closed-reference are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact in this case does not contain the sequences defining the features in the FeatureTable, but rather the collection of feature ids and their sequences that didn't match the reference database at 85% identity. In closed-reference OTU picking, the reference sequences provided as input should be used as the sequences defining the features in the FeatureTable.
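The relationship between the two outputs can be sketched with toy data. Features that hit a reference sequence are relabeled with the reference ID and their counts summed, while features without a hit are set aside as unmatched. The hit assignments, IDs, and file names below are hypothetical and are not produced by vsearch.

```shell
# Long-format toy feature table: feature, sample, count (hypothetical values).
printf 'f1\tS1\t5\nf2\tS1\t3\nf3\tS2\t4\n' > toy-table.tsv

# Hypothetical reference hits: f1 and f2 match ref42; f3 has no hit.
printf 'f1\tref42\nf2\tref42\n' > toy-hits.tsv

# Matched features contribute to a (reference ID, sample) total;
# features with no hit are collected separately as unmatched.
awk -F'\t' 'NR == FNR   { ref[$1] = $2; next }
            ($1 in ref) { total[ref[$1] "\t" $2] += $3; next }
                        { unmatched[$1] = 1 }
            END { for (k in total) print k "\t" total[k] > "toy-cr-table.unsorted"
                  for (f in unmatched) print f > "toy-unmatched.unsorted" }' \
  toy-hits.tsv toy-table.tsv

LC_ALL=C sort toy-cr-table.unsorted > toy-cr-table.tsv
LC_ALL=C sort toy-unmatched.unsorted > toy-unmatched.txt

cat toy-cr-table.tsv
cat toy-unmatched.txt
```

Note that the feature IDs in the resulting table are reference IDs, which is why the reference sequences, not the unmatched sequences, define the features.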