Using q2-clawback to assemble taxonomic weights

Note: This guide assumes you have installed QIIME 2 using one of the procedures in the install documents.

This tutorial extends the q2-feature-classifier tutorial to show how you can improve taxonomic classification accuracy by assembling ecologically informed taxonomic weights for your samples using q2-clawback. This approach is described in this article:
https://doi.org/10.1038/s41467-019-12669-6

We will retrain the naive Bayes classifier from the q2-feature-classifier tutorial to classify the skin and tongue samples from the Moving Pictures tutorial, then show how you might assemble weights for classifying a specific group of samples, in this instance taken from Stilton cheeses.

We will also demonstrate how to get a rough idea of how important your bespoke weights are for classification accuracy.

We will download and create several files, so first create a working directory.

mkdir clawback-tutorial
cd clawback-tutorial

Installing q2-clawback

pip install redbiom
pip install q2-clawback

or

conda install -c kaehler -c conda-forge q2-clawback

Obtaining and importing reference data sets

We will require reference sequences and taxonomies to train the classifiers and also to create appropriate taxonomic weights.

To reduce computation time for this tutorial we will use the relatively small Greengenes 13_8 85% OTU data set. Do not use the 85% OTU data set used in this tutorial for classification of real experimental data. We recommend using more information-rich sequences, e.g., reference sequences clustered at 99% sequence similarity, for classification of real data. See the QIIME 2 data resources page for links to complete QIIME-compatible reference datasets.

We will also download the representative sequences from the Moving Pictures tutorial to test our classifier.

wget -O "85_otus.fasta" "https://data.qiime2.org/2018.6/tutorials/training-feature-classifiers/85_otus.fasta"
wget -O "85_otu_taxonomy.txt" "https://data.qiime2.org/2018.6/tutorials/training-feature-classifiers/85_otu_taxonomy.txt"
wget -O "rep-seqs.qza" "https://data.qiime2.org/2018.6/tutorials/filtering/sequences.qza"
wget -O "table.qza" "https://data.qiime2.org/2018.6/tutorials/filtering/table.qza"
wget -O "sample-metadata.tsv" "https://data.qiime2.org/2018.6/tutorials/moving-pictures/sample_metadata.tsv"
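
An interrupted download can leave a missing or empty file that only fails later, at import time. A minimal sketch of a sanity check, assuming the file names used in the wget commands above; `check_files` is a hypothetical helper and not part of QIIME 2:

```shell
# Hypothetical helper: report any listed file that is missing or empty.
check_files() {
  status=0
  for f in "$@"; do
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# File names match the wget commands in this tutorial.
check_files 85_otus.fasta 85_otu_taxonomy.txt rep-seqs.qza table.qza sample-metadata.tsv \
  || echo "Re-run the failed downloads before importing."
```

The check only confirms that each file exists and is non-empty; it does not validate the contents.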

Next we import these data into QIIME 2 Artifacts.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza
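
The taxonomy import above relies on the file being tab-separated, with a feature identifier and a taxonomy string on each line and no header row. A rough pre-import check can be sketched with awk, under that assumption; `count_bad_taxonomy_lines` is a hypothetical helper, and the two sample lines are made up for illustration:

```shell
# Hypothetical helper: count lines with fewer than two tab-separated fields.
count_bad_taxonomy_lines() {
  awk -F '\t' 'NF < 2 { bad++ } END { print bad + 0 }' "$1"
}

# Made-up example file: one well-formed line, one line missing its tab.
demo="$(mktemp)"
printf '1001\tk__Bacteria; p__Firmicutes\n' > "$demo"
printf '1002 k__Bacteria; p__Proteobacteria\n' >> "$demo"
count_bad_taxonomy_lines "$demo"
rm -f "$demo"
```

On the real file you would run `count_bad_taxonomy_lines 85_otu_taxonomy.txt`; a result of 0 means every line has at least two tab-separated fields.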

Check Data Availability using redbiom

q2-clawback provides a convenience command to check how much data is available, broken down by metadata category and available contexts. We focus on EMPO 3 habitat types, which we showed in the paper to be effective for increasing accuracy.

qiime clawback summarize-Qiita-metadata-category-and-contexts \
  --p-category empo_3 \
  --o-visualization available_empo3.qzv

We would like to classify skin sequences so we select "Animal surface". We would like sequence variants (SVs), and we would like them to be as long as possible, so we select a context that starts with "Deblur" and contains "150nt". The best current context for this combination is "Deblur-Illumina-16S-V4-150nt-780653", so we will use that in the next command.
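
The selection rule just described (starts with "Deblur", contains "150nt") can be sketched as a text filter, assuming you have the context names in a plain list, one per line. `filter_contexts` is a hypothetical helper; the first name below is the context used in this tutorial and the other two are made up for illustration:

```shell
# Hypothetical helper: keep names that start with "Deblur" and contain "150nt".
filter_contexts() {
  grep -E '^Deblur.*150nt'
}

# Only the first name is real; the others are made-up examples.
printf '%s\n' \
  'Deblur-Illumina-16S-V4-150nt-780653' \
  'Deblur-Example-16S-V4-90nt-000000' \
  'Closed-reference-Example-16S-V4-150nt-000000' \
  | filter_contexts
```

If several contexts pass the filter, you still need to pick one by hand, for example the one with the most samples in your habitat of interest.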

Assembling Weights

We will need a classifier to classify the downloaded SVs. We can see from the context identifier that the SVs come from the V4 region, so we extract that region from the reference sequences and use the extracted reads to train a classifier.

qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-v4.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier uniform-classifier.qza

We are now ready to download the skin samples from Qiita and assemble the weights. This step downloads all of the Qiita Animal surface samples with 150 nt reads, so it takes a while, around half an hour. It is the slowest step in the tutorial.

qiime clawback assemble-weights-from-Qiita \
  --i-classifier uniform-classifier.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-reference-sequences ref-seqs-v4.qza \
  --p-metadata-key empo_3 \
  --p-metadata-value "Animal surface" \
  --p-context Deblur-Illumina-16S-V4-150nt-780653 \
  --o-class-weight animal-surface-weights.qza

Retrain the Classifier

We can now retrain the classifier using the bespoke weights.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-class-weight animal-surface-weights.qza \
  --o-classifier animal-surface-classifier.qza

Classify as Normal

Well, not quite as normal. The Moving Pictures representative sequences contain gut, skin, and tongue samples. We only want the skin and tongue samples, so we filter out the gut samples and then use the classifier as usual.

qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "BodySite!='gut'" \
  --o-filtered-table no-gut-table.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table no-gut-table.qza \
  --o-filtered-data no-gut-seqs.qza

qiime feature-classifier classify-sklearn \
  --i-classifier animal-surface-classifier.qza \
  --i-reads no-gut-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Assembling More Exotic Weights

We can unpack some of the q2-clawback commands to assemble weights either from your own curated data or from Qiita data that is defined by a more interesting search. In this example we download all of the Stilton cheese samples from Qiita, but the data could come from anywhere.

redbiom search metadata "cheese where cheese_type=='stilton'" > sample_ids
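
The number of IDs returned by the search determines how long the later steps run, so it can be worth counting them before fetching. A minimal sketch, assuming sample_ids holds one sample ID per line; `count_ids` is a hypothetical helper and the IDs in the demo file are made up:

```shell
# Hypothetical helper: count the non-empty lines in a file.
count_ids() {
  # grep -c prints the number of lines containing at least one character;
  # "|| true" keeps the exit status at zero when the count is 0.
  grep -c . "$1" || true
}

# Made-up IDs, only to show the helper in action.
demo_ids="$(mktemp)"
printf '%s\n' 'study1.sampleA' 'study1.sampleB' 'study2.sampleC' > "$demo_ids"
count_ids "$demo_ids"
rm -f "$demo_ids"
```

On the real search result you would run `count_ids sample_ids`.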

redbiom fetch samples \
  --from sample_ids \
  --context Deblur-Illumina-16S-V4-150nt-780653 \
  --output samples.biom

qiime tools import \
  --type 'FeatureTable[Frequency]' \
  --input-path samples.biom \
  --output-path samples.qza

qiime clawback sequence-variants-from-samples \
  --i-samples samples.qza \
  --o-sequences sv.qza

qiime feature-classifier classify-sklearn \
  --i-classifier uniform-classifier.qza \
  --i-reads sv.qza \
  --p-confidence=disable \
  --o-classification classification.qza

qiime clawback generate-class-weights \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-reference-sequences ref-seqs-v4.qza \
  --i-samples samples.qza \
  --i-taxonomy-classification classification.qza \
  --o-class-weight stilton-weights.qza

The stilton-weights.qza can now be used to train a bespoke classifier for use with samples taken from Stilton cheese.

As another example, if you would like to train a classifier specifically for classifying sequences from human stool samples, you could replace the first command above with

redbiom search metadata "where host_taxid==9606 and (sample_type=='stool' or sample_type=='Stool')" > sample_ids

As there are currently 164 samples in the Stilton example and 31,955 human stool samples, the downstream commands will take much longer for the stool data. We recommend setting --p-n-jobs for classify-sklearn to the largest number that your computer will accommodate. Note that you may be restricted by memory before you are restricted by the number of CPUs.
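
One way to choose a value is to start from the CPU count the system reports and cap it to leave memory headroom. A minimal sketch; `pick_n_jobs` and the cap of 4 are hypothetical, and `getconf _NPROCESSORS_ONLN` is assumed to be available (it is on most Linux and macOS systems):

```shell
# Hypothetical helper: print the smaller of the CPU count and a chosen cap.
pick_n_jobs() {
  max_jobs="$1"
  # Fall back to a single job if the CPU count cannot be determined.
  cpus="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" || cpus=1
  case "$cpus" in ''|*[!0-9]*) cpus=1 ;; esac
  if [ "$cpus" -gt "$max_jobs" ]; then
    echo "$max_jobs"
  else
    echo "$cpus"
  fi
}

# The cap of 4 is an arbitrary example; lower it if memory runs out.
N_JOBS="$(pick_n_jobs 4)"
echo "classify-sklearn could be run with: --p-n-jobs $N_JOBS"
```

If classification fails or the machine starts swapping, lowering the cap is usually the first thing to try.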
