Using q2-clawback to assemble taxonomic weights

clawback
feature-classifier
redbiom
taxonomy

(Ben Kaehler) #1

Note: This topic is a work in progress. Expect updates in the coming days.

Note: This guide assumes you have installed QIIME 2 using one of the procedures in the install documents.

This tutorial extends the q2-feature-classifier tutorial to show how you can improve classification accuracy by assembling appropriate taxonomic weights for your samples using q2-clawback. We will retrain the naive Bayes classifier from the q2-feature-classifier tutorial to classify the skin and tongue samples from the Moving Pictures tutorial, then show how you might assemble weights for classifying a specific group of samples, in this instance taken from Stilton cheeses.

We will also demonstrate how to get a rough idea of how important your bespoke weights are for classification accuracy.

We will download and create several files, so first create a working directory.

mkdir clawback-tutorial
cd clawback-tutorial

Installing q2-clawback

pip install redbiom
pip install imbalanced-learn
pip install q2-clawback

or

conda install -c kaehler -c conda-forge q2-clawback

Obtaining and importing reference data sets

We will require reference sequences and taxonomies to train the classifiers and also to create appropriate taxonomic weights.

To reduce computation time for this tutorial we will use the relatively small Greengenes 13_8 85% OTU data set. Do not use the 85% OTU data set used in this tutorial for classification of real experimental data. We recommend using more information-rich sequences, e.g., reference sequences clustered at 99% sequence similarity, for classification of real data. See the QIIME 2 data resources page for links to complete QIIME-compatible reference datasets.

We will also download the representative sequences from the Moving Pictures tutorial to test our classifier.

wget -O "85_otus.fasta" "https://data.qiime2.org/2018.6/tutorials/training-feature-classifiers/85_otus.fasta"
wget -O "85_otu_taxonomy.txt" "https://data.qiime2.org/2018.6/tutorials/training-feature-classifiers/85_otu_taxonomy.txt"
wget -O "rep-seqs.qza" "https://data.qiime2.org/2018.6/tutorials/filtering/sequences.qza"
wget -O "table.qza" "https://data.qiime2.org/2018.6/tutorials/filtering/table.qza"
wget -O "sample-metadata.tsv" "https://data.qiime2.org/2018.6/tutorials/moving-pictures/sample_metadata.tsv"

Next we import these data into QIIME 2 Artifacts.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --source-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza
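
If the taxonomy import fails, one thing worth checking is whether the file really is a headerless, two-column, tab-separated table. The following is a minimal sketch of such a check. The demo file and its two rows are invented for illustration; on real data you would point tax_file at 85_otu_taxonomy.txt instead.

```shell
# Hypothetical two-row stand-in for 85_otu_taxonomy.txt (IDs and lineages invented).
tax_file="$(mktemp)"
printf '1001\tk__Bacteria; p__Firmicutes\n1002\tk__Bacteria; p__Proteobacteria\n' > "$tax_file"

# Count lines that do not have exactly two tab-separated fields.
bad_lines="$(awk -F '\t' 'NF != 2 { n++ } END { print n + 0 }' "$tax_file")"
echo "lines without exactly two columns: $bad_lines"
```

A count of zero suggests the layout matches what the import expects; it says nothing about whether the lineages themselves are sensible.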

Check Data Availability using redbiom

q2-clawback provides a convenience command to check how much data is available, broken down by metadata category and available contexts. We focus on EMPO 3 habitat types, which we showed to be effective for increasing accuracy in the paper.

qiime clawback summarize-Qiita-metadata-category-and-contexts \
  --p-category empo_3 \
  --o-visualization available_empo3.qzv

We would like to classify skin sequences so we select “Animal surface”. We would like sequence variants (SVs), and we would like them to be as long as possible, so we select a context that starts with “Deblur” and contains “150nt”. The best current context for this combination is “Deblur-NA-illumina-16S-v4-150nt-780653”, so we will use that in the next command.
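
The selection rule just described (starts with "Deblur", contains "150nt") can be sketched as a simple pattern match. This is only an illustration: the first two context names in the list are invented, and only the last one comes from this tutorial.

```shell
# Illustrative context list; the first two names are made up for this demo.
contexts='Example-closed-ref-illumina-16S-v4-100nt-000000
Deblur-NA-illumina-16S-v4-90nt-000000
Deblur-NA-illumina-16S-v4-150nt-780653'

# Keep names that start with "Deblur" and contain "150nt".
matches="$(printf '%s\n' "$contexts" | grep -E '^Deblur.*150nt')"
echo "$matches"
```

If more than one context matches, you would still need to pick among them by hand, for example by comparing how many samples each one holds in the visualization.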

Assembling Weights

We will need a classifier for the purpose of classifying the downloaded SVs. We can see from the context identifier that the SVs come from the V4 region, so we extract that region from the reference sequences and use the extracted reads to train a classifier.

qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-v4.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier uniform-classifier.qza

We are now ready to download the skin samples from Qiita and assemble the weights. This step downloads all of the Qiita Animal surface samples with 150 nt reads, so it takes a while, around half an hour. It is the slowest step in the tutorial.

qiime clawback assemble-weights-from-Qiita \
  --i-classifier uniform-classifier.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-reference-sequences ref-seqs-v4.qza \
  --p-metadata-key empo_3 \
  --p-metadata-value "Animal surface" \
  --p-context Deblur-NA-illumina-16S-v4-150nt-780653 \
  --o-class-weight animal-surface-weights.qza
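
Because this step is slow, you may want to avoid repeating it when rerunning the tutorial commands. One possible approach, sketched below, is a small wrapper that skips a command when its output file already exists. The run_once helper is hypothetical and not part of QIIME 2 or q2-clawback, and the demo uses touch as a stand-in for the real command.

```shell
# run_once OUTPUT CMD...: run CMD only when OUTPUT does not exist yet.
# (Hypothetical helper, not provided by QIIME 2.)
run_once() {
  _out="$1"; shift
  if [ -e "$_out" ]; then
    echo "skip: $_out already exists"
  else
    "$@"
  fi
}

# Demo with a stand-in command (touch) instead of the real qiime call.
demo_dir="$(mktemp -d)"
run_once "$demo_dir/weights.qza" touch "$demo_dir/weights.qza"   # first call creates the file
second="$(run_once "$demo_dir/weights.qza" touch "$demo_dir/weights.qza")"
echo "$second"                                                   # second call is skipped
```

In practice the first argument would be animal-surface-weights.qza, followed by the full qiime clawback assemble-weights-from-Qiita command shown above. Note that the check only looks at whether the file exists, so a partial output left by an interrupted run would need to be deleted by hand.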

Retrain the Classifier

We can now retrain the classifier using the bespoke weights.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-class-weight animal-surface-weights.qza \
  --o-classifier animal-surface-classifier.qza

Classify as Normal

Well, not quite as normal. The Moving Pictures data include gut, skin, and tongue samples. We only want the skin and tongue samples, so we filter out the gut samples and then use the classifier as usual.

qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "BodySite!='gut'" \
  --o-filtered-table no-gut-table.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table no-gut-table.qza \
  --o-filtered-data no-gut-seqs.qza

qiime feature-classifier classify-sklearn \
  --i-classifier animal-surface-classifier.qza \
  --i-reads no-gut-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv
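
To preview how many samples the BodySite!='gut' filter at the start of this section should keep, you could count them directly from the metadata file. The sketch below uses a made-up four-sample file in place of sample-metadata.tsv; the sample IDs are invented, and only the BodySite column name and its values follow the tutorial data.

```shell
# Hypothetical miniature of sample-metadata.tsv (sample IDs invented).
meta="$(mktemp)"
printf '#SampleID\tBodySite\nS1\tgut\nS2\ttongue\nS3\tleft palm\nS4\tgut\n' > "$meta"

# Find the BodySite column from the header, then count the non-gut samples,
# which mirrors the --p-where "BodySite!='gut'" filter.
kept="$(awk -F '\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "BodySite") col = i; next }
  /^#/    { next }                      # skip any further comment rows
  $col != "gut" { n++ }
  END { print n + 0 }
' "$meta")"
echo "samples kept: $kept"
```

The column is looked up by name rather than by position so that the sketch does not depend on the column order of the real file.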

Assembling More Exotic Weights

We can unpack some of the q2-clawback commands to assemble weights either from your own curated data or from Qiita data that is defined by a more interesting search. In this example we download all of the Stilton cheese samples from Qiita, but the data could come from anywhere.

redbiom search metadata "cheese where cheese_type=='stilton'" > sample_ids

redbiom fetch samples \
  --from sample_ids \
  --context Deblur-NA-illumina-16S-v4-150nt-780653 \
  --output samples.biom

qiime tools import \
  --type 'FeatureTable[Frequency]' \
  --input-path samples.biom \
  --output-path samples.qza

qiime clawback sequence-variants-from-samples \
  --i-samples samples.qza \
  --o-sequences sv.qza

qiime feature-classifier classify-sklearn \
  --i-classifier uniform-classifier.qza \
  --i-reads sv.qza \
  --p-confidence=-1 \
  --o-classification classification.qza

qiime clawback generate-class-weights \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-reference-sequences ref-seqs-v4.qza \
  --i-samples samples.qza \
  --i-taxonomy-classification classification.qza \
  --o-class-weight stilton-weights.qza

The stilton-weights.qza can now be used to train a bespoke classifier for use with samples taken from Stilton cheese.
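
A metadata search can return no samples at all, in which case the later steps have nothing to work with. A minimal sketch of a check on the sample_ids file, to be run before the fetch, might look like the following. The demo file and the IDs in it are invented stand-ins for the real search output.

```shell
# Hypothetical stand-in for the sample_ids file written by redbiom search.
ids_file="$(mktemp)"
printf '10000.sampleA\n10000.sampleB\n\n' > "$ids_file"

# Count non-empty lines and report whether the search returned anything.
n_ids="$(grep -c . "$ids_file")"
if [ "$n_ids" -eq 0 ]; then
  echo "no samples matched the search" >&2
else
  echo "found $n_ids sample IDs"
fi
```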

Estimating Performance

q2-clawback also contains a utility for estimating how much of an impact the use of bespoke weights will have on classification accuracy. We will compare the two sets of weights generated above.

qiime clawback precalculate-nearest-neighbors \
  --i-reference-taxonomy ref-taxonomy.qza \
  --i-reference-sequences ref-seqs-v4.qza \
  --o-nearest-neighbors knn.qza

qiime clawback kNN-LOOCV-F-measures \
  --i-nearest-neighbors knn.qza \
  --i-class-weight animal-surface-weights.qza \
  --o-visualization animal-surface-f-measure.qzv

qiime clawback kNN-LOOCV-F-measures \
  --i-nearest-neighbors knn.qza \
  --i-class-weight stilton-weights.qza \
  --o-visualization stilton-f-measure.qzv

Note that these tests only provide part of the story, and factors such as variation amongst samples may affect the improvement actually realised by using bespoke taxonomic weights. Also, as we do not have systematic testing for the 85% Greengenes OTU reference sequences, there is no way to map these results back to an expected performance improvement for real data. However, we would speculate from the above results that Animal surface weights are more important than Stilton weights for improving classifier accuracy.


(Matthew Ryan Dillon) #2

An off-topic reply has been split into a new topic: Q2-clawback questions

Please keep replies on-topic in the future.