silva-132-99-515-806-nb-classifier white spaces

nick-youngblut · December 5, 2019, 2:16pm

It appears that the silva-132-99-515-806-nb-classifier.qza artifact (https://data.qiime2.org/2019.10/common/silva-132-99-515-806-nb-classifier.qza) contains trailing whitespace in some of the taxonomy strings. This causes downstream errors when running qiime metadata tabulate or qiime taxa filter-table, such as: CategoricalMetadataColumn does not support values with leading or trailing whitespace characters. Column 'Taxon' has the following value: 'D_0__Bacteria;D_1__Gemmatimonadetes;D_2__Gemmatimonadetes;D_3__Gemmatimonadales;D_4__Gemmatimonadaceae;D_5__uncultured;D_6__uncultured bacterium '

I'd like to just filter out the whitespace from silva-132-99-515-806-nb-classifier.qza, but exporting it via qiime tools export just creates a tar file, which un-tars to a pkl file, and loading that file via pickle.load() raises the following error: _pickle.UnpicklingError: invalid load key, 'D'.

It would be great if i) the whitespace were removed from https://data.qiime2.org/2019.10/common/silva-132-99-515-806-nb-classifier.qza and ii) it were easier to access the data within that artifact.

thermokarst · December 5, 2019, 5:32pm

Hey @nick-youngblut, this was addressed in the 2019.10 release of QIIME 2; all you need to do is upgrade.

From the 2019.10 Changelog:

For some more details, I have outlined two scenarios below.

Scenario A

Taxonomy with whitespace imported prior to QIIME 2 2019.10 (example uses the same FeatureData[Sequence] for training and classification).

ref-taxonomy.qza (5.1 KB)
rep-seqs.qza (5.2 KB)

# first, export ref-taxonomy.qza to confirm there is whitespace present
qiime tools export \
  --input-path ref-taxonomy.qza \
  --output-path whitespace-check
cat whitespace-check/taxonomy.tsv

The cat will show something like this:

Feature ID      Taxon
f1           t1
f2         t2
f3       t3
f4       t4
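Because trailing spaces are hard to see in plain cat output, it can help to print the file in an unambiguous form. Here is a minimal sketch, assuming a POSIX sed; the file name demo-taxonomy.tsv and its contents are made up for illustration and are not the attached example files.

```shell
# Build a tiny stand-in taxonomy file (hypothetical contents):
# f1 has two trailing spaces after its taxon, f2 has none.
printf 'Feature ID\tTaxon\nf1\tt1  \nf2\tt2\n' > demo-taxonomy.tsv

# `sed -n l` prints tabs as \t and marks each end of line with $,
# so any spaces sitting directly before the $ are trailing whitespace.
visible=$(sed -n l demo-taxonomy.tsv)
printf '%s\n' "$visible"
```

In the printed form, a row that ends in spaces followed by $ is one that would trigger the CategoricalMetadataColumn error on releases that do not strip whitespace.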

Next, train a classifier, and classify FeatureData[Sequence].

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads rep-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

taxonomy.qzv (1.2 MB)

Please note the whitespace has been stripped.

qiime tools extract \
  --input-path taxonomy.qzv \
  --output-path .
# note the path will be different, UUIDs are unique
cat 45f0c56c-9ff7-48be-9e49-7d5118ece5f9/data/metadata.tsv

The results:

Feature ID      Taxon   Confidence
#q2:types       categorical     categorical
f1      t4      0.9970119521912352
f2      t3      0.9970119521912352
f3      t2      0.9970119521912347
f4      t1      0.9970119521912347
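The same confirmation can be made mechanical rather than visual. This is a rough sketch, assuming a POSIX awk; the helper name find_padded_fields and the two sample files are hypothetical and not part of QIIME 2.

```shell
# Hypothetical helper: print every row of a TSV in which any
# tab-separated field starts or ends with a space, and exit non-zero
# if at least one such row was found.
find_padded_fields() {
  awk -F '\t' '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^ / || $i ~ / $/) {
          print FILENAME ":" NR ": " $0
          bad = 1
          break
        }
      }
    }
    END { exit bad }
  ' "$1"
}

# Stand-in files with made-up contents, one clean and one padded.
printf 'Feature ID\tTaxon\tConfidence\nf1\tt4\t0.99\n' > clean.tsv
printf 'Feature ID\tTaxon\tConfidence\nf1\tt4 \t0.99\n' > padded.tsv

find_padded_fields clean.tsv && echo "clean.tsv: no padded values"
find_padded_fields padded.tsv || echo "padded.tsv: padded values found"
```

Pointing the helper at the extracted data/metadata.tsv should report nothing if the whitespace really has been stripped.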

Scenario B

Importing taxonomy with whitespace in QIIME 2 2019.10 and newer.

taxonomy.tsv (40 Bytes)

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path taxonomy.tsv \
  --output-path ref-taxonomy-stripped.qza

qiime tools export \
  --input-path ref-taxonomy-stripped.qza \
  --output-path stripped
cat stripped/taxonomy.tsv

The results:

Feature ID      Taxon
f1      t1
f2      t2
f3      t3
f4      t4

Note that the whitespace in the taxon strings has been stripped.
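To illustrate what that stripping amounts to, its effect on a headerless TSV can be imitated with awk. This is only a sketch of the observable behavior, assuming a POSIX awk, and not QIIME 2's actual implementation; the helper name strip_fields and the file names are hypothetical.

```shell
# Hypothetical helper: trim leading and trailing spaces from every
# tab-separated field and write the cleaned table to stdout.
strip_fields() {
  awk 'BEGIN { FS = OFS = "\t" }
    {
      for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
      print
    }
  ' "$1"
}

# Made-up headerless taxonomy: f1 has trailing spaces, f2 a leading one.
printf 'f1\tt1  \nf2\t t2\n' > padded-taxonomy.tsv
strip_fields padded-taxonomy.tsv > stripped-taxonomy.tsv
cat stripped-taxonomy.tsv
```

Spaces inside a value, such as the one in "uncultured bacterium", are left alone; only the padding at either end of a field is removed.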

As for filtering the whitespace out of the classifier itself: the classifier doesn't work like that. It is a binary file, and editing it isn't recommended.

i) No need to remove the whitespace, simply upgrade. Please note, 2019.10 is the only version of QIIME 2 currently supported.
ii) You are generalizing from your experience with trying to edit a binary pickle. Exporting and extracting data are first-class citizens in QIIME 2, and the resulting data is in whatever format the Semantic Type represents the data as (TSV, JSON, fastq, pkl, etc.). You are simply trying to do something that doesn't really make sense for this kind of data.

Hope that helps!

nick-youngblut · December 6, 2019, 7:57am

Thanks @thermokarst for the very comprehensive response! Updating to 2019.10 fixed the issue for me.