CategoricalMetadataColumn does not support values with leading or trailing whitespace characters

Dear developer,
I am using QIIME 2 (version 2019.7.0) to classify rep-seqs.qza.

qiime feature-classifier classify-hybrid-vsearch-sklearn \
  --i-query rep-seqs.qza \
  --i-reference-reads /data/zhouyingli/qiime/Silva132_99.qza \
  --i-reference-taxonomy /data/zhouyingli/qiime/Silva132-taxonomy.qza \
  --i-classifier /data/zhouyingli/qiime/S132_classifier.qza \
  --p-threads 28 \
  --o-classification taxonomy_hybrid.qza

qiime metadata tabulate \
  --m-input-file taxonomy_hybrid.qza \
  --o-visualization taxonomy_hybrid.qzv

qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy taxonomy_hybrid.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots_hybrid.qzv

I get the following errors:
There was an issue with viewing the artifact taxonomy_hybrid.qza as QIIME 2 Metadata:

  CategoricalMetadataColumn does not support values with leading or trailing whitespace characters. Column 'Taxon' has the following value: "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Chromatiales;D_4__Sedimenticolaceae;D_5__Sedimenticola;D_6__Escarpia spicata endosymbiont 'Alvin "

Plugin error from taxa:

  Feature IDs found in the table are missing from the taxonomy: {'2c046ebcbe3c77d290b439269f2981acfbcdae2f', '79d318932fb2dadc0ec4676d3b316cecf2e26f3b', 'a903a7338a38a6374a6253c2d065e937f6c55ce9', 'aa0aafff20819bf70a6826b74db4beec8f0399fd', '17be879cdaa0d023c80fc7dd6c478e65b0e07a4e', '4e0dc35655ba79ac928f8dedbc45427189e0595a', '12976d6d53b1557c836d8cade02bf05d5732ebe1', '4c66c461ce900875f50ba137c391ee0c8f392fce', 'dc9848cd92de1f73865519f703a7d0489c6eba3e', '5cd1a928e44d917f463e19a9fa86f51399ef2d2b', '7fd4bbac9c8d0171cace590c05df0ed21b1064a8', 'd2603311c474aa61d72c3df73b96007bed997573', '1ec69f423b70c69555265e56c90f441013e2069c'}

Debug info has been saved to /tmp/qiime2-q2cli-err-2tpqfyaw.log

Thanks in advance~

Hello @ucassee

Welcome back to the forums!

Thanks for posting your two main errors.

CategoricalMetadataColumn does not support values with leading or trailing whitespace characters.

Take a look at the very end of that taxonomy string:

D_6__Escarpia spicata endosymbiont 'Alvin "
           Here is that extra whitespace ^

EDIT: If you update to the newest version of QIIME 2, this whitespace will be taken care of automatically :+1:
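
If you want to see how many taxonomy strings are affected before upgrading, a rough sketch of the check follows. It assumes a tab-separated taxonomy file with a header row and the Taxon string in the second column, such as the taxonomy.tsv written by qiime tools export. The sample rows and the variable names (tsv, flagged) are made up for illustration.

```shell
# Hypothetical sample standing in for an exported taxonomy.tsv
# (columns: Feature ID, Taxon, Confidence). Point tsv at your own file instead.
tsv="$(mktemp)"
printf 'Feature ID\tTaxon\tConfidence\n' > "$tsv"
printf 'asv1\tD_0__Bacteria;D_1__Proteobacteria\t0.99\n' >> "$tsv"
printf "asv2\tD_0__Bacteria;D_6__Escarpia spicata endosymbiont 'Alvin \t0.80\n" >> "$tsv"

# Skip the header, then print the feature ID and the quoted Taxon value
# for every row whose Taxon field starts or ends with a space.
flagged="$(awk -F'\t' 'NR > 1 && ($2 ~ /^ / || $2 ~ / $/) { printf "%s\t\"%s\"\n", $1, $2 }' "$tsv")"
printf '%s\n' "$flagged"
rm -f "$tsv"
```

The quotes around the printed value make the stray space visible, the same way the error message shows it.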


Feature IDs found in the table are missing from the taxonomy:

Looks like the taxonomy table you passed did not have taxonomy for all the ASVs in your other table. Did your feature-classifier finish successfully?

Colin

Hi @colinbrislawn, I reran the command and the same error occurred.
Do you mean that training the feature classifier did not finish successfully?

I think the error is mainly caused by the qiime feature-classifier classify-hybrid-vsearch-sklearn step.
If I use:

qiime feature-classifier classify-consensus-vsearch \
      --i-query rep-seqs.qza \
      --i-reference-reads /data/zhouyingli/qiime/Silva132_99.qza \
      --i-reference-taxonomy /data/zhouyingli/qiime/Silva132-taxonomy.qza \
      --p-threads 28 \
      --o-classification taxonomy_vsearch.qza

the remaining steps run without errors.

Ah OK, so classify-consensus-vsearch works, but the hybrid classifier is not finishing…

I think your intuition is correct; something is broken at the sklearn step. Any clues? :mag: :female_detective:

Colin

The hybrid classifier includes an optional filtering step to remove sequences that poorly align to the reference, so it looks like you have a handful of sequences that are being filtered by that step. You have two options to resolve:

  1. use the --p-no-prefilter option with the hybrid classifier to disable this filtering step
  2. filter your table with qiime feature-table filter-features to keep only the features found in the metadata file, which here would be the taxonomy from the classifier (i.e., those that pass this filter).

You should probably do the latter. It is a very rough prefilter (50% similarity to a random subsample of the reference sequences), so anything failing to match could be junk. It is still worth manually checking those features: if you are filtering out real sequences, you may need to increase the subsample or just disable the prefilter step.
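
To get a list of the dropped features for that manual check, something like the sketch below could work. It assumes you already have two plain-text files with one feature ID per line, one from the table and one from the taxonomy; the file names and the sample IDs are hypothetical.

```shell
# Hypothetical inputs: one feature ID per line. In practice these could come
# from the exported feature table and the first column of taxonomy.tsv.
workdir="$(mktemp -d)"
printf '%s\n' asv1 asv2 asv3 asv4 > "$workdir/table_ids.txt"
printf '%s\n' asv1 asv3 > "$workdir/taxonomy_ids.txt"

# comm needs sorted input; -23 keeps lines found only in the first file,
# i.e. feature IDs in the table that have no taxonomy assigned.
sort "$workdir/table_ids.txt" > "$workdir/table_ids.sorted"
sort "$workdir/taxonomy_ids.txt" > "$workdir/taxonomy_ids.sorted"
missing="$(comm -23 "$workdir/table_ids.sorted" "$workdir/taxonomy_ids.sorted")"
printf '%s\n' "$missing"
rm -rf "$workdir"
```

For the filtering itself, the call would look roughly like qiime feature-table filter-features --i-table table.qza --m-metadata-file taxonomy_hybrid.qza --o-filtered-table table_filtered.qza, where the output name is a placeholder. On 2019.7 the whitespace error discussed earlier in this thread may need to be resolved first, since this step also views the taxonomy as metadata.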

Note that the hybrid classifier should really only be used if your reference database and sequences are trimmed to exactly the same sites, e.g., with extract-reads. The first step of this pipeline performs exact matching between query and reference, so it is not the same as the default classify-consensus-vsearch method... reads will be unclassified if they do not match 100% with at least one reference sequence.

@colinbrislawn @Nicholas_Bokulich Thanks for your reply
I used qiime tools import --type 'SampleData[Sequences]' to import my 16S miTags into QIIME 2, so I can’t use extract-reads. I will try the --p-no-prefilter option.

When I use the --p-no-prefilter option, the error goes away.
But in the classification result, about 10% of reads are classified only to the kingdom level (Bacteria). Could you give me some advice on how to improve this? The 16S miTags come from different variable regions, which makes them difficult to classify.

10% of reads only classifying to kingdom level? Such a low proportion is not too concerning: often some non-target DNA is amplified or introduced by cross-contamination, and it should just be removed (you can spot-check a few of these unclassified ASVs with NCBI BLAST first to see what they are).
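
One way to collect sequences for that spot-check is sketched below. It assumes an exported taxonomy.tsv (Taxon in the second column) and an exported dna-sequences.fasta whose headers are the feature IDs, and it treats a Taxon of exactly D_0__Bacteria as "kingdom only". The sample data and that exact-match rule are assumptions for illustration.

```shell
workdir="$(mktemp -d)"
# Hypothetical stand-ins for the exported taxonomy.tsv and dna-sequences.fasta.
printf 'Feature ID\tTaxon\tConfidence\n' > "$workdir/taxonomy.tsv"
printf 'asv1\tD_0__Bacteria;D_1__Proteobacteria\t0.99\n' >> "$workdir/taxonomy.tsv"
printf 'asv2\tD_0__Bacteria\t0.75\n' >> "$workdir/taxonomy.tsv"
printf '>asv1\nACGTACGT\n>asv2\nTTGGCCAA\n' > "$workdir/dna-sequences.fasta"

# First file (taxonomy.tsv): remember IDs whose Taxon is exactly D_0__Bacteria.
# Second file (FASTA): print header and sequence lines for those IDs only.
picked="$(awk -F'\t' '
  NR == FNR { if (FNR > 1 && $2 == "D_0__Bacteria") keep[$1] = 1; next }
  /^>/      { id = substr($0, 2); sub(/[ \t].*/, "", id); show = (id in keep) }
  show
' "$workdir/taxonomy.tsv" "$workdir/dna-sequences.fasta")"
printf '%s\n' "$picked"
rm -rf "$workdir"
```

The printed FASTA records can be pasted into NCBI BLAST a few at a time.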

It is also possible that some of these reads are not classifying because of the mixed amplicons. You can use a full-length 16S classifier to classify these (with classify-sklearn).

Do not bother using the hybrid classifier. It will not be useful for your data unless you use extract-reads to extract all possible primer pairs and merge those data together, since the first step of the classifier uses vsearch with exact match.

@Nicholas_Bokulich Thanks for your patience.

But when I BLAST some of these reads at NCBI, I find that most of them are 16S sequences, so I am confused.
This result, taxa-bar-plots_vsearch (2).qzv (2.3 MB), was classified by classify-consensus-vsearch using the full-length 16S classifier.
This one, taxa-bar-plots_sklearn.qzv (2.5 MB), was classified by sklearn.
The sklearn result seems to have more reads classified only to the kingdom level.
Do you still suggest that I remove all of these reads and recalculate the abundance?
PS: The unassigned reads and the reads classified only to the kingdom level account for about 10% to 20%.

Thanks for sharing your results! Based on these results it sounds like this is probably related to the multi-amplicon protocol that you are using.

Sounds like those are definitely 16S reads (usually this issue indicates non-target DNA but there are exceptions which is why I always recommend checking).

As noted on the training a classifier tutorial, accuracy increases slightly when training on the primer region being targeted. Usually using the full-length classifier does not impact accuracy too much, but I have not tested all 16S domains... it is possible that some domains are impacted by this more than others. It would be very interesting to see if these unclassified/underclassified ASVs all belong to a specific 16S region, or to a specific clade.

Based on the profiles, it looks like removing these probably wouldn't impact the resulting proportions too much, since the unclassified/underclassified ASVs represent such a small fraction.

However, I hate to throw away "good" data if we are able to use it with another method. You could use classify-sklearn and try splitting out the different amplicon regions to train region-specific classifiers, then recombine after classification. But if I were in your shoes, I would use the classify-consensus-vsearch classifier, since it seems to perform better "out of the box" with your protocol. It looks like you could probably improve the results with that classifier a little more, too — maybe try using the --p-top-hits-only option and increase --p-perc-identity a bit.
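
As a starting point, the vsearch run with those two options might look like the sketch below. The input paths are the ones posted earlier in this thread; the 0.9 identity value and the output file name are placeholders to tune, not recommendations. The command is only printed unless RUN=1 is set.

```shell
# Build the command as positional parameters so it can be reviewed first.
# 0.9 and taxonomy_vsearch_tophits.qza are placeholder choices.
set -- qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads /data/zhouyingli/qiime/Silva132_99.qza \
  --i-reference-taxonomy /data/zhouyingli/qiime/Silva132-taxonomy.qza \
  --p-perc-identity 0.9 \
  --p-top-hits-only \
  --p-threads 28 \
  --o-classification taxonomy_vsearch_tophits.qza

# Print the command for review; set RUN=1 in the environment to execute it.
preview="$*"
printf '%s\n' "$preview"
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Comparing the resulting bar plot against the earlier vsearch run should show whether fewer reads stop at the kingdom level.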

Good luck!
