Using a Classifier, Error: Classifier does not support confidence values

aalex · June 13, 2019, 3:46pm

Hello again!

I ran into some issues while trying to extract reads to train my classifier on, so I decided to train it on the whole gene sequences I gathered from the BOLD database (COI genes from Arthropoda). Another forum post suggested that, for COI amplicon data, this might not make a significant difference in the output.

However, I still seem to be goofing something up! I am using the rep-seqs.qza from the denoising step (one of the outputs of DADA2), together with the classifier I trained on the full gene sequences and the taxonomy. My concern is that one of the files might not be used correctly, or that the initial input for them wasn't up to par.

I have checked that my rep-seqs.qza comes from the DADA2 denoising step, and I have reconfirmed the formatting of the classifier training inputs with a colleague, and it seemed reasonable.

When I try to run the following command:

qiime feature-classifier classify-sklearn --i-classifier again_classifier.qza --i-reads Qiime_try/rep-seqs.qza --o-classification evaluated_seq.qza --verbose > log_file_paired-ends.txt

This is the error I receive:

File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "</home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-338>", line 2, in classify_sklearn
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 212, in classify_sklearn
reads, classifier, read_orientation=read_orientation)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 169, in _autodetect_orientation
result = list(zip(*predict(first_n_reads, classifier, confidence=0.)))
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 45, in predict
for chunk in _chunks(reads, chunk_size)) for m in c)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
if self.dispatch_one_batch(iterator):
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
self.results = batch()
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 52, in _predict_chunk
return _predict_chunk_with_conf(pipeline, separator, confidence, chunk)
File "/home/andreaa/miniconda3/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 68, in _predict_chunk_with_conf
raise ValueError('this classifier does not support confidence values')
ValueError: this classifier does not support confidence values
Plugin error from feature-classifier: this classifier does not support confidence values
See above for debug info.

A previous forum post (Classifier does not support confidence values) resolved this by adding the parameter --p-confidence -1, but that made no difference in my case. I know this error comes up repeatedly, but I don't think mine has the same cause as in the earlier posts.

Nicholas_Bokulich · June 13, 2019, 9:29pm

Hi @aalex,

Bingo. Sounds like a formatting issue in your taxonomy file; see this topic for an explanation and solution:

Let me know if that fixes this problem!

aalex · June 14, 2019, 3:15am

Thank you so much for responding, Nicholas!

The error they ran into seemed to be coming from the fact that they were extracting reads based on the primers, after they had removed the primers from the sequences. I've double checked the taxonomy (really I used an awk/python script to try and make it, so everything should be fairly consistent).

This is it, just in case

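import pandas as pd

def rewrite_taxonomy(filename):
    # Read the tab-separated taxonomy table provided by BOLD.
    df = pd.read_csv(filename, delimiter='\t')
    new_file = "cleaned" + filename[:-4] + ".txt"
    # Write one line per record: sequence ID, a single tab, then the semicolon-delimited ranks.
    with open(new_file, "w") as f:
        for row in df.itertuples():
            print(row.Index)  # progress: index of the row being written
            # Note: missing ranks (NaN) are written as the literal string "nan".
            f.write(
                "{0}\tk__Animalia;p__{1};c__{2};o__{3};f__{4};g__{5};s__{6}\n".format(
                    row.processid.strip(), row.phylum_name, row.class_name, row.order_name,
                    row.family_name, row.genus_name, row.species_name))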

This looks a little clunky and I'm sorry about that, but it essentially reads through the BOLD-provided taxonomy and writes a TSV with one column holding the sequence ID and a second holding the taxonomy associated with that ID. Because I'm not extracting any reads based on the primers I used, I'm thinking it's a different issue (likely still formatting).

I wanted to check if I was using the correct rep-seqs.qza (I was worried it was just the inappropriate input), but I did some of the introductory phylogenetic operations on the artifact from the moving pictures tutorial and there weren't any problems that came up.

Thank you again, for your help!!

Nicholas_Bokulich · June 14, 2019, 4:56am

Thanks for sending along your awk script to clarify your process!

Would you mind sending along your reference sequences and taxonomy? And your query sequences or a minimum subset of these sequences that cause this error? You can send a direct message to me if you do not want to post these publicly here.

Nicholas_Bokulich · June 15, 2019, 6:33pm

Hey @aalex,
One more little step to diagnose this issue.

Could you please use this one-liner to make sure you have the same number of ranks on each line, and let us know the result?

awk -F\; '{print NF-1}' reference_taxonomy.txt | sort -u

That will tell you the number of semicolon-delimited ranks in each line. If you receive more than one number in the output, then we can work to find the line that is shorter (or longer) than the rest.
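For anyone who would rather do this check in Python, a rough equivalent of the one-liner might look like the sketch below. The function names and the example lines are made up for illustration, and unlike the awk command it skips empty lines rather than reporting -1 for them.

```python
from collections import Counter


def rank_count_summary(lines):
    """Tally how many non-empty lines contain each number of semicolons."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            counts[line.count(";")] += 1
    return counts


def lines_with_unexpected_count(lines, expected):
    """Return (line_number, line) pairs whose semicolon count is not `expected`."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line and line.count(";") != expected:
            bad.append((number, line))
    return bad


# Illustrative input; in practice you would pass the lines of your taxonomy file.
example = [
    "ID1\tk__Animalia;p__Arthropoda;c__Insecta\n",
    "ID2\tk__Animalia;p__Arthropoda\n",
]
print(sorted(rank_count_summary(example)))      # distinct counts, like `sort -u`
print(lines_with_unexpected_count(example, 2))  # the lines that differ
```

If the first call reports more than one distinct count, the second function points at the offending line numbers directly. When reading from a file, load the lines into a list first so both functions can iterate over them.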

I know, I know, your python script should be outputting the same number of ranks on every line, but let's just make sure.

Thanks!

devonorourke · June 15, 2019, 7:45pm

@aalex @Nicholas_Bokulich,
Two quick thoughts:

Have you tried using QIIME's VSEARCH or BLAST alignment tools to classify your rep-seqs yet? If your formatting is correct, it should work for that system, and you can look at the taxonomies to see if most ASVs are getting classified to something reasonable. Then you'll at least know the Naive Bayes training is working with a properly formatted file.
If you already did that, I wonder about the classifier training file and the associated log. If you used all the arthropod sequences from BOLD like I have, that makes a classifier artifact of about 500 Gb when you finish the training part. Did your log mention anything about the number of seqs used in the training? My initial attempts using primer sequences failed miserably, but I remember that log file mentioning how many files made it through trimming. Maybe it's a quick check on whether your classifier contains all the sequences you expect.

One other idea: did your Python script remove duplicate sequences? That is, have you dereplicated the reference dataset? I wonder if having multiple copies of the same sequence is an issue at all. Nick would know.

Nicholas_Bokulich · June 15, 2019, 10:09pm

Good idea. This should not cause an issue unless the seq IDs are duplicated, but even then I believe we would get a different error. @aalex could you please also run qiime tools validate on the reference sequence and taxonomy QZAs? That might yield some useful information.

Very good call @devonorourke: those tools should work even if the taxonomy is in a shambles. It is usually only classify-sklearn that gets very nitpicky about every line having the same number of ranks.

aalex · June 16, 2019, 3:09pm

Thank you both so much for your help,

I ran the one-liner, and there was only one output, 6. So it seems that the ranks are even! I likewise ran qiime tools validate:

qiime tools validate ref-taxonomy.qza
Result ref-taxonomy.qza appears to be valid at level=max.

and

qiime tools validate new_fasta.qza
Result new_fasta.qza appears to be valid at level=max

I will try to use the vsearch alignment tool for this, see if it produces any error.

aalex · June 19, 2019, 3:23pm

While running

qiime feature-classifier classify-consensus-vsearch

on my sequences, the program seemed fine. It took a few hours to run on 3 cores, but I don't think that's unusual. I think I found a source of the problem, and it might be my taxonomy file. Taking @devonorourke's advice, I classified with the vsearch aligner and then tried to collapse the taxa to level 5, but I get an error if I ask to collapse to any level other than 1.

Plugin error from taxa:
Requested level of 3 is larger than the maximum level available in taxonomy data (2).
See above for debug info.

I had separated each level of the taxonomy only by a semicolon; should it have been in this format instead?

ID\tk__;\tp__ .... etc?

devonorourke · June 19, 2019, 3:41pm

My taxonomy file looks like this:
taxaID --tab-- taxonomy string
Notably, there is only a single tab in the entire line.

What you wrote above suggests that you have more than one tab. Are you putting a tab between every taxonomic rank? Like:

someID\tk__Animalia\t;phylum__Arthropoda\t;c__Insecta\t;o__Megaloptera\t;

Notice how I have \t between every group in the line above?

I think the way that the taxonomy file is parsed by QIIME is using two separate delimiters: the file is supposed to be a 2-column file (ID for col1, tax string for col2). My guess is the code uses the \t tab delimiter for parsing columns.
Then, to parse the levels of the taxonomy string it'll need to split by a different delimiter, in this case the ; semicolon.

Your error above is telling you it only sees two levels of taxonomy, which makes me think it's parsing your taxonomy strings in a way you don't want, and that's probably due to the extra tabs you don't need.
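To make the two-delimiter idea concrete, here is a minimal sketch of that kind of parsing. It is not QIIME's actual parser, just an illustration of the guess above, and parse_taxonomy_line is a made-up name.

```python
def parse_taxonomy_line(line):
    """Split one taxonomy line into (feature_id, [ranks]).

    Columns are separated by a tab; ranks inside the second column
    are separated by semicolons.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(
            "expected exactly 2 tab-separated columns, found %d" % len(fields)
        )
    feature_id, taxon = fields
    ranks = [rank.strip() for rank in taxon.split(";")]
    return feature_id, ranks


# A well-formed line: one tab, then semicolon-delimited ranks.
feature_id, ranks = parse_taxonomy_line(
    "someID\tk__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera\n"
)
print(feature_id, len(ranks))

# A line with a tab between every rank is rejected by this sketch.
try:
    parse_taxonomy_line("someID\tk__Animalia\t;p__Arthropoda\t;c__Insecta")
except ValueError as error:
    print("rejected:", error)
```

Counting the entries in ranks for a few lines is a quick way to see how many levels a parser like this would find.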

Try stripping out all the tabs from the second field of data and you should be okay.

There's got to be a sed one-liner to do this, but all I can think of for now is:

cat taxonomy.file | \
sed 's/\t/|/1' | sed 's/\t//g' | sed 's/|/\t/' > taxonomy.file2

You'll change the first instance of the tab to a pipe, then replace all the remaining tabs with nothing, then change the pipe back to a tab.

Don't trust my BASH skills; definitely copy the taxonomy file before attempting any edits. Good luck!
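The same cleanup can be sketched in Python, which avoids the temporary pipe character (and so doesn't depend on the data containing no pipes of its own). strip_extra_tabs is a made-up helper name, and the example line is illustrative only.

```python
def strip_extra_tabs(line):
    """Keep the first tab (ID/taxonomy separator) and drop any later tabs."""
    newline = "\n" if line.endswith("\n") else ""
    # partition splits on the first tab only; `sep` is "" if the line has no tab.
    feature_id, sep, taxon = line.rstrip("\n").partition("\t")
    return feature_id + sep + taxon.replace("\t", "") + newline


messy = "someID\tk__Animalia\t;p__Arthropoda\t;c__Insecta\t;o__Megaloptera\n"
print(repr(strip_extra_tabs(messy)))
```

Applied line by line while writing to a new file, this leaves the original taxonomy file untouched, in the same spirit as the copy-first advice above.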

aalex · June 19, 2019, 5:15pm

Hey Devon!

My taxonomy is currently in the format of:
ID\tk__;p__;...

I was just wondering if that was wrong! It seems like it isn't though. Sorry if that wasn't clear. There are no spaces between the semi-colons, but I refer to phylum as "p", rather than how you have it. Could that be the source of error?

devonorourke · June 19, 2019, 5:24pm

Can you paste the first 3 lines of text of the taxonomy file please?

aalex · June 19, 2019, 5:59pm

ARONW839-15 k__Animalia;p__Arthropoda;c__Arachnida;o__Araneae;f__Salticidae;g__Eris;s__Eris militaris
BBUSU277-15 k__Animalia;p__Arthropoda;c__Arachnida;o__Araneae;f__Tetragnathidae;g__Tetragnatha;s__Tetragnatha elongata
CNFNR1205-14 k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Cecidomyiidae;g__nan;s__nan

(Thank you again for all your help!)

I formatted the taxonomy file entirely with the python script I posted before (I referred to it as my awk script; that was a slip!), so there is only a tab between the ID and the taxonomy.

devonorourke · June 19, 2019, 6:19pm

I think you might need to take this up with the higher-ups in the QIIME dev team.

Two totally random ideas:

Maybe there's an encoding issue? Are you using a UNIX type environment or Windows? Are you using the same kind of OS to create the taxonomy file as you are for applying it in QIIME?

You could try:

file {path/to/taxonomy.file}

Just double check that the encoding is what you think it should be (and what QIIME wants). Sometimes programs have failed on me because something I copied from another place had some Windows <--> Unix issues.
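If the file command isn't conclusive, a small Python check can look for the usual Windows/Unix culprits directly in the raw bytes. This is only a sketch; inspect_text_bytes is a made-up name, and it reports just a byte-order mark, carriage returns, and non-ASCII bytes.

```python
def inspect_text_bytes(data):
    """Report common cross-platform text problems found in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "utf8_bom": data.startswith(b"\xef\xbb\xbf"),   # byte-order mark at file start
        "crlf_line_endings": crlf,                      # Windows-style line endings
        "bare_cr": data.count(b"\r") - crlf,            # stray carriage returns
        "non_ascii_bytes": sum(byte > 127 for byte in data),
    }


# Illustrative bytes; for a real file, read it in binary mode first, e.g.
#     with open("path/to/taxonomy.file", "rb") as handle:
#         report = inspect_text_bytes(handle.read())
print(inspect_text_bytes(b"ID1\tk__Animalia;p__Arthropoda\r\n"))
```

A clean Unix text file would be expected to show no BOM and zero carriage returns.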

Assuming the encoding is correct and your taxonomy file is formatted properly (it appears to be), can you sift through the VSEARCH taxonomy output and tell me whether or not you have more than 2 levels of taxonomy applied to any ASV that was classified? In other words, do you see anything that is assigned with a Species name, or are all your VSEARCH-classified taxa stuck at the Phylum rank?

You can probably export the .qza and figure that out a number of ways; here's one:

qiime tools export --input-path {VSEARCH.output.qza} --output-path tmp 
cat ./tmp/taxonomy.tsv | grep 's__.' | head
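A slightly more detailed view than grep is to tally the deepest rank each feature reached. The sketch below assumes the exported taxonomy.tsv has a single header row and the taxonomy string in its second column; the function names are made up, and the rank prefix is simply whatever precedes the double underscore.

```python
import csv
from collections import Counter


def read_taxonomy_rows(path):
    """Read an exported taxonomy TSV, skipping the (assumed) single header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)
        return [row for row in reader if len(row) >= 2]


def deepest_rank_counts(rows):
    """Count features by the prefix of the last non-empty rank in column two."""
    counts = Counter()
    for row in rows:
        ranks = [rank.strip() for rank in row[1].split(";")]
        # Ignore empty entries and bare prefixes such as "s__".
        filled = [rank for rank in ranks if rank and not rank.endswith("__")]
        deepest = filled[-1].split("__")[0] if filled else "unassigned"
        counts[deepest] += 1
    return counts


# Illustrative rows; for real data use read_taxonomy_rows("tmp/taxonomy.tsv").
rows = [
    ("asv1", "k__Animalia;p__Arthropoda;c__Insecta", "1.0"),
    ("asv2", "k__Animalia", "1.0"),
]
print(deepest_rank_counts(rows))
```

If every feature lands under k, the classifications stop at kingdom, which is the same thing an empty grep for 's__.' indicates.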

aalex · June 19, 2019, 10:49pm

All my vsearch-classified taxa are stuck at the kingdom level! If I try to add --p-level 2 as a parameter, I receive an error.

I'm in a Unix environment! Everything has been run in the Unix environment, unless there's something wrong with the sequences I'm trying to use (but this issue with the taxonomy makes me think otherwise).

(edit) I've run the last two lines of code you gave me, and it produced no output! I changed

grep 's__.' | head

to

grep 'k__.' | head

and that worked.

devonorourke · June 20, 2019, 1:28pm

Do you think you could put the entire .qza taxonomy file and sequence file used in the VSEARCH classifier in a Dropbox (or equivalent) account that I could check out?

I've been putting my COI databases in a few Open Science Framework repositories, as they allow individual files under 1Gb for free.

aalex · June 20, 2019, 2:23pm

I think I actually found the source of the issue! I compared my file once again to what the tutorial for QIIME2 on training the classifier provides, and there are spaces between the semi-colons!

I'm training the classifier again, and running what I had for vsearch to see if that will work better!

Nicholas_Bokulich · June 20, 2019, 2:37pm

Spaces between semicolons should not matter.

The issue is clearly with the taxonomy, but I am not certain what it is. Devon or I will need to see the full taxonomy file.

The fact that vsearch is not working either is really suspicious. It tells me that either:

there are serious issues with the taxonomy file
your query sequences do not match the reference sequence site (I know, silly suggestion but just enumerating all possible scenarios)

aalex · June 20, 2019, 3:09pm

VSEARCH did work, but the taxonomy only goes to Kingdom - so there is an output, but when I try to collapse the taxa, no other level is present.

qiime tools export --input-path {VSEARCH.output.qza} --output-path tmp
cat ./tmp/taxonomy.tsv | grep 's__.' | head

produced nothing, but

qiime tools export --input-path {VSEARCH.output.qza} --output-path tmp
cat ./tmp/taxonomy.tsv | grep 'k__.' | head

did output the reads, the kingdom, and what I took to be a confidence score. I'll also upload the taxonomy.qza file to the Google Drive link I previously sent, @Nicholas_Bokulich

devonorourke · June 20, 2019, 4:30pm

Can you please upload the VSEARCH output too?