How do I create a taxonomic table from a FASTA file?

Hello everybody,

I am a new QIIME2 user. I want to work with functional genes, and I am starting with the nifH gene (dinitrogenase reductase). I recently downloaded the database (a FASTA file), but it did not come with a taxonomic table; it contains only a code name and a sequence for each entry:

Name
sequence

Therefore I can’t use QIIME without a taxonomic table, and I want to create a new one. I tried to find commands for creating a table in QIIME2 and searched this forum, but I didn’t find anything (or maybe it is there and I didn’t understand it). If you can refer me to a specific message that covers this, or explain what I can do, I would appreciate it.

Thanks and sorry for my English.

Hi @EGvibrio,
Welcome to the forum!
Sounds like a neat use of Qiime2. You should be fine without annotations: Qiime2 doesn’t actually require you to have “taxonomies” for most of its analyses, so you shouldn’t run into any issues. Your features would simply be called whatever that code name is in the table. Have you tried any specific commands that have failed so far? If so, could you please provide us with some more details, for example a few lines of your fasta file, what version of Qiime2 you are using, what exact commands you typed, and the full error message you receive? Thanks!


No, I didn’t try but if I am not mistaken, creating the classifier will require a reference taxonomy. Will it work without a reference taxonomy?

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

Thank you very much for your answer (and sorry for a late response, I was away from my computer)

Hi @EGvibrio and welcome to the forum!

This is a great question and you are correct, you need to have a taxonomy file to classify sequences, since that file contains the taxonomic annotation information. The bad news is that if one does not already exist you will need to make it and there is not necessarily an easy way to do this.

FORMAT:
a tab-delimited (TSV) file in the format:
feature ID<TAB>semicolon;delimited;taxonomy;label

The taxonomy label does not need to be hierarchical or semicolon-delimited but it should be to work seamlessly with QIIME 2. This is because various methods, including taxonomy classifiers, use the hierarchical information to do things like consensus taxonomy classification. But if you don’t want to do the hard work of figuring out the full taxonomic lineage for all of your reference sequences you could do something like the following:

feature ID<TAB>description

or even

feature ID<TAB>feature ID

if the feature ID is the taxonomic label.
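As a rough illustration of that fallback, here is a minimal Python sketch that writes one `feature ID<TAB>label` row per FASTA record, using the header description when there is one and the ID itself otherwise. The function names and the example records are made up for illustration, and it assumes the feature ID is the first whitespace-delimited token of each header line.

```python
import io

def placeholder_taxonomy(fasta_lines):
    """Yield (feature_id, label) pairs from the header lines of a FASTA file.

    The label is the header description if present, otherwise the ID itself.
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].strip().split(None, 1)
        if not parts:
            continue  # header with no ID at all
        feature_id = parts[0]
        label = parts[1] if len(parts) > 1 else feature_id
        yield feature_id, label

def to_tsv(pairs):
    """Format (feature_id, label) pairs as headerless tab-separated rows."""
    return "".join(f"{fid}\t{label}\n" for fid, label in pairs)

# Hypothetical two-record FASTA, used only to exercise the functions.
example = io.StringIO(">seq1 Azotobacter nifH\nATGC\n>seq2\nGGCC\n")
tsv = to_tsv(placeholder_taxonomy(example))
print(tsv, end="")
```

For a real file you would pass an open file handle instead of the `io.StringIO` example and write the result to a `.tsv` file.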

The ideal situation though is to figure out that lineage information; and ideally that information already exists in the fasta header of your file. E.g., many sequences from NCBI have descriptions as part of the header line, which may include taxonomic info like this:

A82902A  Description of sequence | now;maybe;some;taxonomy;information?
AGCTTGATCGTAGCTAGCTAGCTAGCTAGCTGATCGATCAGTCGATCGTAGC

In which case you could transform that file to create a separate taxonomy file that looks like this:
A82902A<TAB>now;maybe;some;taxonomy;information
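A minimal sketch of that transformation in Python, assuming the header layout from the made-up example above (ID first, then free text, then the lineage after a `|`). The function name is hypothetical, and real databases will likely need a different split rule.

```python
def header_to_taxonomy(header):
    """Return (feature_id, lineage) parsed from a FASTA header, or None.

    Assumes the illustrative layout '>ID free text | rank1;rank2;...'.
    """
    body = header.strip().lstrip(">")
    # Split on the last '|' so that the description may contain earlier ones.
    left, sep, lineage = body.rpartition("|")
    tokens = left.split()
    lineage = lineage.strip()
    if not sep or not tokens or not lineage:
        return None  # no '|' section, no ID, or empty lineage
    return tokens[0], lineage

header = ">A82902A  Description of sequence | now;maybe;some;taxonomy;information"
result = header_to_taxonomy(header)
print("\t".join(result))
```

Headers that return `None` are the ones you would still have to annotate by hand or with one of the fallback labels.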

You can also look at the “data resources” page at qiime2.org to see some examples (look at the Greengenes and SILVA reference databases and not the pre-trained classifiers)

Good luck!


Hi everybody,

In the end, I managed to create a classifier with a table that I created myself.
However, I am now trying to classify with the sklearn method. First, I got an error message that the classifier does not support confidence values, so I disabled confidence:

qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dada2.qza \
  --i-classifier classifier2017.qza \
  --p-confidence disable \
  --o-classification taxonomy_2017.qza

but it still says that the classifier does not support confidence values.
I am also attaching the debug output (I am using QIIME 2 2019.10).

Thanks and stay healthy.

Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
    results = action(**arguments)
  File "</home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/decorator.py:decorator-gen-347>", line 2, in classify_sklearn
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in callable_executor
    output_views = self._callable(**view_args)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 215, in classify_sklearn
    reads, classifier, read_orientation=read_orientation)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 170, in _autodetect_orientation
    result = list(zip(*predict(first_n_reads, classifier, confidence=0.)))
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 45, in predict
    for chunk in _chunks(reads, chunk_size)) for m in c)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/parallel.py", line 1003, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/parallel.py", line 834, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/parallel.py", line 753, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 201, in apply_async
    result = ImmediateResult(func)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 582, in __init__
    self.results = batch()
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 52, in _predict_chunk
    return _predict_chunk_with_conf(pipeline, separator, confidence, chunk)
  File "/home/qiime2/miniconda/envs/qiime2-2019.10/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 68, in _predict_chunk_with_conf
    raise ValueError('this classifier does not support confidence values')
ValueError: this classifier does not support confidence values

Hi @EGvibrio,

:partying_face:

As I think you’ve probably read elsewhere on the forum, this is because your custom taxonomy file contains uneven ranks. If it is at all possible to create even taxonomic ranks (i.e., same number of semicolon-delimited levels in each reference taxonomy label), this would be the “best” way to address this, if potentially more time-consuming (best because then you can classify with confidence, instead of just finding the top hit).
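To see whether a taxonomy file has uneven ranks, something like the following Python sketch can count how many semicolon-delimited levels each label has. The function name and the example rows are invented for illustration, and it assumes a headerless `feature ID<TAB>label` file.

```python
from collections import Counter

def rank_depth_counts(tsv_lines):
    """Count how many labels have each number of semicolon-delimited levels."""
    counts = Counter()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        _feature_id, _, label = line.partition("\t")
        # Empty levels (e.g. a trailing ';') are not counted as ranks.
        levels = [lvl for lvl in label.split(";") if lvl.strip()]
        counts[len(levels)] += 1
    return counts

# Hypothetical rows: two labels with three levels, one with two.
rows = ["s1\tk__A;p__B;c__C", "s2\tk__A;p__B", "s3\tk__A;p__D;c__E"]
depths = rank_depth_counts(rows)
print(dict(depths))
if len(depths) > 1:
    print("uneven ranks: labels do not all have the same number of levels")
```

If more than one depth shows up, the labels are uneven, and the shorter ones would need to be padded out or the longer ones trimmed to a common depth.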

confidence=disable is the quick way to address this, and should work… so the fact that you get this error is a bit concerning (possible bug). So let’s troubleshoot a little bit. Could you do the following:

  1. Could you please install and run with version 2020.2? We will need to work off of the latest release to debug, just to make sure this is not an issue with an outdated release.
  2. Use the --p-read-orientation parameter to set the read orientation of your query sequences relative to your reference sequences. If this runs, I’ve figured out where the bug is creeping in.

Thanks for reporting!

Thank you for your help.
I installed the newer version, 2020.2, and it still does not work. I also set the read orientation, and I got this error message:

index 1 is out of bounds for axis 0 with size 1

and here is the debug:

Traceback (most recent call last):
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
    results = action(**arguments)
  File "</home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/decorator.py:decorator-gen-343>", line 2, in classify_sklearn
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
    output_views = self._callable(**view_args)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 220, in classify_sklearn
    seq_ids, taxonomy, confidence = list(zip(*predictions))
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 46, in predict
    for calculated in workers(jobs):
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/parallel.py", line 1004, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/parallel.py", line 754, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 209, in apply_async
    result = ImmediateResult(func)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 590, in __init__
    self.results = batch()
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 52, in _predict_chunk
    return _predict_chunk_without_conf(pipeline, chunk)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 59, in _predict_chunk_without_conf
    y = pipeline.predict(X)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 116, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 420, in predict
    return self.steps[-1][-1].predict(Xt, **predict_params)
  File "/home/qiime2/miniconda/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 78, in predict
    return self.classes_[np.argmax(jll, axis=1)]
IndexError: index 1 is out of bounds for axis 0 with size 1

Is there another classifier I should try? Or how do I fix it?

Thank you, I really appreciate that.

Hi @EGvibrio,
So this is progress; it looks like setting the read orientation is letting you pass to the next step, but now I think the issue appears to be with the query sequences, not the reference database.

The error message seems to imply that one or more of your query sequences are missing. What are the lengths of these sequences? How many are there? Could you run qiime feature-table tabulate-seqs and share the result? (you can send directly to me if you don’t want to share publicly)

Hi @EGvibrio,
Thanks for sending your QZV. Nothing looks out of the ordinary with your query sequences!

So now we are back to the original issue: formatting issues with the reference database are most likely to blame. Formatting these files to specification can be a challenging job! If you want to send me your reference sequences and taxonomy I can take a look at those and try to replicate this error…

However, to get moving ahead with your analysis I recommend using classify-consensus-vsearch instead. If you are disabling the confidence estimation with classify-sklearn then you sort of negate some of the benefits of using that method, and the method will just grab the top hit as the result. At that point, you may as well use classify-consensus-vsearch --p-maxhits 1 to just do top-hit alignment. If that method also raises an error, then it is a good indication that there are serious formatting issues with the reference database, causing issues with both classification methods.

Hi @Nicholas_Bokulich

Thank you. I downloaded a reference database that did not come with the taxonomy table. I re-organized the fasta file and copied the names of sequences to the text file. I think the problem came from the text file. I will send you the files. Thank you.

Just an update for anyone reading: the problem appears to have been caused by formatting issues in the taxonomic database, specifically that the taxonomic labels (not IDs) consisted of integer values.

@EGvibrio please follow up if you confirm that replacing these integers fixes the problem, or if you have any more issues or questions. Thanks!
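For anyone who wants to check their own taxonomy file for the same problem, here is a minimal Python sketch that flags feature IDs whose taxonomy label is made up only of integers. The function name and example rows are hypothetical, and it assumes a headerless `feature ID<TAB>label` file; it only detects the pattern described above and does not reproduce anything QIIME 2 does internally.

```python
def integer_labels(tsv_lines):
    """Return the feature IDs whose taxonomy label contains only integers."""
    flagged = []
    for line in tsv_lines:
        feature_id, _, label = line.rstrip("\n").partition("\t")
        levels = [lvl.strip() for lvl in label.split(";") if lvl.strip()]
        # Flag the row when every level of the label is a (possibly negative) integer.
        if levels and all(lvl.lstrip("-").isdigit() for lvl in levels):
            flagged.append(feature_id)
    return flagged

# Hypothetical rows: 'a' and 'c' have integer-only labels, 'b' does not.
rows = ["a\t12345", "b\tBacteria;Proteobacteria", "c\t7;8"]
bad_ids = integer_labels(rows)
print(bad_ids)
```

Any IDs it reports are candidates for relabeling with text, for example the sequence's own code name or a proper lineage.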