Training classifier with own reference database: error "Invalid value for '--i-reference-taxonomy': 'ref-taxonomy.gza' is not a valid filepath"

Sorry for my previous post, I copied directly from my script and I can now see how awful that looked.
I hope this attempt is better.

Background information
I am using qiime2-2020.8 and accessing this via a conda environment.
I have attempted to search for a similar issue on the forum.

Train feature classifiers with q2-feature-classifier using my own reference database for the functional target gene pmoA (methanotrophs).
Use the trained classifier to assign taxonomy to the methanotrophs in my environmental samples.
To do this, I have a file of representative sequences to which I want to assign taxonomy, and which I need to load into qiime2 as an artefact. I have already imported it because I cannot upload the original fasta file:
rep_set90nonchimeras.qza (241.9 KB)

I have worked through the online tutorial using the example files provided, so I know the code works.
I then attempted to substitute my own files into the code.

I have a reference database for the functional target gene pmoA (methanotrophs) containing the nucleotide sequences and their associated accession numbers: pmoa7809_YangDB.qza (347.5 KB).
I already converted this to a .gza file as I cannot upload the original .fasta file.

I also have the complementary text file containing the accession numbers and the taxonomic lineage
pmoa4rdp1_qiime.txt (1006.1 KB)

I managed to get through all the way until I train the classifier with my own files. It is here that I encounter the following error message,
(1/1) Invalid value for '--i-reference-taxonomy': 'ref-taxonomy.gza' is not a
valid filepath

The code I have run
source activate qiime2-2020.8

Importing rep-seqs into qiime2
qiime tools import \
  --input-path rep_set90nonchimeras.fna \
  --output-path rep_set90nonchimeras.gza \
  --type 'FeatureData[Sequence]'

My representative sequences are now converted to a featureData artefact

Training a classifier in Qiime 2
Create qiime2 artefacts (.gza files)
First we do this for the reference seqs which = the .fasta file
There were some . at the start of sequences in the database, therefore I had to remove these,
sed -i '/^>/! s/\.//' pmoa7809_YangDB.fasta
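
Before editing the database in place, a dot-stripping expression like this can be tried on a couple of made-up records. This is a minimal sketch: the function name strip_leading_dots and the sample records are hypothetical, and the pattern is anchored to the start of the line so that only leading dots are removed.

```shell
# Hypothetical helper: strip leading dots from sequence lines only.
# Header lines (starting with '>') pass through unchanged,
# so dots inside accession numbers are preserved.
strip_leading_dots() {
  sed '/^>/! s/^\.\.*//'
}

# Made-up two-record FASTA: the first sequence starts with dots,
# the second header contains a dot that must survive.
printf '>acc1\n..ATGC\n>acc2.1\nGGCC\n' | strip_leading_dots
```

Reading from standard input keeps the original file untouched until the output looks right.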

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path pmoa7809_YangDB.fasta \
  --output-path pmoa7809_YangDB

Do the same for the associated txt taxonomy file, which I checked doesn’t have a header and looks the same as the example file.

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path pmoa4rdp1_qiime.txt \
  --output-path ref-taxonomy
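
Since the reference sequences and the taxonomy table are linked only by accession number, it may be worth checking that every FASTA ID has a row in the taxonomy file before training. This is a rough sketch under stated assumptions: the helper name ids_missing_taxonomy and the demo records are made up, the ID is taken to be the first whitespace-delimited token of each header, and the taxonomy file is assumed to be a headerless, tab-separated table with the ID in column one.

```shell
# Hypothetical helper: list FASTA IDs that have no row in the taxonomy table.
# $1 = FASTA file, $2 = headerless two-column TSV (ID <tab> lineage).
ids_missing_taxonomy() {
  fasta_ids=$(mktemp)
  tax_ids=$(mktemp)
  # First whitespace-delimited token of each header, without the '>'.
  grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$fasta_ids"
  # First tab-separated column of the taxonomy file.
  cut -f1 "$2" | sort -u > "$tax_ids"
  # comm -23 prints lines that occur only in the first sorted file.
  comm -23 "$fasta_ids" "$tax_ids"
  rm -f "$fasta_ids" "$tax_ids"
}

# Demo on made-up files: acc2 has a sequence but no lineage.
demo_dir=$(mktemp -d)
printf '>acc1\nATGC\n>acc2\nGGCC\n' > "$demo_dir/ref.fasta"
printf 'acc1\tk__Bacteria; p__Proteobacteria\n' > "$demo_dir/tax.txt"
ids_missing_taxonomy "$demo_dir/ref.fasta" "$demo_dir/tax.txt"
rm -rf "$demo_dir"
```

With the real files this would be called as ids_missing_taxonomy pmoa7809_YangDB.fasta pmoa4rdp1_qiime.txt; empty output means every sequence ID was found in the taxonomy file.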

So, I now have the representative seqs I want to assign tax in the .gza format.
I now also have the reference database seqs and the associated taxonomy to those sequences in .gza format.

Extract reference reads
The notes associated with this section suggest not to include the min and max length and the --truncated option when using paired end for non-tax gene.

qiime feature-classifier extract-reads \
  --i-sequences pmoa7809_YangDB.qza \
  --o-reads pmoA_ref-seqs

Train the classifier

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads pmoA_ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.gza \
  --o-classifier classifier.qza

Error message
What is the exact error message? If you didn’t run the command with the --verbose flag, please re-run and copy-and-paste the results.
There was a problem with the command:
(1/1) Invalid value for '--i-reference-taxonomy': 'ref-taxonomy.gza' is not a
valid filepath
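
An error of the form "is not a valid filepath" can be narrowed down by checking that each input path exists exactly as typed. The following is only a sketch: check_paths is a hypothetical helper and is not part of QIIME 2.

```shell
# Hypothetical pre-flight check: report any path that is not an existing file.
# Returns non-zero if at least one path is missing.
check_paths() {
  missing=0
  for path in "$@"; do
    if [ ! -f "$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Show which artifacts are actually present in the working directory.
ls -1 ./*.qza 2>/dev/null || echo "no .qza files here"
```

For the training step above, running check_paths pmoA_ref-seqs.qza ref-taxonomy.gza first would print the path that does not match any file on disk.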

I have tried various iterations of the output filenames, with and without the extension.

For example,
when I look at the file produced by running
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path pmoa7809_YangDB.fasta \
  --output-path pmoa7809_YangDB.gza

In the directory, this file is called pmoa7809_YangDB.gza.gza

Therefore I chose not to add the extension on to the output file name.
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path pmoa7809_YangDB.fasta \
  --output-path pmoa7809_YangDB

So then when I look at this file in the directory, I see pmoa7809_YangDB.gza

I wonder if there is an issue occurring here.
Please let me know if this is clearer?

Hi @XrandallX - can you help us out and format and edit this post? I am having a very difficult time trying to understand what you are asking for here. Please format your post using the following guide:

You can edit the post by clicking the “edit this post” button:

Once you have done that please reply to this post to let us know that you have cleaned this up a bit (and of course, please reply if you have any questions about formatting and editing, we are happy to help).


Hello @thermokarst
I have tried to tidy up my initial post. I am so sorry for the state of that.
Please let me know if this is better or you still need clarification.


Hi @XrandallX,

Choosing not to use the extension in the output file name is absolutely fine, but I would expect the qiime2 output to end with ‘qza’ (or ‘qzv’). Did you rename the files to end in ‘gza’?

Could you type the command ‘ls’ in your working directory and copy-paste the result?

Hope it helps

Dear @llenzi,
Thank you for the suggestion.
I started completely again with new versions of my files and this has now worked!
The issue now lies with my version of the rep-seqs.gza file when I want to test the classifier/assign taxonomy to my unknown OTUs.

My version of this file has not gone through the processing steps detailed here
Instead, I re-ran the clustering of OTUs using my PI’s pipeline and this generated a representative sequences file.

I am unsure if my import method as a qiime2 artefact is correct.
I used the code from the "Per-feature unaligned sequence data (i.e. representative FASTA sequences)" section of the importing tutorial page.

For my original .fna file version, I received an error message when importing rep_set90nonchimeras.fna.
The error concerned lowercase letters, so I ran the following code to change them to uppercase.
sed -i '/^>/! {s/n/-/g; s/\(.*\)/\U\1/g}' rep_set90nonchimeras.fna
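
The \U case conversion is a GNU sed extension, so on systems without GNU sed the uppercasing step could be done with awk instead. This is a sketch only: uppercase_sequences is a hypothetical helper, the sample record is made up, and unlike the sed command it does not replace n with a gap character.

```shell
# Hypothetical helper: uppercase sequence lines, leave '>' headers untouched.
uppercase_sequences() {
  awk '/^>/ { print; next } { print toupper($0) }'
}

# Made-up record: the header keeps its lowercase text,
# only the sequence line is converted.
printf '>otu1 sample_a\nacgtn\n' | uppercase_sequences
```

Redirecting the output to a new file, rather than editing in place, leaves the original .fna available for comparison.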

I then re-ran the code
qiime tools import \
  --input-path rep_set90nonchimeras.fna \
  --output-path rep_set90nonchimeras.gza \
  --type 'FeatureData[Sequence]'

This weirdly produced a file in my directory called rep_set90nonchimeras.gza.gza
If I re-run the code missing the .gza for the output file,
qiime tools import \
  --input-path rep_set90nonchimeras.fna \
  --output-path Xrep_set90nonchimeras \
  --type 'FeatureData[Sequence]'

I get the file name I expect, Xrep_set90nonchimeras.gza

I believe the input file is where the error lies. I have uploaded both output files from the import attempts for your reference.
Kate
rep_set90nonchimeras.gza.qza (241.9 KB)
Xrep_set90nonchimeras.qza (241.9 KB)

Hi @XrandallX

"This weirdly produced a file in my directory called rep_set90nonchimeras.gza.gza
If I re-run the code missing the .gza for the output file,"

The normal extension for a qiime2 artifact is ‘qza’, so if you don’t specify it in the output name, that is the extension you should expect to get: something like ‘rep_set90nonchimeras.qza’ in your case.

If you specify a different extension in your output, ‘qza’ will be added to the end of your output file name, because qiime2 cannot accept any other extension, so in your case I would expect “rep_set90nonchimeras.gza.qza”.

I suspect there is a bit of naming confusion which is reflected in the error you get, which is basically saying that qiime2 can not find the file with the name you specified!
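
The naming rule described above can be written down as a tiny shell function. This is only an illustration of the behaviour as described in this reply, not QIIME 2's own code, and expected_artifact_path is a made-up name.

```shell
# Hypothetical helper: predict the file name written for a given --output-path,
# following the rule above: keep a name that already ends in .qza,
# otherwise append .qza to whatever was typed.
expected_artifact_path() {
  case "$1" in
    *.qza) printf '%s\n' "$1" ;;
    *)     printf '%s.qza\n' "$1" ;;
  esac
}

expected_artifact_path rep_set90nonchimeras
expected_artifact_path rep_set90nonchimeras.gza
expected_artifact_path rep_set90nonchimeras.qza
```

Under this rule the three calls give rep_set90nonchimeras.qza, rep_set90nonchimeras.gza.qza and rep_set90nonchimeras.qza, so a later command has to reference the file by the name that was actually written.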

Could you please type the ‘ls’ command and paste the result in here, so we can see which files are in your working directory?

Hope it helps

This was a silly error on my part. Instead of typing .qza, I typed .gza but I just couldn’t recognise it because I had been staring at the code for too long.
I managed to get the classifier trained to my own reference database and assigned taxonomy to the OTUs. Thanks so much!