Trouble training a classifier - Error "99_otus.fasta is not a(n) DNAFASTAFormat file"

anethdavid · May 23, 2019, 10:42am

Hello,

I hope that you are all doing well.

I'm trying to train a classifier for taxonomic analyses. I have 16S data from soil and would like to use Greengenes. I was following the "Training feature classifiers with q2-feature-classifier" tutorial but got stuck very early on, when trying to import the database into a QIIME 2 artifact.

This is how I downloaded the database:
wget -O "99_otus.fasta" "https://gg-sg-web.s3-us-west-2.amazonaws.com/downloads/greengenes_database/gg_13_5/gg_13_5_otus.tar.gz"

And the taxonomy:
wget -O "99_otu_taxonomy.txt" "https://gg-sg-web.s3-us-west-2.amazonaws.com/downloads/greengenes_database/gg_13_5/gg_13_5_taxonomy.txt.gz"

From website: https://greengenes.secondgenome.com/?prefix=downloads/greengenes_database/gg_13_5/

To import, I used this command:
qiime tools import --type 'FeatureData[Sequence]' --input-path 99_otus.fasta --output-path 99_otus.qza

Which led to the following error:
There was a problem importing 99_otus.fasta:
99_otus.fasta is not a(n) DNAFASTAFormat file

I don't know how to proceed from this. Please help.

Best,
Aneth.

Nicholas_Bokulich · May 23, 2019, 12:33pm

The problem is that you are downloading a tarred/gzipped archive and misleadingly saving it with the extension ".fasta". It is not a fasta because it is not a single file but a whole archive of fasta files! So the format is not correct. Same issue with the taxonomy file.

Solution: use wget to download, then use tar and gunzip to unpack these archives. Select the appropriate files from within the extracted directory. Open those files first to make sure they are correct; FASTA and txt files should be human-readable, so this is a good way to double-check!
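As a rough sketch of those steps (not a definitive recipe): the URL below is the one from the original post, while the function names `fetch_greengenes` and `check_fasta` are hypothetical helpers made up for illustration. The file names passed to `find` are the ones used in this thread and are assumed to exist inside the archive.

```shell
# URL copied from the original post. The archive's internal layout is not
# assumed, so the files are located with find instead of a hard-coded path.
GG_URL="https://gg-sg-web.s3-us-west-2.amazonaws.com/downloads/greengenes_database/gg_13_5/gg_13_5_otus.tar.gz"

# Hypothetical helper: save the archive under its real name, unpack it,
# and list the candidate reference files (names assumed from this thread).
fetch_greengenes() {
  wget -O gg_13_5_otus.tar.gz "$GG_URL"
  tar -xzf gg_13_5_otus.tar.gz
  find . -name '99_otus.fasta' -o -name '99_otu_taxonomy.txt'
}

# Hypothetical helper: report whether a file is still gzip-compressed,
# looks like FASTA (first byte is '>'), or is something else.
check_fasta() {
  if gzip -t < "$1" 2>/dev/null; then
    echo "compressed"
  elif [ "$(head -c 1 "$1")" = ">" ]; then
    echo "fasta"
  else
    echo "unknown"
  fi
}
```

Running `fetch_greengenes` and then `check_fasta` on the path that `find` prints should report `fasta` before the file is passed to `qiime tools import`; a result of `compressed` means the file still needs unpacking.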

Good luck!

anethdavid · May 23, 2019, 2:49pm

Oh, that makes sense.

It worked, thank you very much! However, I extracted the reference reads without the --p-trunc-len and --p-min-length options because my data is paired-end and I wasn't sure which values to use. Are these options required? How does one decide which values to use?

Many thanks,
Aneth.

Nicholas_Bokulich · May 23, 2019, 5:21pm

That's fine. That is actually what is recommended for PE data in the tutorial.

No, those options are not required; see the tutorial for more details.
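For illustration, extracting reference reads for paired-end data without --p-trunc-len or --p-min-length might look like the sketch below. The wrapper name `extract_pe_reads` and the output name `ref-seqs.qza` are hypothetical. The primers are passed as arguments because this thread does not state which ones were used.

```shell
# Hypothetical wrapper around `qiime feature-classifier extract-reads`.
# $1 = forward primer, $2 = reverse primer (supply the primers used for
# your own amplicons; none are given in this thread).
extract_pe_reads() {
  # No truncation or length options are passed, so the full region
  # between the primers is kept.
  qiime feature-classifier extract-reads \
    --i-sequences 99_otus.qza \
    --p-f-primer "$1" \
    --p-r-primer "$2" \
    --o-reads ref-seqs.qza
}
```

It would be called as `extract_pe_reads <forward-primer> <reverse-primer>`, with 99_otus.qza being the artifact imported earlier in the thread.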

anethdavid · May 24, 2019, 1:27pm

Thank you very much, I trained the classifier and assigned taxonomy to my representative sequences successfully.

Regards,
Aneth.