[Newbie] HELP in making a classifier from a FASTA and taxonomy file from an online database

Hi, I am very new to Qiime2 and I would like to create a classifier for the pmoA gene. I downloaded a .fasta and.tax file from an online database [pmoA gene reference database (fasta-formatted sequences and taxonomy)] .

Link to the actual file I downloaded [Index of /download/10.5880.GFZ.5.3.2016.001]

As instructed in this tutorial [Training feature classifiers with q2-feature-classifier — QIIME 2 2018.2.0 documentation], I am trying to import the .fasta file using this script:

qiime tools import
--type 'FeatureData[Sequence]'
--input-path taxonomy/pmoa7809.fasta
--output-path taxnomy/pmoa7809.qza

However, I got this error message:

"There was a problem importing taxonomy/pmoa7809.fasta:
taxonomy/pmoa7809.fasta is not a(n) DNAFASTAFormat file:

Invalid character '.' at position 0 on line 10 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTRYKMSWBDHVN."

To be honest, I am not even sure what is the error here since the author of the database clearly states that this is compatible with qiime2 and the fasta file looks normal to me.

Anyway, I am running this script using Qiime2 in a conda environment using the terminal in Ubuntu 20.04.3 LTS

Any help is very appreciated especially I need this file to complete my master thesis. Thank you very much in advance.

Hi @pkmnsandy, welcome to :qiime2:!

You are certainly on the right track, however there seems to be some incorrect formatting with these reference sequence files. As the error message states:

Invalid character '.' at position 0 on line 10 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTRYKMSWBDHVN."

That is, any characters that are not ACGTRYKMSWBDHVN are considered invalid in an unaligned FASTA file. I tried importing as FeatureData[AlignedSequence] on the off chance it was an alignment, or at least the same length. Obviously, this did not work as the sequences are unaligned and vary in length. But it was worth a shot.

Luckily there are only 3 sequences that are miss-formatted, and they are on lines 10, 12, and 36 of the FASTA file. Here is a snippet of the offending sequences.

>EF591086
.NGGGGATTGGTCA...

>EF591087
.NGGAGACTGGAG...

>L40804
.NTCGGACTGGAA...

Open the FASTA file using a basic text editor like notepad, BBEdit, etc... Then simply remove the leading . in front of these sequences and re-reun the import command as you have it. Then you'll be good to go.

To import the taxonomy file you can run:

qiime tools import --type 'FeatureData[Taxonomy]' \
    --input-path pmoa4rdp_qiime.tax \
    --input-format HeaderlessTSVTaxonomyFormat \
    --output-path pmoa4rdp_qiime.qza

Note the addition of --input-format HeaderlessTSVTaxonomyFormat

Let us know if this works for you.

Also, we have a tool, RESCRIPt, to help with other aspects of making and curating your own reference database. Give it a try. Relevant tutorials are here and here.

-Cheers!
-Mike

Hey just to riff off of this idea:

@pkmnsandy it looks like the files you have consist of a selection of genbank accessions... so you can reformat the taxonomy file to re-download all sequences and formatted taxonomies using RESCRIPt. Why do this? It would give you a little more control (if you want) in deciding which taxonomic ranks you are interested in, and preserve that information in QIIME 2 provenance. Also, you would avoid needing to manually edit the sequences file.

Here I grabbed the first few accessions from the taxonomy file that you linked to. Then I just added a header row, and the taxonomy file now looks like feature metadata that RESCRIPt can parse to download the desired accessions:
pmoa4rdp_qiime.txt (2.9 KB)

Then I input that to RESCRIPt with this action:

qiime rescript get-ncbi-data \
    --m-accession-ids-file pmoa4rdp_qiime.txt \
    --o-sequences /Users/nbokulich/Downloads/seqs.qza \
    --o-taxonomy /Users/nbokulich/Downloads/taxonomy.qza

It worked! But I just tried the first few accessions. If some of the accessions you want are not actually Genbank accessions, RESCRIPt will not be able to grab them with that action.

1 Like

Hi, sorry for the very late reply. I tried this method and now my classifier worked. Thank you so much :slight_smile:

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.