[Newbie] HELP in making a classifier from a FASTA and taxonomy file from an online database

SoilRotifer · February 25, 2022, 2:49pm

Hi @pkmnsandy, welcome to :qiime2:!

You are certainly on the right track, but there seems to be some incorrect formatting in these reference sequence files. As the error message states:

Invalid character '.' at position 0 on line 10 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTRYKMSWBDHVN.

That is, any characters other than ACGTRYKMSWBDHVN are considered invalid in an unaligned FASTA file. I tried importing as FeatureData[AlignedSequence] on the off chance that it was an alignment, or that the sequences were at least all the same length. This did not work, as the sequences are unaligned and vary in length. But it was worth a shot.
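If you would like to check a FASTA file for this kind of problem before importing, here is a minimal Python sketch. It is not part of QIIME 2; the function name is made up for illustration, and it assumes that only the uppercase characters listed in the error message are valid, so lowercase bases are reported too. The demo sequences are invented.

```python
import io

# Characters the error message lists as allowed (assumption: uppercase only).
ALLOWED = set("ACGTRYKMSWBDHVN")


def find_invalid_characters(lines):
    """Yield (line_number, position, character) for each disallowed character.

    Line numbers are 1-based and positions are 0-based, matching the
    wording of the error message. Header lines (starting with '>') and
    blank lines are skipped.
    """
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text or text.startswith(">"):
            continue
        for position, character in enumerate(text):
            if character not in ALLOWED:
                yield line_number, position, character


if __name__ == "__main__":
    # Invented two-record example; the second sequence starts with a '.'.
    demo = io.StringIO(">seq1\nACGTN\n>seq2\n.NGGA\n")
    for hit in find_invalid_characters(demo):
        print(hit)  # prints (4, 0, '.')
```

To run it on a real file, pass an open file handle instead of the demo text, e.g. with open(path_to_your_fasta) as handle, where path_to_your_fasta is whatever your file is called.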

Luckily, there are only 3 sequences that are misformatted, and they are on lines 10, 12, and 36 of the FASTA file. Here is a snippet of the offending sequences.

>EF591086
.NGGGGATTGGTCA...

>EF591087
.NGGAGACTGGAG...

>L40804
.NTCGGACTGGAA...

Open the FASTA file using a basic text editor such as Notepad or BBEdit. Then simply remove the leading . in front of these sequences and re-run the import command as you have it. Then you'll be good to go.
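If you would rather not edit the file by hand, the same fix can be scripted. This is only a rough sketch: the function names and the file paths you pass in are hypothetical, and it assumes the only problem is one or more '.' characters at the very start of a sequence line. Header lines are left untouched, and dots elsewhere in a sequence are not removed.

```python
def strip_leading_dots(lines):
    """Yield FASTA lines with leading '.' characters removed from sequence lines."""
    for line in lines:
        if line.startswith(">"):
            # Header line: pass through unchanged.
            yield line
        else:
            # Sequence line: drop any '.' characters at the start only.
            yield line.lstrip(".")


def clean_fasta(source_path, destination_path):
    """Write a cleaned copy of source_path to destination_path (hypothetical helper)."""
    with open(source_path) as source, open(destination_path, "w") as destination:
        destination.writelines(strip_leading_dots(source))


if __name__ == "__main__":
    # Invented example record with a leading '.' on the sequence line.
    example = [">seq1\n", ".NGGAGA\n"]
    print("".join(strip_leading_dots(example)), end="")
```

Writing to a new file rather than overwriting the original keeps the download intact in case something else needs checking. Point the import command at the cleaned copy afterwards.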

To import the taxonomy file you can run:

qiime tools import --type 'FeatureData[Taxonomy]' \
    --input-path pmoa4rdp_qiime.tax \
    --input-format HeaderlessTSVTaxonomyFormat \
    --output-path pmoa4rdp_qiime.qza

Note the addition of --input-format HeaderlessTSVTaxonomyFormat, which tells QIIME 2 that the taxonomy file has no header row.

Let us know if this works for you.

Also, we have a tool, RESCRIPt, to help with other aspects of making and curating your own reference database. Give it a try. Relevant tutorials are here and here.

-Cheers!
-Mike