Error importing rrn fasta as FeatureData[Sequence]

Hi all,
I am sure this is a simple fix, but for the life of me I can't see the solution!
I am trying to build a database from rrnDB using the rrnDB 5.9 FASTA file, but first I want to dereplicate the sequences with RESCRIPt. To do that I need to import the FASTA file into QIIME 2, but I get an error saying the file is not in DNA FASTA format:

qiime tools import --input-path rrnDB-5.9_16S_rRNA.fasta --output-path rrnDB_16S_rRNA_input.qza --type 'FeatureData[Sequence]'

There was a problem importing rrnDB-5.9_16S_rRNA.fasta:

rrnDB-5.9_16S_rRNA.fasta is not a(n) DNAFASTAFormat file:

ID on line 21 is a duplicate of another ID on line 1.

Here is a snapshot of the FASTA file header:

Methanobacterium formicicum|GCF_000762265.1|NZ_CP006933.1|Chromosome: CP006933.1|283389..284864 +
AGTCCGTTTGATCCTGGCGGAGGCCACTGCTATTGGGTTTCGATTAAGCCATGCAAGTCGAA

I have also tried importing a FASTA file from NCBI with the following header, and I also get a FASTA format error:

NR_177367.1 Natronocalculus amylovorans strain AArc-St2 16S ribosomal RNA, partial sequence
CCTGCCGGAGGTCATTGCTATTGGGATTCGATTTAGCCATGCTAGTTGTACGAGTTTATACTCGTAGCGGAAAGCTCAGT

Can anyone advise how the FASTA file header should be formatted? Or which import command change I should make?
-Michelle

Hello @michb,

"ID on line 21 is a duplicate of another ID on line 1." This is the important part: the file you're trying to import has duplicated headers. Take a look at the headers on these lines and see whether they are in fact duplicated or whether a formatting issue is making the parser think they are. In general the restrictions on headers are pretty minimal: they can't be repeated, must start with a ">", and can't be empty. I think that's it. As for your second example, I won't be able to say what happened unless you post the resulting error.

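To track down which headers collide, a short script along these lines may help. It is only a sketch, and it assumes the importer takes the ID to be everything between the ">" and the first whitespace (that is my understanding of the format check, not something confirmed in this thread). Under that assumption, the rrnDB header above would get the ID "Methanobacterium", so any two sequences from the same genus would clash. The function name find_duplicate_ids is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_ids(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each repeated FASTA ID to the 1-based line numbers where it appears."""
    seen: Dict[str, List[int]] = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            # Assumption: the ID is the header text up to the first whitespace.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            seen[seq_id].append(line_number)
    # Keep only the IDs that occur on more than one header line.
    return {seq_id: nums for seq_id, nums in seen.items() if len(nums) > 1}
```

Passing an open file handle, for example find_duplicate_ids(open("rrnDB-5.9_16S_rRNA.fasta")), should list the colliding IDs with their line numbers, which you can compare against the lines named in the error message.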

Thanks @colinvwood for your quick reply. I completely missed that this was all one error message! I can see now that the ">" character did not paste into my initial question, although it is present in the FASTA file. I'll have to review the header formatting carefully.