Unable to import fasta file into QIIME

bkramer · October 1, 2021, 10:31pm

Hello,

I am trying to import a reference database for the nifH gene into QIIME. The database is available on this GitHub page (https://github.com/moyn413/nifHdada2); the specific file is nifH_dada2_phylum_v1.1.0.fasta

However, when I use the following command in QIIME:

qiime tools import --type 'FeatureData[Sequence]' --input-path /home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_phylum_v1.1.0.fasta --output-path /home/qiime2/Desktop/ZehrTaxonomy/Zehr_nifHTaxonomy.qza

I get the following error:

There was a problem importing /home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_phylum_v1.1.0.fasta:

/home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_phylum_v1.1.0.fasta is not a(n) DNAFASTAFormat file:

ID on line 9 is a duplicate of another ID on line 7.

I'm QUITE certain that this is a formatting issue, but I'm not certain how to solve it...

Any help would be greatly appreciated!!

Ben

thermokarst · October 4, 2021, 2:40pm

Hi @bkramer!

Yes, you're right, it is a formatting issue. Let's take a closer look at the error message:

ID on line 9 is a duplicate of another ID on line 7.

Okay, well, let's peek at lines 7 and 9:

It looks like QIIME 2 wasn't kidding when it said there were duplicate IDs on those lines! I went ahead and highlighted the ID, for clarity - you can see those two IDs are identical.

In QIIME 2 the expectation for a FeatureData[Sequence] Artifact is that each entry (Feature) is uniquely identified.
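To see how many IDs are repeated across the whole file (the importer reports the duplicate on lines 7 and 9), a short Python sketch like the one below could help. It is only an illustration: the function name is made up, and it assumes the ID is the text between ">" and the first whitespace on each header line, which is the usual FASTA convention.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Map each repeated FASTA ID to the 1-based line numbers where it occurs."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        if line.startswith(">"):
            # Assumption: the ID ends at the first whitespace character.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            seen[seq_id].append(lineno)
    # Keep only IDs that were seen on more than one header line.
    return {seq_id: nums for seq_id, nums in seen.items() if len(nums) > 1}
```

Passing an open file handle works, since a file iterates line by line, e.g. `find_duplicate_ids(open("nifH_dada2_phylum_v1.1.0.fasta"))`.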

Is it possible for you to get your hands on a de-replicated version of this database?

Alternatively, you could script out a solution that splits this input into two files: the FeatureData[Sequence] and the FeatureData[Taxonomy], although that is a little bit outside of our scope here, so we might not be able to provide much help there.
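As a rough idea of what such a script might look like, here is a minimal Python sketch. It assumes every header line in the file is just ">" followed by the taxon string, and it invents a new ID (seq1, seq2, ...) for each record; the function name and the ID prefix are hypothetical, not part of QIIME 2 or of the database.

```python
def split_taxon_fasta(lines, prefix="seq"):
    """Return (fasta_text, taxonomy_tsv_text) with a fresh unique ID per record."""
    records = []  # each entry: [taxon_string, [sequence chunks]]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the whole header is the taxon string.
            records.append([line[1:], []])
        elif records:
            records[-1][1].append(line)

    fasta, taxonomy = [], []
    for i, (taxon, chunks) in enumerate(records, start=1):
        seq_id = f"{prefix}{i}"  # made-up unique ID
        fasta.append(f">{seq_id}\n{''.join(chunks)}\n")
        taxonomy.append(f"{seq_id}\t{taxon}\n")
    return "".join(fasta), "".join(taxonomy)
```

The first string would be written to a FASTA file and imported as FeatureData[Sequence]; the second is a two-column, tab-separated table without a header row that could be imported as FeatureData[Taxonomy], most likely with --input-format HeaderlessTSVTaxonomyFormat. Treat that import step as untested here.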

Keep us posted!

bkramer · October 6, 2021, 10:44pm

Thanks so much for your help!

I suspected that this might be due to there being no sequence IDs, so there will inevitably be some sequences that are matched to the same taxonomic level.

A postdoc in our lab was able to add a makeshift ID to each sequence after the lowest taxonomic level identified, specifically by appending ";strainX", where X is the sequence's number (1, 2, 3, ...) counting from the top of the file.

Unfortunately, when I try to import the file into QIIME, I get this error:

There was a problem importing /home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_uniq.fasta:

/home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_uniq.fasta is not a(n) DNAFASTAFormat file:

Multiple consecutive descriptions starting on line 19

Is there something wrong with this approach?

thermokarst · October 8, 2021, 11:21pm

There are sequence IDs - the taxon string!

Sounds like the script used to generate this file is creating invalid FASTA; the error is telling you as much:

Multiple consecutive descriptions starting on line 19

That's saying that, starting on line 19, there are multiple consecutive lines that begin with >, i.e. header lines with no sequence between them.
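If it helps to locate every such spot rather than just the first, something along these lines could be used. It is a sketch with a made-up function name, and it only checks for a ">" line that is directly followed by another ">" line; it is not a full FASTA validator.

```python
def find_consecutive_headers(lines):
    """Return 1-based line numbers of '>' lines directly followed by another '>' line."""
    hits = []
    prev_was_header = False
    prev_lineno = None
    for lineno, line in enumerate(lines, start=1):
        is_header = line.startswith(">")
        if is_header and prev_was_header:
            # The previous header has no sequence line after it.
            hits.append(prev_lineno)
        prev_was_header = is_header
        prev_lineno = lineno
    return hits
```

Running it on the edited file, e.g. `find_consecutive_headers(open("nifH_dada2_uniq.fasta"))`, should list the header lines that lost their sequence, which may point to where the renaming step went wrong.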
