I am trying to import a reference database for the nifH gene into QIIME. It is available on this GitHub page (https://github.com/moyn413/nifHdada2), specifically the file nifH_dada2_phylum_v1.1.0.fasta
However, when I use the following command in QIIME:
It looks like QIIME 2 wasn't kidding when it said there were duplicate IDs on those lines! I went ahead and highlighted the ID, for clarity - you can see those two IDs are identical.
In QIIME 2 the expectation for a FeatureData[Sequence] Artifact is that each entry (Feature) is uniquely identified.
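If it helps to see every repeated ID at once rather than only the first pair reported, here is a minimal Python sketch. It assumes the ID is the header text up to the first whitespace, which may not match exactly how QIIME 2 parses the header; the function name `find_duplicate_ids` is just a hypothetical helper.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {id: [line numbers]} for FASTA IDs that appear more than once."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        if line.startswith(">"):
            # Assumption: the ID is the header text up to the first whitespace.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            seen[seq_id].append(lineno)
    # Keep only the IDs that were seen on more than one line.
    return {seq_id: nums for seq_id, nums in seen.items() if len(nums) > 1}


# Example use on a file (path taken from the post above):
# with open("nifH_dada2_phylum_v1.1.0.fasta") as handle:
#     for seq_id, line_numbers in find_duplicate_ids(handle).items():
#         print(line_numbers, seq_id)
```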
Is it possible for you to get your hands on a de-replicated version of this database?
Alternatively, you could script out a solution that splits this input into two files: the FeatureData[Sequence] and the FeatureData[Taxonomy], although that is a little bit outside of our scope here, so we might not be able to provide much help there.
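For what it's worth, a rough sketch of that splitting approach in Python might look like the following. It assumes each header line in the file holds only the taxonomy string (as in DADA2-formatted references), and it invents IDs of the form `seq1`, `seq2`, and so on; both the `seq` prefix and the function name `split_dada2_fasta` are hypothetical choices, not anything defined by QIIME 2 or the database authors.

```python
def split_dada2_fasta(lines, prefix="seq"):
    """Split taxonomy-as-header FASTA lines into FASTA and taxonomy rows.

    Returns (fasta_records, taxonomy_rows), both lists of strings.
    """
    records = []  # each entry: (new_id, taxonomy_string, [sequence chunks])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Hypothetical ID scheme: prefix plus a running counter.
            seq_id = f"{prefix}{len(records) + 1}"
            records.append((seq_id, line[1:].strip(), []))
        elif records:
            # Sequence may be wrapped over several lines; collect the chunks.
            records[-1][2].append(line)
    # Headers that have no sequence are skipped in both outputs.
    fasta = [f">{i}\n{''.join(chunks)}\n" for i, _, chunks in records if chunks]
    taxonomy = [f"{i}\t{tax}\n" for i, tax, chunks in records if chunks]
    return fasta, taxonomy


# Example use (output file names are placeholders):
# with open("nifH_dada2_phylum_v1.1.0.fasta") as handle:
#     fasta, taxonomy = split_dada2_fasta(handle)
# with open("nifH_seqs.fasta", "w") as out:
#     out.writelines(fasta)
# with open("nifH_taxonomy.tsv", "w") as out:
#     out.writelines(taxonomy)
```

The two output files could then be imported separately, the FASTA as FeatureData[Sequence] and the two-column, tab-separated file as FeatureData[Taxonomy] (I believe the headerless TSV taxonomy format is the one to use for a file without a header row, but please check the importing docs for your QIIME 2 version).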
I suspected that this might be due to there being no sequence IDs, so there will inevitably be some sequences that are matched to the same taxonomic level.
A postdoc in our lab was able to add a makeshift ID to each sequence by appending ";strainX" after the lowest taxonomic level identified, where X is just the sequence's number (1, 2, 3, ...) counting from the top of the file to the bottom.
Unfortunately, when I try to import the file into QIIME, I get this error:
There was a problem importing /home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_uniq.fasta:
/home/qiime2/Desktop/ZehrTaxonomy/nifH_dada2_uniq.fasta is not a(n) DNAFASTAFormat file:
Multiple consecutive descriptions starting on line 19
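One plausible reading of this message is that a header line (one starting with ">") is followed directly by another header with no sequence in between. A small Python sketch to list such spots is below. The function name `find_consecutive_headers` is hypothetical, blank lines between two headers are treated here as "no sequence", and the line numbering may not match QIIME 2's own count exactly.

```python
def find_consecutive_headers(lines):
    """Return line numbers of headers that are followed by another header
    before any sequence line appears."""
    hits = []
    pending_header = None  # line number of a header still waiting for sequence
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith(">"):
            if pending_header is not None:
                # The earlier header never received any sequence.
                hits.append(pending_header)
            pending_header = lineno
        elif stripped:
            # A non-blank, non-header line counts as sequence.
            pending_header = None
    return hits


# Example use on the edited file from the error message:
# with open("nifH_dada2_uniq.fasta") as handle:
#     print(find_consecutive_headers(handle))
```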