Suggestions for using nifH ARB database for taxonomy assignment in QIIME2

Hi @Mitra_Ghotbi,

I exported your fungene_8.1_nifH_ref_unaligned_nucleotide_seqs.qza as a FASTA file. The issue is that the FASTA headers (one example shown below):

>821566130location=complement(110918..111811),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductaseNifH

are different from your taxonomy headers:

821566130       k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Rhizobiaceae; g__Rhizobium; s__phaseoli Ch24-10

I was going to suggest that you either remove everything after the >821566130 in the FASTA header, or simply insert a space after 821566130. However, this won't work, as there appear to be multiple sequences with the same ID. For example, there are three sequence entries for 821566130:

>821566130location=complement(110918..111811),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductaseNifH
>821566130location=68723..69616,organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductaseNifH
>821566130location=complement(16224..17117),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductase

Each sequence ID must be unique and correspond to a single unique taxonomy ID. For more details on appropriate IDs, see here.

If you'd like to keep all the sequences, then I'd arbitrarily increment the ID like so:

821566130
821566130.1
821566130.2

or append the location information like this (I am adding 'c' to denote complement). This is similar to how SILVA and other databases handle multiple gene copies from the same organism:

821566130.c110918
821566130.68723
821566130.c16224
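Here is a rough Python sketch of both renaming schemes, in case it helps. It assumes every header looks like the ones above (a numeric ID immediately followed by `location=`, with a single start..end span); the function names `incremented_ids` and `location_id` are just ones I made up for illustration, and headers with more complex locations would need extra handling.

```python
import re

# Assumed header layout: >ID location=[complement(]start..end[)] with no separator after the ID
HEADER_RE = re.compile(r"^>(\d+)location=(complement\()?(\d+)\.\.(\d+)\)?")


def incremented_ids(ids):
    """Return IDs where repeats get .1, .2, ... appended (first copy is unchanged)."""
    seen = {}
    result = []
    for seq_id in ids:
        count = seen.get(seq_id, 0)
        result.append(seq_id if count == 0 else f"{seq_id}.{count}")
        seen[seq_id] = count + 1
    return result


def location_id(header):
    """Build an ID such as 821566130.c110918 from a FASTA header line."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"Unexpected header format: {header!r}")
    seq_id, complement, start, _end = match.groups()
    prefix = "c" if complement else ""  # 'c' marks the complement strand
    return f"{seq_id}.{prefix}{start}"
```

You would apply one of these to each header line while rewriting the FASTA file, and it is worth confirming afterwards that the resulting IDs really are unique.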

Then make sure the IDs in the taxonomy file match those in the sequence file, and you should be good to go.
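As a minimal sketch of that matching step: the snippet below copies each original taxonomy string onto the renamed sequence IDs and reports any sequence ID left without a taxonomy entry. It assumes the original IDs contain no '.', so the part before the first '.' recovers the original ID; `expand_taxonomy` and `missing_ids` are hypothetical helper names, not QIIME 2 functions.

```python
def expand_taxonomy(new_ids, taxonomy):
    """Return tab-separated taxonomy rows, one per renamed sequence ID.

    `taxonomy` maps each original ID to its taxonomy string.
    """
    rows = []
    for new_id in new_ids:
        base_id = new_id.split(".", 1)[0]  # assumes original IDs have no '.'
        rows.append(f"{new_id}\t{taxonomy[base_id]}")
    return rows


def missing_ids(sequence_ids, taxonomy_ids):
    """Return sequence IDs that have no matching taxonomy ID, sorted."""
    return sorted(set(sequence_ids) - set(taxonomy_ids))


taxonomy = {
    "821566130": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
                 "o__Rhizobiales; f__Rhizobiaceae; g__Rhizobium; s__phaseoli Ch24-10"
}
new_ids = ["821566130.c110918", "821566130.68723", "821566130.c16224"]
rows = expand_taxonomy(new_ids, taxonomy)
```

Writing `rows` out one per line gives a taxonomy file with the same IDs as the renamed FASTA, and `missing_ids` should come back empty before you import either file.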

Alternatively, you can try RESCRIPt to make your own nifH reference database. You can look through this tutorial.