Suggestions for using nifH ARB database for taxonomy assignment in QIIME2

Hi
I am new to metagenomics data analysis.
Currently I am working on nifH amplicon data.
I have done quality filtering, reads merging, denoising and ASV construction using DADA2 denoising plugin in QIIME2.
Now I have to assign taxonomy to the ASVs. Currently there is no QIIME-compatible sequence and taxonomy reference database available for the nifH functional gene.

There's a nifH gene sequence database constructed by the jzehr lab, which is in .arb format. My question is how to use databases in ARB format to assign taxonomy. Is there any specific tool or method?

Has anyone used this database for taxonomic assignment?
Can anyone suggest or guide me on how to do taxonomic assignment for nifH amplicon data?
OR
How can I construct a QIIME compatible reference sequence and taxonomy database for nifH gene?
Apologies if I have posted in the wrong place. I have seen some related topics already discussed, but couldn't find a specific solution. I am asking here to get suggestions and guidance from people who have already done this kind of work with nifH data.

Thanks in advance.

Hi @vkk_24!

Not one that we release on the QIIME 2 website, but others must be out there, since there have been a few forum users who report using nifH in QIIME 2.

I recommend contacting the makers of that database to ask them if they have any ideas; they would have the best advice on how to convert to fasta and extract taxonomy information. They might even be interested in making a Q2-compatible release...

Here is another forum topic that is pretty similar to yours and answers at least the second of these questions (and requires starting with fasta). It sounds like @EGvibrio was using a custom database — nevertheless, perhaps @EGvibrio has some advice or has worked with the jzehr nifH database? :

I have not worked with nifH personally, but know of some nifH mock communities on mockrobiota, where the contributor recommended another nifH database released by jzehr, so maybe this would be something useful?: https://wwwzehr.pmc.ucsc.edu/CART_model_public/

Let's see if @EGvibrio or others who have worked with nifH might have any advice!

Thanks @Nicholas_Bokulich,

Hi @vkk_24,

I was in the same boat as you. I have the fasta file, so you can use it for QIIME2, and I can send it to you if needed. But because there is no taxonomy table, I built the table myself by blasting sequences on NCBI. If you want, I can send you the code, or the qza file so you don’t have to bother with the taxonomic assignment yourself.

Hi @EGvibrio ,
Could you also send me a copy of the reference fasta file and the taxonomy table?
The code would be fine too. :joy:

Hi @EGvibrio
Could you also send me the qza file?
thank you!

Hi. My name is Cinthya
I am working on nifH amplicon data.
Now I have to assign taxonomy to the ASVs.
I wonder if you could also send me a copy of the reference fasta file and the taxonomy table to do the taxonomy assignment and follow the same format. Thank you very much!

Hi @cinthya_vieyra

Did you get the nifH reference sequence and taxonomy files? If not, I'll be happy to share them with you. I have created QIIME2-compatible sequence and taxonomy files.

Hi @vkk_24, I am working on nifH sequences now. Could you please also share the qza files? Many thanks!

Hi @vkk_24! I've encountered a similar problem with classifying nifH amplicon sequences. I'm wondering if you would be willing to share copies of QIIME2 compatible files that you created with me? Or are they publicly available somewhere else? Thank you!

Hi @EGvibrio
Could you share your fasta file and code with everyone? :smile: We would really appreciate it!

Hi everyone,
I have successfully built my reference and taxonomy table and converted them to qza files, but I now get an error at the classifier stage:

Plugin error from feature-classifier:

not enough values to unpack (expected 2, got 0)

Debug info has been saved to /tmp/qiime2-q2cli-err-wig3he0i.log

I've attached both files if this would be helpful and here is the script I used

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads fungene_8.1_nifH_unaligned_nucleotide_ref_seqs2.fasta_dev_uppercase.qza \
  --i-reference-taxonomy nifH_K_2310.qza \
  --o-classifier fungene-classifier-k.qza

Any help with solving this problem is highly appreciated.

fungene_nifH_K_taxonomy.txt (321.0 KB)
nifH_Kent_2310.txt (305.9 KB)

Can you share associated sequence fasta or QZA file? If these are too big, feel free to DM me with links to the data (e.g. Dropbox, etc...)

Hi @SoilRotifer ,
Thank you so much for your prompt response. Sure, I've attached both qza files (the reference reads and the taxonomy file).

Please let me know if any other info is required to solve this puzzle
Thanks again for your support

nifH_tax_K.qza (34.5 KB)
fungene_8.1_nifH_ref_unaligned_nucleotide_seqs.qza (279.4 KB)

Hi @Mitra_Ghotbi,

I exported your fungene_8.1_nifH_ref_unaligned_nucleotide_seqs.qza as a FASTA file. The issue is that the FASTA headers (one example shown below):

>821566130location=complement(110918..111811),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductaseNifH

are different from your taxonomy headers:

821566130       k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Rhizobiaceae; g__Rhizobium; s__phaseoli Ch24-10

I was going to suggest that you either remove everything after the >821566130 in the FASTA header, or simply insert a space after 821566130. However, this won't work, as there appear to be multiple sequences with the same ID, i.e. there are three sequence entries for 821566130:

>821566130location=complement(110918..111811),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductaseNifH
>821566130location=68723..69616,organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductaseNifH
>821566130location=complement(16224..17117),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductase

Each sequence ID must be unique and correspond to a single unique taxonomy ID. For more details on appropriate IDs, see here.

If you'd like to keep all the sequences, then I'd arbitrarily increment the ID like so:

821566130
821566130.1
821566130.2

or append the location information like this (I am adding 'c' to denote complement). This is similar to how SILVA and other databases handle multiple gene copies from the same organism:

821566130.c110918
821566130.68723
821566130.c16224

Then make sure the IDs in the taxonomy file match those in the sequence file. Then you should be good to go.
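
As a rough illustration of the location-suffix approach above, here is a minimal Python sketch. It assumes every header follows the layout seen in this thread (numeric ID fused directly to `location=`, with an optional `complement(...)` wrapper); the function names `rewrite_header` and `rewrite_fasta` are hypothetical, and other location formats are not handled.

```python
import re

# Assumed header layout, taken from the examples in this thread:
#   <numeric ID>location=[complement(]<start>..<end>[)],organism=...
HEADER_RE = re.compile(r"^(\d+)location=(complement\()?(\d+)\.\.")


def rewrite_header(header: str) -> str:
    """Return '<new ID> <original description>' for one header (without '>')."""
    match = HEADER_RE.match(header)
    if match is None:
        # Other location formats (e.g. join(...)) are not handled in this sketch.
        raise ValueError(f"unrecognised header: {header!r}")
    seq_id, complement, start = match.groups()
    new_id = f"{seq_id}.{'c' if complement else ''}{start}"
    # A space after the new ID separates it from the description.
    return f"{new_id} {header[len(seq_id):]}"


def rewrite_fasta(lines):
    """Yield FASTA lines with rewritten headers; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + rewrite_header(line[1:].rstrip("\n")) + "\n"
        else:
            yield line
```

The same new IDs would then need to be written into the taxonomy file. Two entries that share an ID, strand, and start position would still collide, so a duplicate check afterwards is worthwhile.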

Alternatively, you can try RESCRIPt to make your own nifH reference database. You can look through this tutorial.

Thanks a million for the great direction. I will try to remove the duplicates using the path you showed. Hopefully, it works.

Thanks again

Hi @SoilRotifer,

Thanks to your direction, I have successfully gone through the whole process of removing duplicates using seqkit, yet the issue persists. I have attached the qza and fasta files. I highly appreciate your help and your time.

Thanks so much!
Mitra

fungene_nifH_MG_unaligned_cleaned_seqs.fasta.gz (118.1 KB)
MG.nifH_taxonomy.qza (34.5 KB)
nifH_ref_unaligned_MG.seqs.qza (274.8 KB)
TAXAMG.txt (328.4 KB)

Hi @Mitra_Ghotbi,

I think you skipped a step. In the FASTA header, we need to either remove everything after the ID, i.e. 821566130.3, or insert a space after the ID (tools often treat any text before the first whitespace character as the sequence ID). That is, your FASTA header should look like either this:

>821566130.3
ATGTCAGATTTGCGTCAAATCGCATTTTACGGCAAAGGGGGGATCGGCAAGTCCACCACC...

or this:

>821566130.3 Location=complement(16224..17117),organism=RhizobiumphaseoliCh24-10,definition=nitrogenasereductase
ATGTCAGATTTGCGTCAAATCGCATTTTACGGCAAAGGGGGGATCGGCAAGTCCACCACC...
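
To make the whitespace convention concrete, here is a small Python sketch of how a parser that follows it would read the ID. This is an illustration of the convention only, not QIIME 2's actual parsing code, and `fasta_id` is a hypothetical helper name.

```python
def fasta_id(header_line: str) -> str:
    """Return the text between '>' and the first whitespace character."""
    parts = header_line[1:].split(None, 1)
    return parts[0] if parts else ""


# With a space after the ID, only the ID itself is picked up.
with_space = ">821566130.3 Location=complement(16224..17117),organism=RhizobiumphaseoliCh24-10"
# Without a space, the whole header line becomes the "ID".
without_space = ">821566130location=complement(16224..17117),organism=RhizobiumphaseoliCh24-10"

print(fasta_id(with_space))
print(fasta_id(without_space))
```

In the second case the resulting ID cannot match a taxonomy entry keyed by the bare number.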

Also, there are other duplicates in your sequence file. I've listed them below along with the number of times they appear in the FASTA file:

6 >192282182
4 >89332194
4 >315599110
4 >311217923
4 >115515977
3 >90103542
3 >75699950
3 >593022864
3 >39648199
3 >381356398
3 >298231532
3 >219567067
3 >193087197
3 >186463002
3 >146403799
3 >146189981
3 >145554299
2 >86570155
2 >78170183
2 >78165794
2 >6690674
2 >573471959
2 >564888430
2 >564881603
2 >551618373
2 >47118302
2 >4490568
2 >39649375
2 >380683448
2 >378401447
2 >365275296
2 >358635055
2 >343801510
2 >341821300
2 >337757426
2 >336293797
2 >333820987
2 >333762994
2 >333738032
2 >332275097
2 >328927127
2 >315622691
2 >315468987
2 >315447000
2 >312441806
2 >310621426
2 >302191744
2 >295107714
2 >290781780
2 >242129517
2 >241887944
2 >221158880
2 >219544946
2 >219536331
2 >21672293
2 >194310671
2 >194307659
2 >193085153
2 >189419341
2 >184198282
2 >171696369
2 >154158043
2 >152206095
2 >148566298
2 >145204986
2 >126102442
2 >119353206
2 >116077928
2 >114536971

Remember, for each entry in the sequence file you need a corresponding entry in the taxonomy file.
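
A quick way to check both conditions (unique sequence IDs, and matching IDs in the two files) before importing is sketched below in Python. It assumes the taxonomy file is tab-separated with the ID in the first column, and that any header row starts with `Feature ID`; the function names and the file names in the usage note are hypothetical.

```python
from collections import Counter


def read_fasta_ids(lines):
    """Collect the ID (text before the first whitespace) of every FASTA header."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            ids.append(parts[0] if parts else "")
    return ids


def read_taxonomy_ids(lines):
    """Collect first-column IDs from a tab-separated taxonomy file (assumed layout)."""
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("Feature ID"):
            continue  # skip blank lines and an optional header row
        ids.append(line.split("\t", 1)[0].strip())
    return ids


def check_reference(fasta_lines, taxonomy_lines):
    """Return (duplicate FASTA IDs with counts, IDs lacking taxonomy, IDs lacking a sequence)."""
    fasta_ids = read_fasta_ids(fasta_lines)
    tax_ids = read_taxonomy_ids(taxonomy_lines)
    duplicates = {i: n for i, n in Counter(fasta_ids).items() if n > 1}
    missing_taxonomy = sorted(set(fasta_ids) - set(tax_ids))
    missing_sequence = sorted(set(tax_ids) - set(fasta_ids))
    return duplicates, missing_taxonomy, missing_sequence
```

For real files, something like `check_reference(open("seqs.fasta"), open("taxonomy.tsv"))` should work; ideally all three results come back empty before the files are imported.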

-Cheers!
-Mike

I appreciate your super quick response, great patience, and direction. I will recheck the files; I completely forgot to match the IDs in the taxonomy file.

Thanks!
Mitra

Hi @SoilRotifer ,
I can't thank you enough. I eventually built my classifier and assigned the nifH taxonomy. I appreciate your support and direction, and I will share the link to the qza files and the code I used to construct the classifier with QIIME 2 users.

Cheers!
Mitra

Yay @Mitra_Ghotbi! :tada: :fireworks:

I'm glad you have a working classifier! :slight_smile:
