NifH database for taxonomic assignment in qiime2

JThurston · January 19, 2024, 8:05pm

Hi all!

I'm currently working through a Qiime2 pipeline analysing Illumina miseq paired-end amplicon data. I've successfully analysed bacterial (16s) amplicons from importing, filtering through to taxonomic assignment. However, I also have amplicon data for a functional gene (nifH; nitrogenase for N2 fixation) however I'm struggling to find an appropriate method of assigning taxonomy.

For the 16s amplicons, I used the Greengenes2 database (Index of /greengenes_release/2022.10), and for niFH, the suitable sequence database (also dada2 compatible) was made by Heller et al. 2014 and has since been updated (GitHub - moyn413/nifHdada2: dada2 formatted nifH database). I was able to work through extracting 16s sequences from the Greengenes2 database, training a classifier, and then assigning taxonomy to my representative sequences, however I'm not sure where to begin with using the above nifH database to assign taxonomy. Could anyone help? See below with the commands I ran for the 16s analysis. I basically want to do this, but for my nifH amplicons! Thanks!

echo "extracting reads from database"
qiime feature-classifier extract-reads
--i-sequences Databases/Greengenes2/2022.10.backbone.full-length.fna.qza
--p-f-primer GTGYCAGCMGCCGCGGTAA
--p-r-primer GGACTACNVGGGTWTCTAAT
--p-trunc-len 250
--o-reads gg2-v4-bacteria.ref.seqs.qza
--p-n-jobs 12

echo "training classifier on extracted database reads"
qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads gg2-v4-bacteria.ref.seqs.qza
--i-reference-taxonomy Databases/Greengenes2/2022.10.backbone.tax.qza
--o-classifier gg2-v4-bacteria.classifier.trained.qza

echo "assigning taxonomy"
qiime feature-classifier classify-sklearn
--i-classifier gg2-v4-bacteria.classifier.trained.qza
--p-confidence 0.7
--i-reads rep-seqs-dada2-bacteria.qza
--o-classification taxonomy-bacteria.sklearn.qza
--p-n-jobs 12

SoilRotifer · January 19, 2024, 9:01pm

Hi @JThurston,

It appears their FASTA file is simply made from their nifH_dada2_phylum_v2.0.5.csv file. I used the following commands below to successfully construct a classifier based on this data:

Step 1: make taxonomy file:
The commands are, generating an empty file called nifH_dada2_v2.0.5_tax.tsv, then adding the required headers"Feature ID (tab space) Taxon". Then we are skipping the first line of the CSV file and writing out columns 1 (label) and 5 (taxonomy) delimited by a ,. After we have this, we then replace the , with a tab-character.

    touch nifH_dada2_v2.0.5_tax.tsv
    echo -e "Feature ID\tTaxon" >> nifH_dada2_v2.0.5_tax.tsv
    tail -n +2 nifH_dada2_phylum_v2.0.5.csv | cut -f 1,5 -d ',' | sed 's/,/\t/' >> nifH_dada2_v2.0.5_tax.tsv

Step 2: make FASTA file
Here we perform similar tasks as with the taxonomy, except we extract columns 1 (label) and 2 (sequence). Then we write to file with a > in front of each label, then the sequence below each label

    tail -n +2 nifH_dada2_phylum_v2.0.5.csv | cut -f 1,2 -d ',' | sed 's/^/>/' | sed 's/,/\n/'  > nifH_dada2_v2.0.5_seqs.fasta

Step 3: import into QIIME 2

    qiime tools import \
        --input-path nifH_dada2_v2.0.5_tax.tsv \
        --type 'FeatureData[Taxonomy]' \
        --output-path nifH_dada2_v2.0.5_tax.qza

    qiime tools import \
        --input-path nifH_dada2_v2.0.5_seqs.fasta \
        --type 'FeatureData[Sequence]' \
        --output-path nifH_dada2_v2.0.5_seqs.qza

Step 4: train classifier

    qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads nifH_dada2_v2.0.5_seqs.qza \
        --i-reference-taxonomy nifH_dada2_v2.0.5_tax.qza \
        --o-classifier nifH_dada2_v2.0.5_classifier

I skipped the extract-reads step for this example. But this should get you started.

JThurston · January 23, 2024, 5:06pm

Hey @SoilRotifer!

Thanks so much for your (quick!) response. This worked perfectly, thank you.

I'm now struggling to get many assigned taxa to my representative seqs from the dada2 output. I have 6038 unique ASVs, but after taxonomic assignment, I'm only getting 20 taxa present in my bar plots, even at species level (taxonomic level 7). I was also missing a lot of species I was expecting to see, for example Cyanobacteria were relatively high in my bacterial (16s) ampicons, but were only >1% of nifH reads.

As all the sequences in this database are just the nifH sequence, I didn't think the Extract-reads step was necessary. Is this correct? I did try to run this but it ran infinitely.

After training the classifier, I assigned taxonomy using the following;

qiime feature-classifier classify-sklearn
--i-classifier nifH_dada2_v2.0.5_classifier.qza
--p-confidence 0.7
--i-reads rep-seqs-dada2-nifh.qza
--o-classification taxonomy-nifh.sklearn.qza
--p-n-jobs 12

Let me know if you have any ideas why I might be getting very low assigned taxa.

Thanks!
Josh

SoilRotifer · January 23, 2024, 6:43pm

What primer pair did you use to generate the data? Many primer pairs are "leaky" and can amplify off-targets. This is quite common.

So, what do you mean by "only 20 taxa", do you mean only 20 sequences are being classified, or do you mean that many sequences are being classified to 20 taxa? Are you sure the primer pair you are using can equally amplify all the taxa of interest? or are even expected in those samples?

Did you remove chloroplast and mitochondria from your 16S data as outlined here? The reason why I suggest this, is that Chloroplasts evolved from Cyanobacteria. If you only observe the phylum-level taxonomy you might mistake the chloroplast sequences as cyanobacteria, as chloroplasts are defined as being within the phylum Cyanobacteria. Which could partly explain the lack of correspondence between the 16S rRNA gene and NifH data.

Also, keep in mind, not all primer sets are equally efficient in an eDNA setting. It is common to see differences in the number of reads generated, by one primer set (e.g. 16S) versus another primer set (e.g. NifH) for a particular taxonomic group.

JThurston · January 23, 2024, 7:53pm

Hiya,

All excellent points, thank you. I used the very degenerate DVV/IGK3 primers, so I was expecting some amplification of off-targets. That being said, I was also expecting a greater diversity of taxa after taxonomic assignment.

Re 20 taxa; I have 6000 ASVs which were classified to just 20 taxa. This includes taxa which have been assigned to just a high taxonomic group (e.g. Clostridia class) and lower groups within that class, e.g. Lachnospiraceae and Clostridiaceae families. The below bar-chart is of species level for all my samples.

Also, I'm unsure how many of my total sequences were actually assigned to taxa, this might be a next step. Is there a way I could check this? It'd be nice to know as a percentage how many sequences/ASVs were able to be assigned a taxonomy.

Thanks for your help!

SoilRotifer · January 23, 2024, 10:46pm

If you look at the NifH database, you'll notice that the taxonomy only goes to genus level. Thus, I do not think it is possible to obtain species-level assignments with this database, as far as I can tell.

The only thing that might be worth looking into, is to see if your sequences were generated in a mixed orientation? See here. It is not an easy fix, but you can start with vsearch as described in that thread.

I'm not as familiar with the ability of NifH to taxonomically resolve taxa. Perhaps others on the forum can point you in a better direction.

JThurston · January 30, 2024, 12:03pm

Thanks for the suggestions! I'll take another look and see if the sequences were generated in a mixed orientation.

On a separate note, is it possible to show the proportion or percentage of sequences within a sample that were successfully assigned a taxa during the 'qiime feature-classifier' step? This will confirm to me that the taxa seen in the barplots are indeed the true nifH community, whereas I currently don't know whether it's just that the majority of my sequences were unable to be assigned a taxonomy. Does this make sense?

Thanks for your help!

cherman2 · February 9, 2024, 6:05pm

Hi @JThurston

qiime2 taxabar plot could help with this. If you generate a taxabar plot, you could look for the percentage of unassigned features.

But you should be able to do this in R or python by counting the number of features labeled unassigned or d_Bacteria.

Sorry to not be more help!