I'm currently working through a Qiime2 pipeline analysing Illumina miseq paired-end amplicon data. I've successfully analysed bacterial (16s) amplicons from importing, filtering through to taxonomic assignment. However, I also have amplicon data for a functional gene (nifH; nitrogenase for N2 fixation) however I'm struggling to find an appropriate method of assigning taxonomy.
For the 16s amplicons, I used the Greengenes2 database (Index of /greengenes_release/2022.10), and for niFH, the suitable sequence database (also dada2 compatible) was made by Heller et al. 2014 and has since been updated (GitHub - moyn413/nifHdada2: dada2 formatted nifH database). I was able to work through extracting 16s sequences from the Greengenes2 database, training a classifier, and then assigning taxonomy to my representative sequences, however I'm not sure where to begin with using the above nifH database to assign taxonomy. Could anyone help? See below with the commands I ran for the 16s analysis. I basically want to do this, but for my nifH amplicons! Thanks!
It appears their FASTA file is simply made from their nifH_dada2_phylum_v2.0.5.csv file. I used the following commands below to successfully construct a classifier based on this data:
Step 1: make taxonomy file:
The commands are, generating an empty file called nifH_dada2_v2.0.5_tax.tsv, then adding the required headers"Feature ID (tab space) Taxon". Then we are skipping the first line of the CSV file and writing out columns 1 (label) and 5 (taxonomy) delimited by a ,. After we have this, we then replace the , with a tab-character.
Step 2: make FASTA file
Here we perform similar tasks as with the taxonomy, except we extract columns 1 (label) and 2 (sequence). Then we write to file with a > in front of each label, then the sequence below each label
tail -n +2 nifH_dada2_phylum_v2.0.5.csv | cut -f 1,2 -d ',' | sed 's/^/>/' | sed 's/,/\n/' > nifH_dada2_v2.0.5_seqs.fasta
Thanks so much for your (quick!) response. This worked perfectly, thank you.
I'm now struggling to get many assigned taxa to my representative seqs from the dada2 output. I have 6038 unique ASVs, but after taxonomic assignment, I'm only getting 20 taxa present in my bar plots, even at species level (taxonomic level 7). I was also missing a lot of species I was expecting to see, for example Cyanobacteria were relatively high in my bacterial (16s) ampicons, but were only >1% of nifH reads.
As all the sequences in this database are just the nifH sequence, I didn't think the Extract-reads step was necessary. Is this correct? I did try to run this but it ran infinitely.
After training the classifier, I assigned taxonomy using the following;
What primer pair did you use to generate the data? Many primer pairs are "leaky" and can amplify off-targets. This is quite common.
So, what do you mean by "only 20 taxa", do you mean only 20 sequences are being classified, or do you mean that many sequences are being classified to 20 taxa? Are you sure the primer pair you are using can equally amplify all the taxa of interest? or are even expected in those samples?
Did you remove chloroplast and mitochondria from your 16S data as outlined here? The reason why I suggest this, is that Chloroplasts evolved from Cyanobacteria. If you only observe the phylum-level taxonomy you might mistake the chloroplast sequences as cyanobacteria, as chloroplasts are defined as being within the phylum Cyanobacteria. Which could partly explain the lack of correspondence between the 16S rRNA gene and NifH data.
Also, keep in mind, not all primer sets are equally efficient in an eDNA setting. It is common to see differences in the number of reads generated, by one primer set (e.g. 16S) versus another primer set (e.g. NifH) for a particular taxonomic group.
All excellent points, thank you. I used the very degenerate DVV/IGK3 primers, so I was expecting some amplification of off-targets. That being said, I was also expecting a greater diversity of taxa after taxonomic assignment.
Re 20 taxa; I have 6000 ASVs which were classified to just 20 taxa. This includes taxa which have been assigned to just a high taxonomic group (e.g. Clostridia class) and lower groups within that class, e.g. Lachnospiraceae and Clostridiaceae families. The below bar-chart is of species level for all my samples.
Also, I'm unsure how many of my total sequences were actually assigned to taxa, this might be a next step. Is there a way I could check this? It'd be nice to know as a percentage how many sequences/ASVs were able to be assigned a taxonomy.
If you look at the NifH database, you'll notice that the taxonomy only goes to genus level. Thus, I do not think it is possible to obtain species-level assignments with this database, as far as I can tell.
The only thing that might be worth looking into, is to see if your sequences were generated in a mixed orientation? See here. It is not an easy fix, but you can start with vsearch as described in that thread.
I'm not as familiar with the ability of NifH to taxonomically resolve taxa. Perhaps others on the forum can point you in a better direction.
Thanks for the suggestions! I'll take another look and see if the sequences were generated in a mixed orientation.
On a separate note, is it possible to show the proportion or percentage of sequences within a sample that were successfully assigned a taxa during the 'qiime feature-classifier' step? This will confirm to me that the taxa seen in the barplots are indeed the true nifH community, whereas I currently don't know whether it's just that the majority of my sequences were unable to be assigned a taxonomy. Does this make sense?