Taxonomic Data: Using 18S-Nemabase in Qiime2 instead of Silva?

mkweber · January 8, 2024, 8:16pm

Hello!

I am doing a metagenome nematode stool study and would like to use the 18S-Nemabase database in concert with Qiime2 instead of Qiime's Silva classifiers. It has higher resolution for nematode sequences and was adapted from the latest version of Silva.

"To test the efficacy and accuracy of 18S-NemaBase, we compared it to an older but also curated SILVA v111 and the newest SILVA v138 by assigning taxonomies and analyzing the diversity of a nematode dataset from the Western Nebraska Sandhills. We showed that 18S-NemaBase provided more accurate taxonomic assignments and diversity assessments than either version of SILVA, with a much easier workflow and no need for manual corrections. "

Here is a link to the publication: 18S-NemaBase: Curated 18S rRNA Database of Nematode Sequences - PMC

It would be so helpful to get some support on this topic!

SoilRotifer · January 8, 2024, 11:38pm

Hi @mkweber,

Thank you for the reference! I've not used NemaBase, but based on a quick skim of their paper it appears that NemaBase is curated to optimize nematode taxonomy assignment. I think it should serve you well.

It wouldn't hurt to compare against SILVA 138.1. You'd have to follow the SILVA RESCRIPt tutorial yourself to see whether there have been any changes, but I doubt they would be significant.

You can even use GenBank data to make your own database too, and compare... for example there is this tutorial, and this one too.

Just make sure that NemaBase, or any "specific" database you make use of, has "outgroup" taxa. That is, a collection of non-nematode sequences within the reference database. If there are no outgroups, then many non-nematode sequences will be classified as nematodes when they are not. This is often overlooked.
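
As a quick sanity check, the minimal sketch below counts how many lineages in a two-column taxonomy table mention Nematoda and how many do not. The function name `count_outgroups` and the input layout (feature ID, tab, lineage, with an optional `Feature ID` header row) are assumptions for illustration; your exported taxonomy file may be laid out differently.

```shell
# Hypothetical helper: reads a tab-separated taxonomy table on stdin
# (column 1 = feature ID, column 2 = lineage) and reports how many
# lineages mention Nematoda versus everything else (the outgroup).
count_outgroups() {
  awk -F'\t' '
    NF == 0            { next }              # ignore blank lines
    $1 == "Feature ID" { next }              # skip a header row if present
    $2 ~ /Nematoda/    { nematoda++; next }  # lineage mentions Nematoda
                       { outgroup++ }        # anything else is outgroup
    END { printf "nematoda=%d outgroup=%d\n", nematoda, outgroup }
  '
}
```

Run it as `count_outgroups < taxonomy.tsv`. If it reports `outgroup=0`, the reference contains only nematodes and would benefit from added outgroup sequences.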

But I think you'd be just fine using NemaBase.

mkweber · January 9, 2024, 3:15pm

So after looking through the database curation section of the paper, it appears that NemaBase only includes nematodes. How would I go about adding outgroups? Which outgroup species should be included here?

SoilRotifer · January 9, 2024, 8:10pm

You could simply download the premade SILVA sequence and taxonomy files from the Data resources page, and filter the sequences using qiime taxa filter-seqs ... like so:

If we want to provide only eukaryotic outgroup taxa, with no SILVA nematode sequences, we would run the command below. It removes all bacterial and archaeal sequences. It also removes the Nematoda, as we do not want to pollute our new NemaBase database with SILVA's nematode entries. Alternatively, you can remove just the Nematoda and leave the Archaea and Bacteria in place as decoys / outgroups. You may have to play around and see which works best.

qiime taxa filter-seqs \
    --i-sequences silva_sequences.qza \
    --i-taxonomy silva_taxonomy.qza \
    --p-exclude Nematoda,Bacteria,Archaea \
    --o-filtered-sequences  silva_euk_outgroup_seqs.qza

Now that we have our filtered SILVA sequences, we'll want to remove the taxonomy entries for the sequences we removed. So, we can run:

qiime rescript filter-taxa \
    --i-taxonomy silva_taxonomy.qza \
    --m-ids-to-keep-file silva_euk_outgroup_seqs.qza \
    --o-filtered-taxonomy silva_euk_outgroup_taxonomy.qza

Now we have two files we can merge with our NemaBase files: silva_euk_outgroup_seqs.qza and silva_euk_outgroup_taxonomy.qza. This assumes you've been able to import the NemaBase files.

Now we can merge:

qiime feature-table merge-seqs \
    --i-data  silva_euk_outgroup_seqs.qza  nemabase_seqs.qza \
    --o-merged-data nemabase_w_silva_euk_outgroup_seqs.qza

qiime feature-table merge-taxa \
    --i-data silva_euk_outgroup_taxonomy.qza nemabase_taxonomy.qza \
    --o-merged-data nemabase_w_silva_euk_outgroup_taxonomy.qza

Assuming there are similar taxonomic rankings (d__,p__,c__,o__,f__,g__,s__) you should now be able to train your classifier. If needed you can play around with qiime rescript edit-taxonomy ... to fix things.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads nemabase_w_silva_euk_outgroup_seqs.qza \
  --i-reference-taxonomy nemabase_w_silva_euk_outgroup_taxonomy.qza \
  --o-classifier nemabase_w_silva_euk_outgroup_classifier.qza

This should help you get started.

mkweber · January 9, 2024, 8:37pm

This is so helpful!! One issue -- NemaBase has sequence files and a tree file (although, it is not a Nexus file so I am having issues importing it) but no taxonomy file.

Here's a glimpse of the contents of the fasta file, and its taxonomy arrangement.

I pulled the taxonomy information from NemaBase to create a reference taxonomy file, and formatted it so that it would (hopefully) be compatible with Qiime.

This taxonomy file is in BLAST format, but not in Qiime's preferred format. Will this still work or will I run into issues?

SoilRotifer · January 9, 2024, 10:21pm

Hi @mkweber,

Ahh...

We'll have to use RESCRIPt to redownload the SILVA database with the same taxonomic ranks that NemaBase uses (you can provide a list of these ranks to qiime rescript get-silva-data ... and it will pull those ranks). Check out the RESCRIPt SILVA tutorial I linked above. Then follow the instructions I posted earlier about making the 'outgroup' files.

Then we can use rescript edit-taxonomy ..., or equivalent, on both the SILVA and NemaBase taxonomy files to make the taxonomy ranks similar to one another (i.e. remove the rank prefixes from SILVA, or add them to NemaBase) prior to merging them. For example, we'd have to replace NemaBase ranks like NA_superclass with our SILVA formatting of sc__, etc...
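
To illustrate the "add them to NemaBase" direction outside of QIIME 2, here is a minimal sketch that adds SILVA-style rank prefixes to an unprefixed lineage. It assumes a two-column, tab-separated file (ID, then a semicolon-delimited lineage) whose ranks are already in the order d, p, c, o, f, g, s. The function name `add_rank_prefixes` and the `RANK_PREFIXES` variable are hypothetical, and the real NemaBase layout (with extra ranks such as superclass) would need its own prefix list.

```shell
# Hypothetical helper: prefix each rank of a semicolon-delimited lineage.
# Override RANK_PREFIXES to match the ranks your database actually uses.
# Ranks beyond the end of the prefix list are dropped in this sketch.
add_rank_prefixes() {
  awk -F'\t' -v OFS='\t' \
      -v prefixes="${RANK_PREFIXES:-d__ p__ c__ o__ f__ g__ s__}" '
    BEGIN { n = split(prefixes, prefix, " ") }
    {
      m = split($2, rank, ";")
      out = ""
      for (i = 1; i <= m && i <= n; i++) {
        name = rank[i]
        gsub(/^[ ]+|[ ]+$/, "", name)      # trim spaces around each name
        sep = (i > 1) ? "; " : ""
        out = out sep prefix[i] name
      }
      print $1, out
    }
  '
}
```

For example, `add_rank_prefixes < nemabase_taxonomy_raw.tsv > nemabase_taxonomy.tsv` (both file names are placeholders) would produce lineages in the `d__...; p__...` style used by the premade SILVA taxonomy files.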

It shouldn't be hard to coerce these files to be compatible. I am pressed for time at the moment, so not sure when I can help work on this, but if someone else can help feel free to jump in!

SoilRotifer · January 9, 2024, 10:31pm

Hi @mkweber, here is another approach you might consider. Simply classify your data with SILVA 138, then remove any sequences that are not classified as Nematoda using the qiime taxa filter-seqs ... command I mentioned earlier. Then take those SILVA-classified nematode sequences and reclassify them with NemaBase.

This will remove the extra work of formatting the databases to be compatible with each other for combining, and be quicker to do in the short term.
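
The two-step approach could be sketched roughly as follows. This is an outline under assumptions, not a tested pipeline: the wrapper name `two_step_classify`, the `QIIME` dry-run variable, and the output file names are all hypothetical, and it presumes you have already trained a classifier from the NemaBase files on their own.

```shell
# QIIME defaults to the real CLI; set QIIME=echo to print the commands
# instead of running them (a dry run).
QIIME="${QIIME:-qiime}"

# Hypothetical wrapper. Arguments: representative sequences, a SILVA
# classifier, and a NemaBase classifier (all .qza paths).
two_step_classify() {
  reads=$1
  silva_classifier=$2
  nemabase_classifier=$3

  # 1) Classify everything against SILVA.
  "$QIIME" feature-classifier classify-sklearn \
      --i-classifier "$silva_classifier" \
      --i-reads "$reads" \
      --o-classification silva_taxonomy.qza || return 1

  # 2) Keep only the sequences SILVA assigned to Nematoda.
  "$QIIME" taxa filter-seqs \
      --i-sequences "$reads" \
      --i-taxonomy silva_taxonomy.qza \
      --p-include Nematoda \
      --o-filtered-sequences nematoda_seqs.qza || return 1

  # 3) Reclassify just those sequences against NemaBase.
  "$QIIME" feature-classifier classify-sklearn \
      --i-classifier "$nemabase_classifier" \
      --i-reads nematoda_seqs.qza \
      --o-classification nemabase_taxonomy.qza
}
```

Running it with `QIIME=echo` first is a cheap way to check the commands and file names before committing to the real run.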

mkweber · January 10, 2024, 4:03pm

thank you for your help!! I'll send an update with a finalized code solution for any future qiime users who are interested in using NemaBase

nietof · February 7, 2024, 2:04pm

Hi
Just seeing this post. I also have sequence samples for 18S using nematode primers that I need to classify. Is there an already built SILVA 138 classifier, or do I have to build it using RESCRIPt and the sequence and taxonomy files?
@mkweber, were you able to create the sequence and taxonomy files for 18S NemaBase?
Thank you
Fernando

SoilRotifer · February 7, 2024, 2:14pm

Hi @nietof,

You can do either. If you'd like to use a premade classifier you can find them here along with the files used to train them.

nietof · February 10, 2024, 5:17pm

@SoilRotifer
I used the SILVA 138 full-length classifier and didn't get a single nematode hit: a few features were assigned to fungi, a few assignments made no sense, and lots of features were unassigned. For the libraries we used the 18S primers from Kawanobe et al. (2021), Applied Soil Ecology 166:103794: F548_A (AGAGGGCAAGTCTGGTGCC) and R915 (TCCAAGAATTTCACCTC).
nem-taxonomy.qzv (4.3 MB)

SoilRotifer · February 10, 2024, 8:21pm

Hi @nietof, here is a listing of sequences within SILVA for the Nematoda.

Given your output, I think your sequences are not in the same orientation as the reference database. This is a common problem when using the pre-trained classifiers.

You can use RESCRIPt to orient your FASTA / FeatureData[Sequence] sequences. There are two ways to make use of this action:

1) You can simply reverse complement everything in your sequence file:

qiime rescript orient-seqs \
    --i-sequences nematode-seqs.qza \
    --p-threads 4 \
    --o-oriented-seqs oriented-nematode-seqs.qza \
    --o-unmatched-seqs unoriented-nematode-seqs.qza

^^Note the unoriented-nematode-seqs.qza will be empty as we are not using a reference.

2) Try orienting only the sequences that need reorienting, using a reference (e.g. SILVA):

qiime rescript orient-seqs \
    --i-sequences nematode-seqs.qza \
    --i-reference-sequences silva-138-99-seqs.qza \
    --p-threads 4 \
    --o-oriented-seqs oriented-nematode-seqs.qza \
    --o-unmatched-seqs unoriented-nematode-seqs.qza

Note the SILVA reference sequence file can be downloaded from here.
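
For intuition about what reverse complementing does to a single read, here is a minimal sketch using standard tools. The function name `revcomp` is hypothetical; it expects one plain sequence per line (not FASTA records) and only handles A/C/G/T, not IUPAC ambiguity codes.

```shell
# Hypothetical helper: reverse-complement each line of stdin.
revcomp() {
  awk '{
    out = ""
    # Walk the line from its last character to its first.
    for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
    print out
  }' | tr 'ACGTacgt' 'TGCAtgca'   # swap each base for its complement
}
```

For example, `printf 'AACG\n' | revcomp` prints `CGTT`.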

For the moment, I'd suggest trying the first option, then try assigning taxonomy again. Let us know if this returns a better classification.

-Mike

nietof · February 14, 2024, 6:02pm

Hi Mike
I reoriented the sequences using your first script and there wasn't much improvement. I think it looks pretty much the same as the non-oriented sequences.
nem-reoriented-taxonomy.qzv (4.3 MB)
I will try the second script including the Silva sequences?
thank you
Fernando

SoilRotifer · February 14, 2024, 6:46pm

Thank you for the update, @nietof ! If you manually BLAST the sequences via the NCBI website, do you hit any target taxa you expect?

nietof · February 15, 2024, 12:49pm

I have BLASTed some of the features and none are hitting in the phylum Nematoda. I tried the second script, reorienting the sequences using SILVA as the reference sequence database. I get a much lower number of features, and most of them are assigned to fungi (Ascomycota or Basidiomycota), which is expected. There are only a few unassigned features. There is a very weird assignment at the species level to Clostridium tetani, but the phylum is a plant, Phragmoplastophyta. I also have some features assigned to the bacteria domain. I think the nematode signal is so low in the soil that it gets overwhelmed by all the other more abundant organisms. I am not sure if there is anything else I can do other than use a nematode-only database and see what happens.
Thank you
Fernando
silva-oriented-nem-taxonomy.qzv (1.4 MB)

SoilRotifer · February 15, 2024, 3:47pm

My concern is the fact that when you BLAST, none of your features hit the phylum Nematoda.

This tells me that your data may not contain Nematoda, or at least very little of your target.

The reference to the primers you listed (Kawanobe et al. 2021) was helpful. From what I can tell, the forward primer is actually from Hadziavdic et al. 2014, not Kawanobe et al. 2021. Although Kawanobe et al. lists the reverse primer as being from Hadziavdic et al., I find no listing of that primer in that paper. In fact, the earliest references I can find for the reverse primer are these two papers: Willerslev et al. 1999 and Medinger et al. 2010. Neither of these papers appears to have an explicit reference for this reverse primer sequence. I assume they obtained it from elsewhere. The name of this reverse primer appears to actually be "nu-SSU-898".

Anyway, the primers you are using appear to be generic primers, not specific to nematodes. If nematodes are your focus, then I'd suggest using nematode-specific primers such as those from Sapkota & Nicolaisen 2015, or other specific primers.

I am sure others on the forum can help suggest appropriate primer pairs. Otherwise, I am not sure what the issue could be at this point. I'd continue manually running BLAST on your most abundant sequences. If they continually do not return nematode hits but provide good matches to other taxa, then it is likely an issue of primer choice for these samples. That is only a guess at this point.

nietof · February 16, 2024, 1:59pm

Mike @SoilRotifer
Thank you, that was very helpful. This was my first time trying to look for nematodes using metabarcoding. I will reach out to @mkweber, who was in the forum asking about the 18S NemaBase classifier.
BTW, how do you tell if sequences are oriented correctly?
Thanks again for all your help
Cheers
Fernando

SoilRotifer · February 16, 2024, 2:20pm

The easiest way is to look at the orientation of the query against the reference in the BLAST output. If the BLAST alignment position numbers of the query and the reference ascend / descend in the same direction, then both sequences are in the same orientation. Whereas, if the position numbers of one sequence ascend while the other's descend, then that tells you that BLAST had to flip (reverse complement) one of the sequences in order to align them.
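
If you run BLAST locally, the same check can be sketched against tabular output. This assumes nucleotide BLAST results in the default 12-column tabular layout (`-outfmt 6`), where columns 7 and 8 are the query start and end and columns 9 and 10 are the subject start and end. The function name `blast_orientation` is hypothetical.

```shell
# Hypothetical helper: for each hit in BLAST tabular output on stdin,
# print the query ID, subject ID, and whether the two sequences run in
# the same direction ("same") or one had to be flipped ("reversed").
blast_orientation() {
  awk -F'\t' '
    NF >= 10 {
      q_ascending = ($7 <= $8)     # query coordinates increase
      s_ascending = ($9 <= $10)    # subject coordinates increase
      print $1 "\t" $2 "\t" ((q_ascending == s_ascending) ? "same" : "reversed")
    }
  '
}
```

Usage would look like `blast_orientation < hits.tsv`, where `hits.tsv` is a placeholder name. If most hits come back as `reversed`, the reads are likely in the opposite orientation to the reference.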

dervishcarving · July 23, 2024, 3:46pm

Hi Margo, did you ever get NemaBase into a format that Qiime2 would recognise? If you did, any chance you could post the database?
I've been having the same problem with the SILVA database and, eventually, I used the NemaTaxa database. It has excellent taxonomy BUT only goes down to genus level, with no species data included.
I did notice one potential issue with NemaBase though: doesn't it lack outgroups? So how does it classify non-nematode DNA when it finds it? (No primers are perfect; you are always going to get non-nematode DNA from environmental samples.)
Thanks
Dave (a very inexperienced Qiime user, but I'm learning as fast as I can)