Using silva-138 pre-trained classifier

Mudit_Bhatia · July 17, 2024, 10:49pm

Hi,
I hope you are doing well. A collaborator and I have been comparing 18S sequencing data processed with qiime2 and Mothur. The data were aligned and classified in Mothur using the Silva 138.1 database, whereas in qiime2 the command used was

qiime feature-classifier classify-sklearn ......

Both of us were able to classify the data and generate taxa bar plots; however, when we compared the results, we found some differences.

Specifically, for the phylum Chlorophyta we observed discrepancies at the lower ranks, with the qiime2 classification differing markedly from Mothur's.

Therefore, to troubleshoot this issue we performed the following steps:

1. We extracted the rep-seqs file generated after dada2 cleanup and ran BLAST on a sequence at NCBI. This particular sequence matched the genus Desmodesmus, whereas qiime2 classified it as d__Eukaryota; p__Chlorophyta; c__Chlorophyceae; o__Chlorophyceae; f__Chlorophyceae; g__Chlorophyceae. No member of the genus Desmodesmus was found when classification was performed in qiime2, whereas it was observed in some abundance when classified in Mothur.
2. Because the first method covers only a small number of sequences, we also tried a second check. We extracted the Chlorophyta sequences from Mothur as a FASTA file and used them as input for classification in qiime2. I am attaching a CSV file that pairs each sequence with its classification in both Mothur and qiime2:
https://drive.google.com/file/d/1J_oxsUgoBs9_m-zoTKSJ70FG4RSVqq7H/view?usp=sharing
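When reading a label like the qiime2 one quoted above, it can help to find the deepest rank that actually carries new information. Repeated names at the lower ranks typically come from empty ranks being filled with the nearest higher-rank name (RESCRIPt calls this rank propagation); assuming that is what happened here, a minimal Python sketch could collapse the repeats. The function names are made up for illustration and are not part of qiime2 or Mothur.

```python
def split_lineage(lineage):
    """Return (rank_prefix, name) pairs from a 'd__X; p__Y' style lineage."""
    pairs = []
    for part in lineage.split(";"):
        part = part.strip()
        if not part:
            continue
        prefix, sep, name = part.partition("__")
        if not sep:
            # No rank prefix present (e.g. a plain 'Eukaryota;Chlorophyta' string).
            prefix, name = "", part
        pairs.append((prefix, name))
    return pairs


def deepest_distinct_rank(lineage):
    """Return the last (prefix, name) whose name differs from the rank above it."""
    deepest = None
    previous_name = None
    for prefix, name in split_lineage(lineage):
        if name != previous_name:
            deepest = (prefix, name)
        previous_name = name
    return deepest


# The qiime2 label quoted in the post above.
label = ("d__Eukaryota; p__Chlorophyta; c__Chlorophyceae; "
         "o__Chlorophyceae; f__Chlorophyceae; g__Chlorophyceae")
print(deepest_distinct_rank(label))  # ('c', 'Chlorophyceae')
```

Under that assumption, this label is really a class-level assignment, which is a fairer basis for comparison with the Mothur output than the nominal genus field.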

We observed that the Mothur classification correlated well with the BLAST results from the NCBI website, but the qiime2 classification did not. Could you please help us understand why we see these differences when using the same classifier (at least as far as we understand)?

Also, I should have mentioned it earlier but the qiime2 version I have used for this exercise is qiime2-amplicon-2023.9.

Thank You for your help.

Mudit_Bhatia · July 17, 2024, 10:58pm

Hi,

I hope you are doing well. A collaborator and I have been working to understand the differences in data analysed by Mothur and qiime2 for 18S sequencing.
For qiime2, we used cutadapt to trim the primers and then dada2 to generate the representative sequences file. When this file was extracted, we observed a total of 442 sequences for the phylum Chlorophyta. However, a similar process and extraction in Mothur resulted in almost 42,000 sequences. We therefore wanted to understand whether qiime2 collapses different sequences in some way, resulting in only 442 sequences compared to 42,000 for Mothur.

Thank You for your help.

SoilRotifer · July 19, 2024, 4:45pm

Hi @Mudit_Bhatia,

Let's see if we can figure this out. First let me clarify a few things:

1. Even though both tools use SILVA, this does not necessarily mean the databases are the same. For example, the pre-made classifiers from QIIME 2 may be built from SILVA 138 rather than the newer 138.1 release, and there were some changes between the two.
2. QIIME 2 & mothur also curate the SILVA database differently, which means some reference sequences may have been discarded or renamed between the two. For example, pre-made classifiers from QIIME 2 generally follow this approach. There might have been some aggressive culling of the reference data, perhaps removing too many eukaryotes, but much of this was just to provide an example of what a user can do to curate their database. Still, how a database is curated can have a large effect on how well the classifiers work. I'd suggest running through the linked tutorial to make your own SILVA database by simply running these commands:
   a) qiime rescript get-silva-data ...
   b) qiime rescript reverse-transcribe ...
   c) qiime rescript dereplicate ...
   d) qiime feature-classifier fit-classifier-naive-bayes ...
   This will give you an "unedited" version of SILVA 138.1 to compare against, as none of the data quality or sequence removal steps will have been run. Give this a try and see what happens.
3. AFAIK, each tool uses a different classifier: mothur uses KNN and QIIME 2 uses naive Bayes. Though I am not sure how much of a difference this will make.

Keep us posted as to the outcome.

Mudit_Bhatia · July 20, 2024, 10:19pm

Thank you, Mike, for the directions.
Yes, the version of qiime2 I was using used Silva 138 and not 138.1.
When I trained a Silva 138.1 classifier following the directions you provided, the results looked much more comparable.
However, even with the completely unedited version, the classification still does not reach the genus Desmodesmus (more or less our organism of focus). I will try performing all the steps in the approach to see if I can get there. If you have suggestions for any other steps I could follow, do let me know.

Please find the results at this link: https://drive.google.com/file/d/1hWFLvjIJ2av6veC-XHGj8ecR_DhtG5wH/view?usp=sharing

Regards

SoilRotifer · July 22, 2024, 9:49pm

Which primers / amplicon region are you using? V3V4?

Also, I noticed that there are only ~17 reference sequences for Desmodesmus in the nr99 SILVA reference database, and just a few more in the full database. There also appear to be fewer references in the new SILVA v138.2. Currently, there is no auto download for 138.2, so you'll have to manually follow the instructions under Getting SILVA data: Hard Mode... gritty details (click the triangle / drop menu). Then see if 138.2 does any better.

Also, I forgot to mention that our plugin, by default, downloads the non-redundant SSURef_NR99 SILVA database. You may opt to change the options in get-silva-data to download the full database, SSURef.

-Mike

Mudit_Bhatia · August 12, 2024, 8:17pm

Thank you, Mike, for reaching out, and sorry for the delay in getting back.

For the Mothur classification, we are also using the prepared database as mentioned on the website. This classifies all the Desmodesmus sequences in question to the genus level. However, when we use the 138.1 classifier trained with rescript, it only gets us to the family level (Sphaeropleales).

We performed BLAST on the individual sequences, which tells us that each sequence belongs to Desmodesmus with very high percent identity and coverage.

Our major concern is whether it's just the classification algorithm doing this, or whether we are making some fundamental error that we are not aware of.

Thank you for your guidance and help

SoilRotifer · August 12, 2024, 9:50pm

Are there any sequences that are not Desmodesmus in the BLAST output, but are still a good hit? If so, that is probably why the feature classifier is returning family-level. I ask because I actually ran BLAST on one of the sequences, and there were other taxa that appeared equally probable as hits, although most were Desmodesmus:

Desmodesmus armatus
Desmodesmus sp.
Auxenochlorella pyrenoidosa
Scenedesmus sp.
Scenedesmaceae sp.

They looked very similar to one another...
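A quick way to see how mixed a set of near-equal BLAST hits is would be to tally them by the first word of the hit name. The snippet below is only a sketch: it uses the five names listed above as a toy input (one entry per distinct name, not per hit), and the helper name is made up for illustration.

```python
from collections import Counter

# Distinct hit names taken from the list above (not the full BLAST output).
hits = [
    "Desmodesmus armatus",
    "Desmodesmus sp.",
    "Auxenochlorella pyrenoidosa",
    "Scenedesmus sp.",
    "Scenedesmaceae sp.",
]


def first_word_counts(hit_names):
    """Count hits by the first word of each name.

    The first word is usually the genus, but can be a higher rank
    (e.g. 'Scenedesmaceae sp.' is labelled only to family).
    """
    return Counter(name.split()[0] for name in hit_names)


counts = first_word_counts(hits)
for taxon, n in counts.most_common():
    print(taxon, n)
```

If more than one genus shows up among hits that are essentially tied, a genus-level call is hard to justify from the sequence alone.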

There are several options for how to dereplicate your reference data with RESCRIPt... The default is to remove any sequences that are identical, unless they have a different taxonomy, i.e. --p-mode uniq. This allows us to make weighted classifiers, see here. If the uniq approach is used, and the query you are trying to identify hits identical sequences with different taxonomic labels... then the classifier will likely decide to return the lowest common ancestor. In this case, the family. Assuming there aren't numerically more hits to one taxon than another within the representatives in the database...
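To make the "lowest common ancestor" idea concrete, here is a minimal Python sketch. It is not what the classifier runs internally (the naive Bayes classifier reports the deepest rank whose confidence passes a threshold rather than computing an LCA), but the effect on conflicting labels is similar. The order and family names below are placeholders, not real SILVA entries, and the function name is made up for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the leading ranks shared by all ';'-separated lineages."""
    split = [
        [rank.strip() for rank in lineage.split(";") if rank.strip()]
        for lineage in lineages
    ]
    if not split:
        return []
    common = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common


# Hypothetical labels for two identical reference sequences.
# 'o__OrderX' and 'f__FamilyX' are placeholders.
a = "d__Eukaryota; p__Chlorophyta; c__Chlorophyceae; o__OrderX; f__FamilyX; g__Desmodesmus"
b = "d__Eukaryota; p__Chlorophyta; c__Chlorophyceae; o__OrderX; f__FamilyX; g__Scenedesmus"

print("; ".join(lowest_common_ancestor([a, b])))
```

With these two labels the shared lineage stops at the family, which mirrors the behaviour described above.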

You could opt to not dereplicate the data and train your classifier, or simply try rescript dereplicate --p-mode majority... to make use of the taxonomy most commonly present for a given sequence.
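The difference between the uniq and majority modes can be illustrated with a toy example. This is a simplified sketch of the idea only, not RESCRIPt's implementation; the sequences and labels are invented, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def dereplicate(records, mode="uniq"):
    """Toy dereplication of (sequence, taxonomy) pairs.

    uniq:     keep one copy of each identical sequence per distinct taxonomy.
    majority: keep one copy of each identical sequence, labelled with its
              most frequent taxonomy (ties go to the label seen first).
    """
    by_seq = defaultdict(list)
    for seq, tax in records:
        by_seq[seq].append(tax)

    kept = []
    for seq, taxa in by_seq.items():
        if mode == "uniq":
            # dict.fromkeys drops duplicate labels while preserving order.
            kept.extend((seq, tax) for tax in dict.fromkeys(taxa))
        elif mode == "majority":
            tax, _count = Counter(taxa).most_common(1)[0]
            kept.append((seq, tax))
        else:
            raise ValueError(f"unsupported mode: {mode!r}")
    return kept


# Invented reference records: three identical sequences with mixed labels.
records = [
    ("ACGT", "g__Desmodesmus"),
    ("ACGT", "g__Desmodesmus"),
    ("ACGT", "g__Scenedesmus"),
    ("TTGA", "g__Desmodesmus"),
]

uniq_result = dereplicate(records, mode="uniq")
majority_result = dereplicate(records, mode="majority")
print(uniq_result)
print(majority_result)
```

In the uniq result the sequence ACGT keeps both genus labels, so a matching query is ambiguous at genus level; in the majority result it keeps only the more common label.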

I doubt it has anything to do with the classifier; it is more likely down to how the reference database is made / curated, as I mentioned earlier. There are quite a few differences in how the two reference databases are constructed. RESCRIPt's rationale is for the user to consider the appropriate form of database curation. Also, there are known taxonomic inconsistencies within SILVA. This is also highlighted in the mothur curation documentation you linked.

I'll keep thinking on this. If anyone else has any thoughts or suggestions, please feel free to join in.
