Is there an issue with the 2024.5 silva138-99-nb classifier?

Hello QIIME team,

Thanks for developing such a useful analysis package.

I observed different classification outputs from the new SILVA NB classifier when running feature-classifier with different full-length SILVA classifiers.

My dataset was generated on a PacBio machine and processed with the HiFi_16S nf workflow. I believe that the HiFi_16S workflow uses VSEARCH to classify the sequences.

Taxa plots of VSEARCH output:
taxonomy_barplot_vsearch.qzv (630.9 KB)

I was more interested in NB classifiers, so I used the DADA2 output from that workflow and tried with a self-trained full-length classifier.

Taxa plots with self-trained classifier:
3_taxa_plots.qzv (1.3 MB)

Since that classifier was generated using the 2023.05 version, I thought of using the latest classifiers from the resources page. As expected, there was a scikit-learn version error. I downloaded the recent QIIME2 2024.10 version and tried the classification. Well, the results were pretty different with the recent 2024.5 classifier.

Taxa plots with 2024.5 classifier:
3_taxa_plots_202405.qzv (1.0 MB)

I also tried the environment-specific (weighted) classifier. That gave output similar to that of the self-trained classifier.

Taxa plots with 2024.5 weighted classifier:
3_taxa_plots_202405_env.qzv (1.5 MB)

I thought there was an issue with my commands. So, I tried the 2021.4 classifier from the resources page. The output was similar to that of the self-trained classifier.

Taxa plots with 2021.4 classifier:
3_taxa_plots_2021.qzv (1.3 MB)

So, when I checked the file size of each classifier, the 2024.5 classifier seems to be only ~220 MB, whereas the others were ~500 MB. I checked the SHA-256 checksum with shasum, and the output matched the one listed on the page.

$ shasum -a 256 silva-138-99-nb-classifier_202405.qza 
# c08a1aa4d56b449b511f7215543a43249ae9c54b57491428a7e5548a62613616  silva-138-99-nb-classifier_202405.qza
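
For anyone who wants to script the same check, here is a minimal Python sketch of it. The function names `sha256_of` and `describe_download` are made up for illustration; only the standard library is used. Note that a matching checksum only shows that the download is identical to the file on the server; it says nothing about how the classifier was built.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large classifier files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_download(path, expected_sha256):
    """Report file size (MB), digest, and whether the digest matches."""
    path = Path(path)
    actual = sha256_of(path)
    return {
        "size_mb": path.stat().st_size / 1e6,
        "sha256": actual,
        # Hex digests are case-insensitive, so compare in lower case.
        "matches": actual == expected_sha256.strip().lower(),
    }
```

Calling `describe_download("silva-138-99-nb-classifier_202405.qza", "<checksum from the resources page>")` would give the size and checksum comparison in one step.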

So, is there an issue with that file?

Regards,
Anwesh

Hi @anwesh,

It appears that you've mixed and matched different versions of different classifiers. For example:

| File | Database | Notes |
| --- | --- | --- |
| taxonomy_barplot_vsearch.qzv | GTDB v207 | |
| 3_taxa_plots.qzv | SILVA v138 | |
| 3_taxa_plots_202405.qzv | SILVA v138.1 | |
| 3_taxa_plots_202405_env.qzv | SILVA v138 (weighted?) | Messy provenance graph. |
| 3_taxa_plots_2021.qzv | SILVA v138 | |

I'd make sure you use the same classifier for all your comparisons. SILVA has recently been updated to 138.2. I'd suggest using this or another database consistently. Keep in mind that the databases differ in scope: GTDB, for example, does not contain any eukaryotic sequences and is much smaller than SILVA and Greengenes2.

If you'd like to stick with SILVA, you can use the latest version of RESCRIPt, which comes with QIIME 2 (2024.10). You can use the premade files or use RESCRIPt to curate your own version of SILVA and/or GTDB. RESCRIPt now defaults to SILVA v138.2 and GTDB v220.0.

I am not sure what is going on with the provenance graph for 3_taxa_plots_202405_env.qzv, but it is indecipherable to me. I think much of this comes from the generation of the weighted classifiers, as it pulls from QIITA, etc...?

Anyway, my suggestion would be to make sure you are using the same reference database prior to making comparisons.

Hi @SoilRotifer,
Thanks for the clarification.

Maybe I did not phrase my question clearly. Please ignore all visualizations, except these two: ...202405.qzv and ...202405_env.qzv (weighted).

Were these two classifiers from the QIIME2 resources page generated using different versions of SILVA?

These two taxa bar plots were generated from the output of these two classifiers on the same dataset (same ASVs). The QIIME2 version used was 2024.10.

As the plots show, ...202405.qzv has ~55% Bacteria, ~20% Unassigned, and ~15-20% Eukaryota, whereas ...202405_env.qzv has almost 50% each of Bacteria and Archaea. Is this much variation expected?

Hope my question makes sense.

Thanks, I am doing that right now.

Regards,
Anwesh

Hi @anwesh,

We try to update the premade files for each QIIME 2 release, but this may not always be the case. The provenance information contained within the downloadable premade files will tell you which version of SILVA was used to generate the database. Just drag and drop the file into QIIME 2 View, and click on the Provenance tab.
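
If you would rather check this from a script than from QIIME 2 View, a rough sketch follows. It assumes the usual archive layout, in which a .qza/.qzv file is a zip archive whose top-level directory is the artifact UUID, with a `VERSION` file and a `provenance/` directory holding `action/action.yaml` for the artifact itself and `artifacts/<uuid>/` for each ancestor. The function name `read_root_provenance` is hypothetical, and the YAML is returned as plain text rather than parsed.

```python
import zipfile


def read_root_provenance(archive_path):
    """Return basic provenance text from a QIIME 2 archive (.qza or .qzv).

    Assumes the layout <uuid>/VERSION and <uuid>/provenance/... inside the zip.
    """
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
        if not names:
            raise ValueError("archive is empty")
        # The first path component of every member is the artifact UUID.
        root = names[0].split("/")[0]

        def read_text(member):
            full_name = f"{root}/{member}"
            if full_name not in names:
                return None
            return archive.read(full_name).decode("utf-8")

        # Each ancestor artifact has its own directory under provenance/artifacts/.
        prefix = f"{root}/provenance/artifacts/"
        ancestors = sorted(
            {
                name[len(prefix):].split("/")[0]
                for name in names
                if name.startswith(prefix) and len(name) > len(prefix)
            }
        )
        return {
            "uuid": root,
            "version": read_text("VERSION"),
            "action": read_text("provenance/action/action.yaml"),
            "ancestors": ancestors,
        }
```

Printing `read_root_provenance("silva-138-99-nb-classifier_202405.qza")["action"]` would show the recorded action for the classifier itself; the steps that fetched the reference data sit in the ancestor directories, which can be read the same way.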

This is the beauty of RESCRIPt: the information on how the reference database was made will be included in your analysis. Note the version information in the screenshot below when I used QIIME 2 View on your 3_taxa_plots_202405.qzv file.

Also I just want to point out some of my thoughts here. It does focus a little on UNITE, but the general perspective still applies.

As I outlined in the table above, two different versions of the SILVA db were used for generating the outputs viewed within ...202405.qzv and ...202405_env.qzv respectively. Also, the weighted classifiers are generated quite a bit differently than the standard classifier. So, it depends on what you expect to see given knowledge of your samples.
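
To see which features drive a domain-level difference like this, one option is to tally the two classifications side by side. The sketch below is only an illustration: it assumes each taxonomy has already been loaded into a plain dict mapping feature ID to a SILVA-style taxon string (for example "d__Bacteria; p__..." or "Unassigned") and that per-feature read counts are available as a second dict. The function names are hypothetical.

```python
from collections import defaultdict


def domain_of(taxon):
    """Return the first rank of a ';'-separated taxon string."""
    first_rank = taxon.split(";")[0].strip()
    return first_rank if first_rank else "Unassigned"


def domain_fractions(taxonomy, counts):
    """Return the fraction of total reads assigned to each domain."""
    totals = defaultdict(float)
    for feature_id, count in counts.items():
        # Features missing from the taxonomy are counted as Unassigned.
        totals[domain_of(taxonomy.get(feature_id, "Unassigned"))] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {domain: total / grand_total for domain, total in totals.items()}


def changed_domains(taxonomy_a, taxonomy_b):
    """Return features whose domain differs between two classifications."""
    changed = {}
    for feature_id in sorted(set(taxonomy_a) & set(taxonomy_b)):
        domain_a = domain_of(taxonomy_a[feature_id])
        domain_b = domain_of(taxonomy_b[feature_id])
        if domain_a != domain_b:
            changed[feature_id] = (domain_a, domain_b)
    return changed
```

Running `changed_domains` on the standard and weighted classifier outputs would list the ASVs that switch domain between the two runs, which is a reasonable starting point for checking a few of them by hand.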

I defer to others who know more about how the weighted classifiers are made. Finally, I've never worked with PacBio data, so I do not know what to expect. Perhaps others on the forum can help here as well?
