Training Silva 132 classifier for qiime2-amplicon-2023.9

Based on what I've been able to extract from the reference database, I'm not surprised.

To make sure I understand correctly, are you saying that you are able to identify most of your 18S sequences as Eukaryotes using the full-length classifier, and it's only the amplicon-specific one that is not working well?

Most of the time the differences are not substantial, though generally speaking most people report improved classification when training a classifier on the amplicon region. However, it appears that a substantial portion of the region you need to construct an amplicon-specific classifier is not completely present in the database, which reduces the effectiveness of the classifier. In these cases, we suggest using this approach, as I mentioned earlier. But it is understandable why the full-length classifier might be better in this case, as your sequences are hitting the partial region of the full-length sequences, which is removed by the PCR-primer-based extraction method.

Currently, I think we use the pipeline as presented, for both the full-length and amplicon-region classifiers. Again, we provide those pre-made classifiers as a convenience, and might further optimize them in the future (some of us devs have been discussing some ideas on this). But generally, you can curate the reference database as you like. In fact, you can simply use the full-length SSURef_NR99 and not perform any curation at all, though I'd recommend at least dereplicating the database to shrink the file size and memory footprint of the classifier.
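To illustrate why dereplication shrinks the reference, here is a minimal Python sketch of the idea. It is not RESCRIPt's implementation: the function name `dereplicate` and the plain-dict inputs are hypothetical, and it assumes the simplest rule, where identical sequences are collapsed only when they also share the same taxonomy string.

```python
def dereplicate(seqs, taxa):
    """Collapse identical sequences that share the same taxonomy string.

    seqs: dict mapping sequence ID -> sequence string
    taxa: dict mapping sequence ID -> taxonomy string
    Returns (kept_seqs, kept_taxa), keeping the first ID seen per group.
    """
    first_id_for = {}
    for seq_id, seq in seqs.items():
        # Identical sequence + identical taxonomy = redundant record.
        key = (seq.upper(), taxa[seq_id])
        first_id_for.setdefault(key, seq_id)

    kept_ids = set(first_id_for.values())
    kept_seqs = {i: s for i, s in seqs.items() if i in kept_ids}
    kept_taxa = {i: taxa[i] for i in kept_seqs}
    return kept_seqs, kept_taxa


if __name__ == "__main__":
    seqs = {"a": "ACGT", "b": "acgt", "c": "ACGT", "d": "GGCC"}
    taxa = {"a": "d__X", "b": "d__X", "c": "d__Y", "d": "d__X"}
    kept_seqs, kept_taxa = dereplicate(seqs, taxa)
    print(sorted(kept_seqs))
```

Identical sequences with different taxonomy are deliberately kept as separate records here, so no label information is lost. In practice you would run RESCRIPt's dereplication on the actual artifacts rather than anything like this.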

Not currently. But you should be able to use several of the evaluate functions in RESCRIPt to compare the taxonomy files (the ones used as input to the classifier). See the SILVA tutorial for examples.
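As a rough idea of what such a comparison looks at, the sketch below counts how deeply each entry in a taxonomy is annotated, so two taxonomy files can be compared side by side. It is a hypothetical helper, not a RESCRIPt function, and it assumes semicolon-delimited lineages with SILVA-style rank prefixes such as `d__` and `p__`.

```python
from collections import Counter


def annotation_depths(taxa):
    """Count how many entries are annotated down to each depth.

    taxa: dict mapping sequence ID -> semicolon-delimited taxonomy string
    Empty ranks (e.g. a bare 'g__' prefix) are treated as unannotated.
    """
    depths = Counter()
    for lineage in taxa.values():
        depth = 0
        for rank in lineage.split(";"):
            # Drop the rank prefix ('d__', 'p__', ...) if one is present.
            name = rank.strip().split("__", 1)[-1]
            if not name:
                break  # first empty rank ends the usable annotation
            depth += 1
        depths[depth] += 1
    return depths


if __name__ == "__main__":
    full_length = {
        "a": "d__Eukaryota; p__Chlorophyta; c__Trebouxiophyceae",
        "b": "d__Eukaryota; p__; c__",
    }
    print(annotation_depths(full_length))
```

Running it on the full-length taxonomy and on the amplicon-region taxonomy would show whether the extraction step dropped the more deeply annotated entries. The evaluate functions in RESCRIPt report richer summaries than this.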
