clarification/verification of taxonomy classifiers

SoilRotifer · May 20, 2022, 1:55pm

Hi @Rob_DNA, let's see if we can get you sorted.

This depends on which classifier you are using. The pre-made SILVA classifiers were constructed from the NR99 SSU version of the database, after some additional quality control via RESCRIPt (you can check out the tutorial for details). If you would rather start from the full, raw SILVA database instead of the NR99, you can use the RESCRIPt plugin to download it.

Correct. The main downside is a practical one: the file size and memory footprint of the full-length classifiers can be quite large, which prevents their use on machines that lack appropriate resources. The amplicon-specific versions are much smaller. Depending on your taxa of interest, you might lose a tiny bit of classification accuracy with the full-length classifier compared to the amplicon-specific versions, but the difference has been minimal for the data sets that I've worked with, though your mileage may vary. You can always compare the outputs.
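If you want a quick, rough way to compare the outputs of two classifiers, here is a minimal Python sketch. It assumes you have already loaded each taxonomy table into a dict of feature ID to semicolon-delimited taxonomy string. The function name `agreement_by_rank`, the feature IDs, and the taxonomy strings are all made up for illustration, and this is not how RESCRIPt does its own comparisons.

```python
def agreement_by_rank(tax_a, tax_b, sep=";"):
    """Fraction of shared features whose lineages agree down to each rank depth.

    Two lineages that both stop at the same shallower rank count as agreeing
    at deeper depths; a lineage that stops earlier than the other does not.
    """
    shared_ids = sorted(set(tax_a) & set(tax_b))
    if not shared_ids:
        return []
    split_a = {f: [r.strip() for r in tax_a[f].split(sep)] for f in shared_ids}
    split_b = {f: [r.strip() for r in tax_b[f].split(sep)] for f in shared_ids}
    max_depth = max(
        len(ranks) for ranks in list(split_a.values()) + list(split_b.values())
    )
    fractions = []
    for depth in range(1, max_depth + 1):
        agree = sum(split_a[f][:depth] == split_b[f][:depth] for f in shared_ids)
        fractions.append(agree / len(shared_ids))
    return fractions


# Hypothetical assignments for the same two features from two classifiers.
full_length = {
    "asv1": "d__Bacteria; p__Firmicutes; g__Bacillus",
    "asv2": "d__Bacteria; p__Proteobacteria",
}
v4_only = {
    "asv1": "d__Bacteria; p__Firmicutes; g__Bacillus",
    "asv2": "d__Bacteria; p__Proteobacteria; g__Pseudomonas",
}

result = agreement_by_rank(full_length, v4_only)
print(result)  # [1.0, 1.0, 0.5]
```

In this toy case the two classifiers agree on everything down to the second rank and only diverge at the third, where one of them declined to assign a genus for `asv2`.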

Great questions! I refer you to these great papers, several of which also discuss the benefit of constructing amplicon-region-specific classifiers. In a nutshell, there are many cases in which several different organisms have identical DNA sequence over the amplicon region, making it hard to disambiguate between taxa. BLAST, for example, might return all of these as equally good hits, which may not be helpful. We would prefer that a consensus taxonomy, or lowest common ancestor (LCA), be returned for our query sequence.
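To make the LCA idea concrete, here is a minimal Python sketch that takes the taxonomy strings of several equally good hits and returns the deepest lineage they all share. The function name `lowest_common_ancestor` and the example strings are hypothetical and only illustrate the concept; this is not the actual consensus logic used by any QIIME 2 classifier.

```python
def lowest_common_ancestor(taxonomies, sep=";"):
    """Return the longest rank prefix shared by every taxonomy string."""
    split = [[rank.strip() for rank in t.split(sep)] for t in taxonomies]
    shared = []
    # zip stops at the shortest lineage, so we never read past a truncated hit.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the hits disagree
        shared.append(ranks[0])
    return "; ".join(shared)


# Illustrative tied hits (made-up strings, not real classifier output):
# identical over the amplicon region, but annotated as different species.
hits = [
    "d__Bacteria; p__Proteobacteria; g__Escherichia-Shigella; s__Escherichia_coli",
    "d__Bacteria; p__Proteobacteria; g__Escherichia-Shigella; s__Shigella_flexneri",
]

lca = lowest_common_ancestor(hits)
print(lca)  # d__Bacteria; p__Proteobacteria; g__Escherichia-Shigella
```

So rather than picking one species arbitrarily from a set of tied hits, the query is labeled only as deep as the hits agree.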

Yes. Although we provide the raw files that were used to make the classifiers, and the classifiers themselves (for the full-length and V4 region of the SSU gene), you can use RESCRIPt (linked above) to choose among several versions of the SILVA database and curate them as you'd like. Many on the forum have made their own V3V4 classifier, for example.

Yep.

Finally, RESCRIPt provides some tools to help you compare the various reference databases and classifiers you generate.

-Cheers!
-Mike