Pre-trained silva classifier (V3-V4) qiime 2021.4

SoilRotifer · August 21, 2021, 10:29pm

If you read the information in the link that @anwesh provided, i.e. the NR99 database, you'll see that this is a curated "non-redundant" reference database. Note, it says:

By applying a 99% identity criterion to remove highly similar sequences using the vsearch tool ... Sequences from cultivated species have been preserved in all cases....

These are the base files from which we've made a few classifiers available on the Data resources page, and they also appear to be what @anwesh is using. However, you can use RESCRIPt to download and parse the full database instead of the NR99 if you'd like. If you do, I'd recommend performing some quality control, as outlined in the RESCRIPt tutorials.

Anyway, the curated SILVA NR99 database is helpful because it reduces the size of the database, so that you can run it on machines with less memory. You save even more memory by extracting only your region of interest (e.g. V3-V4) from it.

SILVA NR99 is 99%

If possible, it is best to use reference reads that have been minimally clustered or simply dereplicated, i.e. at 99% or 100% similarity. You'll be able to classify your reads more accurately, because you have more reference data to use. However, more reference data is not always better if it is not well curated: reference sequences may be of poor quality or incorrectly annotated, which can negatively impact your ability to classify your reads.

Anyway, the old way of pre-clustering reference reads to 97% or 94% sequence similarity (as done with Greengenes and SILVA) was simply a logistical way to reduce the size of the reference database even further, so that it would run on computers with limited memory and CPU power. But this comes at a potential cost: a decreased ability to classify some reads. That is, the resulting clustered representative sequences / OTUs may end up with a truncated taxonomy, e.g. one based on the lowest common ancestor (LCA) of the sequences contained within that cluster / reference OTU. Imagine that the sequences within a 97% OTU cluster are all from different genera of the same family: the new reference OTU would have a taxonomy that only goes to family. There are other methods (check out the rescript dereplicate command), but I won't go into them here.
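
To make the LCA truncation concrete, here is a minimal Python sketch. It is not how RESCRIPt or any SILVA / Greengenes release actually implements it. The function name `lca_taxonomy` is hypothetical, and the three lineages are illustrative SILVA-style strings standing in for the members of one 97% cluster.

```python
from typing import List


def lca_taxonomy(lineages: List[str], sep: str = ";") -> str:
    """Return the deepest lineage prefix shared by every input lineage.

    Hypothetical helper for illustration only; not RESCRIPt's implementation.
    """
    # Split each lineage string into a list of rank labels.
    split = [[rank.strip() for rank in lineage.split(sep)] for lineage in lineages]
    shared = []
    # Walk rank by rank; stop at the first rank where the members disagree.
    for labels in zip(*split):
        if len(set(labels)) != 1:
            break
        shared.append(labels[0])
    return "; ".join(shared)


# Illustrative members of a single 97% cluster: same family, different genera.
family = ("d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
          "o__Enterobacterales; f__Enterobacteriaceae")
cluster = [
    family + "; g__Escherichia-Shigella",
    family + "; g__Salmonella",
    family + "; g__Klebsiella",
]

# The consensus label stops at family, so genus-level information is lost.
print(lca_taxonomy(cluster))
```

Any read that best matches this reference OTU can then only be classified to family, even though each original reference sequence carried a genus label.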

If you are referring to your own sequences, assuming you've run them through DADA2 or deblur, then your sequences are simply Exact / Amplicon Sequence Variants (ESVs / ASVs), i.e. denoised 100% OTUs.

I hope this helps!