Has anyone trained a naive Bayes classifier on the SILVA reference database (V3-V4 region) in QIIME 2 2021.4? I ran taxonomy classification using the full-length pre-trained classifier from the QIIME 2 resources page, and the job was killed due to a memory error. I also tried to train a custom classifier myself, but that action was killed by a memory error as well. So I would like to ask a good deed of fellow QIIME 2 users: could someone provide a SILVA-based classifier whose reference reads were extracted with the 341F & 805R degenerate primers?
Welcome to the forum, @Maryam_21!
I don't think the QIIME 2 team has the bandwidth to add another pre-trained classifier at this time, but here are a few different ways you might work around your memory challenges.
First, if any community members already have a 341F/805R SILVA classifier they can share, that would be amazing!
If nothing turns up, you may be able to use RESCRIPt to build a custom classifier that does what you need. It offers a couple of different ways to filter and dereplicate the reference data while building an amplicon-specific database, which will in turn reduce memory requirements to build the classifier.
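In case it helps, here is a minimal sketch of what that RESCRIPt workflow could look like on the command line. This assumes SILVA 138 and the commonly cited 341F/805R degenerate primer sequences (CCTACGGGNGGCWGCAG / GACTACHVGGGTATCTAATCC) — double-check those against your wet-lab protocol. The file names are hypothetical, and the RESCRIPt tutorial's extra quality-control steps (e.g. culling low-quality sequences and length filtering) are omitted here for brevity:

```shell
# Download the SILVA 138 SSURef_NR99 data (requires the RESCRIPt plugin)
qiime rescript get-silva-data \
  --p-version '138' \
  --p-target 'SSURef_NR99' \
  --o-silva-sequences silva-138-nr99-rna-seqs.qza \
  --o-silva-taxonomy silva-138-nr99-tax.qza

# SILVA sequences are RNA; convert them to DNA
qiime rescript reverse-transcribe \
  --i-rna-sequences silva-138-nr99-rna-seqs.qza \
  --o-dna-sequences silva-138-nr99-seqs.qza

# Extract the V3-V4 region with the 341F/805R primers
qiime feature-classifier extract-reads \
  --i-sequences silva-138-nr99-seqs.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --o-reads silva-138-v3v4-seqs.qza

# Dereplicate the amplicon-specific reads to shrink the database
qiime rescript dereplicate \
  --i-sequences silva-138-v3v4-seqs.qza \
  --i-taxa silva-138-nr99-tax.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138-v3v4-uniq-seqs.qza \
  --o-dereplicated-taxa silva-138-v3v4-uniq-tax.qza

# Train the naive Bayes classifier on the amplicon-specific database
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-v3v4-uniq-seqs.qza \
  --i-reference-taxonomy silva-138-v3v4-uniq-tax.qza \
  --o-classifier silva-138-v3v4-classifier.qza
```

The RESCRIPt SILVA tutorial on this forum walks through these steps (and the quality-control steps skipped above) in much more detail, so I'd follow that rather than treating the sketch above as authoritative.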
A colleague has been able to build amplicon-specific classifiers from SILVA with 16GB of RAM. If necessary, dropping species-level labels will help you reduce the size even more. If you can run the pre-trained v4 classifier locally, you probably have the resources to run your own custom v3v4 classifier.
If your hardware still isn't up to the task with RESCRIPt, you might consider using an institutional cluster, or renting a server for a short time (often surprisingly affordable). A search of this forum should turn up a few posts with server-rental resources.
Good luck!
CK
HERE are classifiers for V1-V2, V3, V3-V4 & V4 regions (silva 138).
Regards
Dear @anwesh ,
What classification similarity does your classifier use? I am a little confused, because other platforms ask for a classification similarity percentage, whereas in QIIME 2 we are not asked to specify one. Could you help me clear up this confusion?
Hi @mertcan ,
If you are referring to the Greengenes db with gg_85_otus, gg_97_otus, etc., I have used the ssu_nr99 of the SILVA db.
Thank you for your fast reply!
Actually, I am a beginner in this area. I would like to ask: does "99" mean that sequences which are 99% identical to a database entry will be assigned a taxon?
My colleague uses mothur, where she sets a 97% or 99% classification similarity threshold, and she asked me what mine is. I am searching for the answer.
I hope I could explain myself.
Hi @mertcan,
If you read the information in the link that @anwesh provided, i.e. the NR99 database, you'll see that this is a curated "non-redundant" reference database. Note, it says:
By applying a 99% identity criterion to remove highly similar sequences using the vsearch tool ... Sequences from cultivated species have been preserved in all cases....
These are the base files from which we've made a few classifiers available on the Data resources page, and they also appear to be what @anwesh is using. However, you can use RESCRIPt to download and parse the full database instead of the NR99 if you'd like. But I'd recommend performing some quality control, as outlined in the RESCRIPt tutorials.
Anyway, the curated SILVA NR99 database is helpful, as it reduces the size of the database so that you can run it on machines with less memory. You obviously save even more memory by extracting only your region of interest from it, too.
SILVA NR99 is 99%
If possible, it is best to use reference reads that have been minimally clustered or simply dereplicated, i.e. at 99% or 100% similarity, as you'll be able to classify your reads more accurately: you have more reference data to work with. However, more reference data is not always better if it is not well curated; the reference may contain poor-quality or incorrectly annotated sequences, which can negatively impact your ability to classify your reads.
Anyway, the old practice of pre-clustering reference reads to 97% or 94% sequence similarity (as done with Greengenes and SILVA) was simply a logistical way to reduce the size of the reference database even further, so that it would run on computers with limited memory and CPU power. But this comes at a potential cost: a decreased ability to classify some reads. That is, the resulting clustered representative sequences/OTUs may end up with a truncated taxonomy, e.g. based on the lowest common ancestor (LCA) of the sequences contained within that cluster / reference OTU. Imagine that the sequences within a 97% OTU cluster are all from different genera: the new reference OTU would have a taxonomy that only goes down to family. There are other modes (check out the rescript dereplicate command), but I won't go into them here.
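To make that LCA truncation concrete, here's a toy shell/awk sketch (an illustration only, not RESCRIPt's actual implementation) that truncates a set of semicolon-delimited taxonomy strings at the first rank where they disagree. The taxonomy strings below are made-up examples in SILVA-style formatting:

```shell
# Truncate taxonomy strings at the first rank where they disagree
# (lowest common ancestor). Reads one taxonomy per line on stdin.
lca_taxonomy() {
  awk -F';' '
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) a[i] = $i; next }
    {
      for (i = 1; i <= n; i++)
        if (i > NF || $i != a[i]) { n = i - 1; break }
    }
    END {
      out = ""
      for (i = 1; i <= n; i++) out = out (i > 1 ? ";" : "") a[i]
      print out
    }
  '
}

# Three sequences in one hypothetical 97% OTU cluster, all from
# different genera of the same family:
printf '%s\n' \
  'd__Bacteria;p__Firmicutes;c__Clostridia;f__Lachnospiraceae;g__Blautia' \
  'd__Bacteria;p__Firmicutes;c__Clostridia;f__Lachnospiraceae;g__Roseburia' \
  'd__Bacteria;p__Firmicutes;c__Clostridia;f__Lachnospiraceae;g__Dorea' |
  lca_taxonomy
# prints: d__Bacteria;p__Firmicutes;c__Clostridia;f__Lachnospiraceae
```

The genus labels conflict, so the cluster's taxonomy stops at family level — exactly the truncation described above.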
If you are referring to your own sequences, and assuming you've run them through DADA2 or deblur, then your sequences are simply Exact / Amplicon Sequence Variants (ESVs / ASVs), i.e. denoised 100% OTUs.
I hope this helps!
Thank you very much for helping! May I ask: is it 99% or 97% for referencing?
This really has no meaning, as I've outlined:
Furthermore:
So, if you are using reference reads that are 99% - 100% then you can use them to classify any reads that are clustered at different similarities, e.g. 97%, 94%, etc...
In fact, some consider classifying 97% clustered OTUs against a 97% clustered reference database not a great idea, as your nearest reference sequence can then be up to 6% away, further reducing classification accuracy.
Does this help?