I’m working in qiime2-2020.8 and I want to use the ASV database created by @benjjneb (Silva 138.1 prokaryotic SSU taxonomic training data formatted for DADA2 | Zenodo). The first file (silva_nr99_v138.1_train_set.fa.gz) seems to be a classifier-type file, but I don’t know whether it can be imported into QIIME 2. My partners work with this file in RStudio, but I want to know if I can use it in QIIME 2. Thanks in advance for any help.
Also, I want to know if I’m right in thinking that the Silva 138.1 database is OTU-based rather than ASV-based, so my taxonomic assignment could be better if I used an ASV-based database instead. In theory, OTUs are groups of biological sequences, so Dr. Callahan released the sequences within the OTUs. Am I right?
Hi @fellora ,
From the description there, it sounds like this is just a version of the SILVA 138 bacterial sequences that is formatted for use with dada2 in R.
These will not be exactly the same as what dada2 and SILVA provide: we perform some additional QC steps as described in this tutorial:
Yes, the NR99 sequences are clustered at 99% to reduce redundancy. It is possible to create a 100% unique (ASV) database using RESCRIPt in QIIME 2, as described in that tutorial above.
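To make the difference between "100% unique" and "clustered at 99%" concrete, here is a toy Python sketch of dereplication. It is only an illustration under my own assumptions, not RESCRIPt's implementation: the function name `dereplicate`, the record IDs, and the policy of keeping identical sequences apart when their taxonomy labels differ are all made up for this example.

```python
def dereplicate(records):
    """Collapse records that share both sequence and taxonomy (toy policy).

    records: iterable of (seq_id, taxonomy, sequence) tuples.
    The first ID seen for each unique (sequence, taxonomy) pair is kept.
    """
    first_seen = {}
    for seq_id, taxonomy, sequence in records:
        key = (sequence.upper(), taxonomy)
        first_seen.setdefault(key, seq_id)  # later duplicates are ignored
    return [(seq_id, tax, seq) for (seq, tax), seq_id in first_seen.items()]


# Invented reference records, 10 bases each.
records = [
    ("ref1", "Bacteria;Firmicutes", "ACGTACGTAC"),
    ("ref2", "Bacteria;Firmicutes", "ACGTACGTAC"),      # exact duplicate of ref1
    ("ref3", "Bacteria;Firmicutes", "ACGTACGTAT"),      # differs by one base
    ("ref4", "Bacteria;Proteobacteria", "ACGTACGTAC"),  # same sequence, other label
]

unique_records = dereplicate(records)
for seq_id, taxonomy, sequence in unique_records:
    print(seq_id, taxonomy, sequence)
```

Only the exact duplicate (`ref2`) is dropped. The one-base variant `ref3` survives, whereas identity-threshold clustering could fold a variant that close into a single representative sequence.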
However, that level of resolution might not matter for taxonomy classification. The idea of ASVs is to allow mapping of unique sequence variants that theoretically represent subspecies-level variation. Taxonomic classification smooths over this variation to some degree by mapping ASVs or OTUs to the nearest known reference taxonomy. An ASV database would most likely be redundant unless the subspecies-level variants are annotated as such (e.g., strain ID), though this would not be practically useful either, since 16S (even full-length) does not fully resolve at subspecies level. This is why I refer to ASVs as "theoretically" subspecies variants: they are, but you cannot use 16S to distinguish true strains.
But you could certainly build a SILVA ASV database following the tutorial above, and test the level of resolution you get on your own data.
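Coming back to the original question about the DADA2-formatted file: a DADA2 training FASTA keeps the taxonomy string in each header line and has no sequence IDs, while QIIME 2 works with a FASTA that has unique IDs plus a separate tab-separated table mapping each ID to its taxonomy. A minimal Python sketch of that reshaping follows. The function names and the `seq1`, `seq2`, ... IDs are hypothetical, the example records are invented, and none of the QC steps mentioned above are applied.

```python
import io


def read_dada2_training_fasta(handle):
    """Return (taxonomy, sequence) pairs from a DADA2-style training FASTA."""
    records, taxonomy, chunks = [], None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if taxonomy is not None:
                records.append((taxonomy, "".join(chunks)))
            # The header holds only the taxonomy, e.g. ">Bacteria;Firmicutes;"
            taxonomy, chunks = line[1:].rstrip(";"), []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if taxonomy is not None:
        records.append((taxonomy, "".join(chunks)))
    return records


def to_qiime2_tables(records, prefix="seq"):
    """Build an ID'd FASTA and an ID-to-taxonomy TSV (IDs are made up here)."""
    fasta_lines, tax_lines = [], ["Feature ID\tTaxon"]
    for i, (taxonomy, sequence) in enumerate(records, start=1):
        seq_id = f"{prefix}{i}"
        fasta_lines += [f">{seq_id}", sequence]
        tax_lines.append(f"{seq_id}\t{taxonomy}")
    return "\n".join(fasta_lines) + "\n", "\n".join(tax_lines) + "\n"


# Invented two-record example in the DADA2 header style.
example = io.StringIO(
    ">Bacteria;Proteobacteria;Gammaproteobacteria;\nACGTACGT\nACGT\n"
    ">Bacteria;Firmicutes;Bacilli;\nTTGACCA\n"
)
fasta_text, taxonomy_text = to_qiime2_tables(read_dada2_training_fasta(example))
print(fasta_text)
print(taxonomy_text)
```

For the real file you would open it with `gzip.open(path, "rt")` instead of the `io.StringIO` example and write the two strings to disk. The results could then presumably be brought in with `qiime tools import` as `FeatureData[Sequence]` and `FeatureData[Taxonomy]`, but a classifier would still have to be trained from them afterwards.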
Really awesome!! My small Peruvian group and I are going to work with RESCRIPt to create a customized database. Our weakness is that we don't have a server, so we are working on Google Cloud. Recently, my collaborator created a RESCRIPt Docker container (GitHub - gadgrandez/qiime2-rescript), and we are going to compare it with other methods. Thanks a lot!!
Awesome!! So if we improve the converted ASV-based SILVA database with curated species-level NCBI RefSeq sequences using RESCRIPt, and then add ASVs from other 16S SRA studies along with their taxonomic assignments, we could improve the taxonomic resolution and assignment for my specific ecosystem!!
That might be a job for q2-clawback (to weight taxonomic classification by the likelihood of detecting specific taxa, instead of trimming a database, which can lead to misclassification issues).
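The weighting idea can be illustrated with Bayes' rule: the same sequence evidence can point to a different taxon once you account for how often each taxon is expected in the habitat. The Python sketch below uses invented taxa, likelihoods, and weights, and the function name `posterior` is hypothetical. It is not q2-clawback's code, just the arithmetic behind the idea.

```python
def posterior(likelihoods, priors):
    """Combine per-taxon likelihoods with prior weights and normalise to 1."""
    unnormalised = {taxon: likelihoods[taxon] * priors[taxon] for taxon in likelihoods}
    total = sum(unnormalised.values())
    return {taxon: value / total for taxon, value in unnormalised.items()}


# Invented numbers: how well one read matches two closely related taxa.
likelihoods = {"taxon_A": 0.40, "taxon_B": 0.60}

# Uniform weights treat both taxa as equally likely to be present.
uniform = {"taxon_A": 0.5, "taxon_B": 0.5}

# Habitat-specific weights (made up): taxon_A is far more common in this ecosystem.
habitat = {"taxon_A": 0.9, "taxon_B": 0.1}

uniform_result = posterior(likelihoods, uniform)
habitat_result = posterior(likelihoods, habitat)

print(max(uniform_result, key=uniform_result.get), uniform_result)
print(max(habitat_result, key=habitat_result.get), habitat_result)
```

With uniform weights the read goes to `taxon_B`; with the habitat weights it goes to `taxon_A`. Note that `taxon_B` is still in the reference and can still win when the sequence evidence is strong enough, which is the difference from trimming it out of the database.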
Thanks a lot for the idea. I've read the RESCRIPt paper and taken a look at q2-clawback, and I definitely want to use it. However, I'm stuck trying to convert the OTU-based SILVA database. You mentioned that it is possible to do this in RESCRIPt, but I don't have any idea how to start (which command line?), because the paper describes the whole procedure in terms of OTUs. Could you please give me some advice?
Hi @fellora ,
The tutorial above shows how to make such an ASV database (dereplicate but do not cluster the sequences). Likewise, this is how the sequences and pre-trained classifiers shared on the QIIME 2 data-resources page are created.
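As a rough pointer to the dereplication step, the small Python sketch below assembles a `qiime rescript dereplicate` call and prints it instead of running it. The `.qza` file names are placeholders and the helper name is hypothetical. The options shown are the ones I would expect from the tutorial, so please confirm them with `qiime rescript dereplicate --help` for your installed version, since options have changed between releases.

```python
import shlex


def rescript_dereplicate_command(seqs, taxa, out_seqs, out_taxa, mode="uniq"):
    """Assemble (but do not run) a dereplication call as an argument list."""
    return [
        "qiime", "rescript", "dereplicate",
        "--i-sequences", seqs,
        "--i-taxa", taxa,
        "--p-mode", mode,  # 'uniq' dereplicates without clustering at a lower identity
        "--o-dereplicated-sequences", out_seqs,
        "--o-dereplicated-taxa", out_taxa,
    ]


# Placeholder artifact names; substitute the files produced earlier in your workflow.
command = rescript_dereplicate_command(
    "silva-seqs.qza",
    "silva-tax.qza",
    "silva-seqs-derep.qza",
    "silva-tax-derep.qza",
)
print(shlex.join(command))
```

The printed line can be pasted into a shell inside the QIIME 2 environment, or the list can be passed to `subprocess.run(command, check=True)` once the inputs exist.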