I'm testing Naive Bayes taxonomic classification of V3V4 data using Greengenes2 vs Silva. I'm a bit uncertain about these details:
Should I use the GG2 backbone data (which, to my surprise, seems to be SMALLER than Silva!) or the full data?
Backbone: 2022.10.backbone.full-length.fna.qza + 2022.10.backbone.tax.qza
Full: 2022.10.seqs.fna.qza + 2022.10.taxonomy.asv.tsv.qza
Using RESCRIPt for dereplication after V3V4 truncation, I noticed that the 'silva' option apparently no longer exists for --p-rank-handles. Can I just omit the parameter entirely (for Silva and/or GG2)?
Welcome to the forum!
I'll leave it for someone else to answer the Greengenes2 portion of your question. But I can at least help answer this:
The option is still there, just in a different form. Originally, --p-rank-handles simply let you choose between 'silva', 'greengenes', and 'gtdb' style taxonomy ranks, which historically made use of 6-7 ranks. But since many people use RESCRIPt for other marker genes, we've opened up this parameter so that any set of commonly encountered ranks can be used. These ranks are listed in the --help text. We've set the default to use the following ranks:
'domain', 'phylum', 'class', 'order', 'family', 'genus', 'species'
which is basically the traditional 6-7 ranks of 'silva'. Note: SILVA actually has up to 14 different ranks!
So, if you would like to stick with the standard 6-7 ranks for SILVA, just use the default. Otherwise, you can list all of the ranks you'd like to use, provided they are available for that database. For example, let's imagine that you downloaded data from GenBank and you'd like to add a few more ranks to your taxonomy, e.g. subphylum and subclass, and not make use of the species label. Then you'd use the parameter like so:
--p-rank-handles domain phylum subphylum class subclass order family genus \
Again, make sure any ranks that you choose are actually available within your reference database. For example, the old Greengenes used Kingdom instead of Domain. I do not think this is the case for Greengenes2, but again, I'll defer to the Greengenes2 experts.
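If you want a quick way to see which ranks a reference taxonomy actually carries before choosing rank handles, here is a minimal Python sketch. It assumes the lineages are semicolon-delimited with prefixes such as d__ or k__ before each name; that layout is an assumption about your file, and the function name rank_prefixes and the two example lineages are made up for illustration, not taken from any database release.

```python
from collections import Counter


def rank_prefixes(taxonomy_strings):
    """Count how often each rank prefix (the text before '__') carries a name."""
    counts = Counter()
    for tax in taxonomy_strings:
        for label in tax.split(";"):
            label = label.strip()
            if "__" in label:
                prefix, name = label.split("__", 1)
                if name:  # skip empty placeholders such as 's__'
                    counts[prefix] += 1
    return counts


# Hypothetical lineages for illustration only.
example = [
    "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
]

counts = rank_prefixes(example)
print(sorted(counts))  # ['c', 'd', 'f', 'g', 'k', 'o', 'p']
```

To run this on real data, you would pass in the Taxon column of an exported taxonomy TSV. A prefix that never shows up (or only shows up empty) suggests that the corresponding rank is probably not worth listing in --p-rank-handles for that database.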
Hi @MikaelNiku, to quickly follow up, you can follow this approach for Greengenes2. But I'll leave it to others to add to this.
@MikaelNiku, the post from @SoilRotifer makes sense.
SILVA and Greengenes2 have different QC requirements. For Greengenes2, we opted to limit the set of full-length 16S sequences to specific record sets for which we expect low chimera rates and high-quality base calls. There are quite a lot of noisy and chimeric public records, which are a notorious challenge for databases. We're currently working on expanding the set of full-length 16S sequences considered. Please note that more 16S sequences do not inherently mean that a greater breadth of diversity is covered.