Hi @hdeel,
Following my post yesterday, I had a chat with Daniel and learnt a few things (thanks Daniel). I will attempt to summarise them here. Apologies if I make any mistakes.
V4 ain't V4
The Greengenes2 folk have compiled an enormous database of V4 reads that can be used for taxonomic classification by exact matching, a procedure they call phylogenetic taxonomy.
The (unweighted) V4 Naive Bayes (NB) classifier provided on the data resources page was not trained on that enormous database, but rather on V4 reads extracted from a less enormous database of full-length 16S data.
To our knowledge, no-one has attempted to train a machine learning classifier (NB or otherwise) on the enormous V4 database.
Phylogenetic Taxonomy
Phylogenetic taxonomy can be achieved by running qiime greengenes2 filter-features followed by taxonomy-from-table. Further context is here.
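As a rough sketch of those two steps (from my memory of the q2-greengenes2 README, so please check the flags against qiime greengenes2 --help before relying on them): the file names below are placeholders, and I am assuming the reference is the 2022.10.taxonomy.asv.nwk.qza artifact from the release. The small run wrapper is my own addition so that the script only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Placeholder file names (assumptions): swap in your own artifacts.
TABLE="${TABLE:-feature-table.qza}"
REFERENCE="${REFERENCE:-2022.10.taxonomy.asv.nwk.qza}"
FILTERED="${FILTERED:-feature-table.gg2.qza}"
TAXONOMY="${TAXONOMY:-taxonomy.gg2.qza}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
# (Printed commands are shown without shell quoting.)
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: keep only the features that are present in the reference.
run qiime greengenes2 filter-features \
    --i-feature-table "$TABLE" \
    --i-reference "$REFERENCE" \
    --o-filtered-feature-table "$FILTERED"

# Step 2: read the taxonomy for the remaining features off the reference.
run qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy "$REFERENCE" \
    --i-table "$FILTERED" \
    --o-classification "$TAXONOMY"
```

Running it as is just echoes the two qiime commands, which is a cheap way to check the file names before committing to a real run.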
Weighted Naive Bayes Classification using Greengenes2
To train weighted NB classifiers, you can use the same full-length or V4 databases that were used to train the NB classifiers on the data resources page.
The classifiers on the data resources page were created using this script.
So you could obtain full-length 16S sequences and the corresponding taxonomy from the current Greengenes2 release by running:
wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.tax.qza
At this point, you could proceed from this step in the clawback tutorial to create weighted Greengenes2 classifiers for both full-length 16S and V4 sequences.
(I would probably choose the Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b context to use with qiime clawback assemble-weights-from-Qiita at this point in time.)
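For the full-length case, the weighting and retraining might look roughly like the sketch below. Several things here are assumptions on my part: the name of the unweighted classifier file, the output file names, and the empo_3 / "Animal distal gut" metadata filter (which is only a placeholder for whatever sample type you care about). Please check the flags with qiime clawback assemble-weights-from-Qiita --help. As before, the run wrapper is my own addition and only prints the commands unless DRY_RUN=0.

```shell
#!/bin/sh
# Reference data downloaded with the wget commands above.
REF_SEQS="${REF_SEQS:-2022.10.backbone.full-length.fna.qza}"
REF_TAX="${REF_TAX:-2022.10.backbone.tax.qza}"
# Assumed file name for the unweighted classifier from the data resources page.
CLASSIFIER="${CLASSIFIER:-2022.10.backbone.full-length.nb.qza}"
CONTEXT="${CONTEXT:-Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b}"
# Placeholder Qiita metadata filter: pick the sample type you need.
META_KEY="${META_KEY:-empo_3}"
META_VALUE="${META_VALUE:-Animal distal gut}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
# (Printed commands are shown without shell quoting.)
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assemble taxonomic weights from Qiita samples in the chosen context.
run qiime clawback assemble-weights-from-Qiita \
    --i-classifier "$CLASSIFIER" \
    --i-reference-taxonomy "$REF_TAX" \
    --i-reference-sequences "$REF_SEQS" \
    --p-metadata-key "$META_KEY" \
    --p-metadata-value "$META_VALUE" \
    --p-context "$CONTEXT" \
    --o-class-weight weights.qza

# Retrain the Naive Bayes classifier using those weights.
run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$REF_SEQS" \
    --i-reference-taxonomy "$REF_TAX" \
    --i-class-weight weights.qza \
    --o-classifier weighted-classifier.qza
```

For V4 you would swap in the V4 reference reads and the V4 classifier; the rest should stay the same.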
I hope that helps. Please don't hesitate to get back to us if you'd like more help.
Ben