Greengenes2 and q2-clawback

BenKaehler · March 14, 2024, 7:34pm

Following my post yesterday, I had a chat with Daniel and learnt some things (thanks Daniel), which I will attempt to summarise here. Sorry if I make mistakes.

V4 ain't V4

The Greengenes2 folk have compiled an enormous database of V4 reads that can be used for taxonomic classification by exact matching, a procedure they call phylogenetic taxonomy.

The (unweighted) V4 Naive Bayes (NB) classifier provided on the data resources page was not trained on that enormous database, but rather on V4 reads extracted from a less enormous database of full-length 16S data.

To our knowledge, no-one has attempted to train a machine learning classifier (NB or otherwise) on the enormous V4 database.

Phylogenetic Taxonomy

Phylogenetic taxonomy can be achieved by running qiime greengenes2 filter-features followed by taxonomy-from-table. Further context is here.
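For anyone who wants to try it, the two commands might be chained roughly as in the sketch below. It is an illustration under assumptions rather than a tested recipe: the option names and the reference artifact 2022.10.taxonomy.asv.nwk.qza are as they appear (to the best of my recollection) in the q2-greengenes2 README, and table.qza and the output file names are placeholders. Please check qiime greengenes2 filter-features --help and taxonomy-from-table --help before relying on it. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# ASSUMPTION: the ASV-labelled reference from the 2022.10 release.
REF="${REF:-2022.10.taxonomy.asv.nwk.qza}"

phylogenetic_taxonomy() {
  table="$1"   # your FeatureTable of V4 ASVs (placeholder name)

  # Keep only the features that are present in the Greengenes2 reference.
  run qiime greengenes2 filter-features \
    --i-feature-table "$table" \
    --i-reference "$REF" \
    --o-filtered-feature-table filtered-table.qza || return 1

  # Read the taxonomy of the remaining features off the reference.
  run qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy "$REF" \
    --i-table filtered-table.qza \
    --o-classification phylogenetic-taxonomy.qza
}
```

Calling phylogenetic_taxonomy table.qza prints the two commands; DRY_RUN=0 phylogenetic_taxonomy table.qza would run them in an environment where the plugin is installed.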

Weighted Naive Bayes Classification using Greengenes2

To train weighted NB classifiers, you can use the same full-length or V4 databases that were used to train the NB classifiers on the data resources page.

The classifiers on the data resources page were created using this script.

So you could obtain full-length 16S sequences and the corresponding taxonomy from the current Greengenes2 release by running:

wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.tax.qza

At this point, you could proceed from this step in the clawback tutorial to create weighted Greengenes2 classifiers for both full-length 16S and V4 sequences.

(I would probably choose the Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b context to use with qiime clawback assemble-weights-from-Qiita at this point in time.)
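To give a rough idea of what those tutorial steps look like with the Greengenes2 files downloaded above, here is a hedged sketch for the V4 case. Several things in it are assumptions, not part of the advice above: the primers are the EMP 515F/806R pair, the intermediate and output file names are placeholders, the metadata key and value are whatever sample type you care about (empo_3 and "Animal distal gut" are only an example), and the clawback option names are as I remember them from the tutorial, so please confirm them with qiime clawback assemble-weights-from-Qiita --help. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

SEQS=2022.10.backbone.full-length.fna.qza   # downloaded with wget above
TAX=2022.10.backbone.tax.qza                # downloaded with wget above
CONTEXT=Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b

weighted_gg2_classifier() {
  key="$1"     # Qiita metadata column, e.g. empo_3 (example only)
  value="$2"   # value in that column, e.g. "Animal distal gut" (example only)

  # 1. Extract V4 reads from the full-length backbone (ASSUMED EMP primers).
  run qiime feature-classifier extract-reads \
    --i-sequences "$SEQS" \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --o-reads gg2-v4.qza || return 1

  # 2. Train an unweighted (uniform) V4 classifier.
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gg2-v4.qza \
    --i-reference-taxonomy "$TAX" \
    --o-classifier gg2-v4-uniform.qza || return 1

  # 3. Assemble taxonomic weights from Qiita for the chosen sample type.
  run qiime clawback assemble-weights-from-Qiita \
    --i-classifier gg2-v4-uniform.qza \
    --i-reference-taxonomy "$TAX" \
    --i-reference-sequences gg2-v4.qza \
    --p-metadata-key "$key" \
    --p-metadata-value "$value" \
    --p-context "$CONTEXT" \
    --o-class-weight gg2-weights.qza || return 1

  # 4. Train the weighted V4 classifier (use "$SEQS" here for full-length).
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gg2-v4.qza \
    --i-reference-taxonomy "$TAX" \
    --i-class-weight gg2-weights.qza \
    --o-classifier gg2-v4-weighted.qza
}
```

For example, weighted_gg2_classifier empo_3 "Animal distal gut" prints the four commands in order. Note that the dry-run output drops the quotes around multi-word values, so copy the commands from the function body, not from the printed lines.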

I hope that helps. Please don't hesitate to get back to us if you'd like more help.

Ben