Greengenes2 and q2-clawback

hdeel · March 12, 2024, 5:54pm

Hi Q2 folks!

I am interested in comparing the new Greengenes2 taxonomic classification with SILVA for my dataset (soil) to see which one classifies more reads. It is paired-end 515F/806R data. For the SILVA classification, I'd like to use the pre-trained weighted classifier, and for a more apples-to-apples comparison, I'm wondering if I can (or even should) generate a weighted classifier using the Greengenes2 full-length sequences (full length because I want to use my paired-end data rather than just the forward reads).

For those who know more about Greengenes2 than I do, would it be advisable to make a weighted classifier using q2-clawback on the Greengenes2 full-length sequences? The data resources page still has the Greengenes 13_8 weighted classifier. Is this due to a specific reason, or has it just not been updated yet? Alternatively, I can just compare the weighted SILVA taxonomy with the non-V4-16S Greengenes2 pipeline without weights.

Thank you all for all you do for the QIIME2 community!

Heather

wasade · March 13, 2024, 6:29pm

Hi @hdeel,

We have not compared weighted classifiers with the default NB classifiers, or the phylogenetic taxonomy, from Greengenes2, so at the moment I'm unaware of data to guide the choice here. If building a weighted classifier, I would guess that using the ASVs rather than the full-length sequences would provide more power, as they represent substantial V4 diversity, and the majority of ASVs have environment associations from their sample origins, which could be used to subset the data to ASVs observed in soil.

Please note, though, that we did observe an increase in correlation for paired 16S / WGS samples at genus and species level using the phylogenetic taxonomy vs NB classification (fig 2). Since your data are 515F/806R, I would recommend considering the phylogenetic taxonomy, which is based on the coordinates of the already placed ASVs. The placed ASVs are inclusive of many soil samples, and from what we've seen, I would expect the majority of the sequence mass to be retained. However, we did not place the reverse read.

Best,
Daniel

BenKaehler · March 13, 2024, 8:13pm

Thanks @hdeel and @wasade,

I can confirm that the reason we haven't provided Greengenes2 weighted classifiers is that we haven't updated the provided classifiers, and following this conversation it's on my (our, @Nicholas_Bokulich?) to-do list.

Regarding whether you should use full length or 515f-806r data, @wasade brings to my attention that the Greengenes2 515f-806r data are derived in rather a different way to how we've derived them historically, so my guidance is to try both, if computational resources allow.

So please do go ahead and train weighted classifiers using Greengenes2. More guidance (and example scripts) is available at the readytowear repo, but please don't hesitate to ask questions if you get stuck.

Please note that pre-trained Silva 138.1 (I think the ones on the data resources page are Silva 138) classifiers for soil are available from Zenodo.

Ben

BenKaehler · March 14, 2024, 7:34pm

Hi @hdeel,

Following my post yesterday I had a chat with Daniel and learnt some things (thanks Daniel). I will attempt to summarise those things here. Sorry if I make mistakes.

V4 ain't V4

The Greengenes2 folk have compiled an enormous database of V4 reads that can be used for taxonomic classification by exact matching, a procedure they have coined phylogenetic taxonomy.

The (unweighted) V4 Naive Bayes (NB) classifier provided on the data resources page was not trained on that enormous database, but rather on V4 reads extracted from a less enormous database of full-length 16S data.

To our knowledge, no-one has attempted to train a machine learning classifier (NB or otherwise) on the enormous V4 database.

Phylogenetic Taxonomy

Phylogenetic taxonomy can be achieved by running qiime greengenes2 filter-features followed by taxonomy-from-table. Further context is here.
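
A rough sketch of those two steps follows. The file names are placeholders rather than anything given in this thread, the reference artifact name is an assumption about the 2022.10 release listing, and the option names are as recalled from the q2-greengenes2 README, so please check them against `qiime greengenes2 filter-features --help` and `qiime greengenes2 taxonomy-from-table --help` before relying on them. The commands are only printed unless RUN=1 is set.

```shell
set -e

# Placeholder file names (assumptions, not from this thread).
TABLE="table.qza"                                 # your forward-read ASV table
GG2_TREE_TAXONOMY="2022.10.taxonomy.asv.nwk.qza"  # assumed release artifact name
FILTERED_TABLE="table.gg2.qza"
CLASSIFICATION="taxonomy.gg2.qza"

# Print each command; execute it only when RUN=1 is set.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Keep only the features that are present in the Greengenes2 reference.
run qiime greengenes2 filter-features \
  --i-feature-table "$TABLE" \
  --i-reference "$GG2_TREE_TAXONOMY" \
  --o-filtered-feature-table "$FILTERED_TABLE"

# Read the taxonomy of the retained features off the reference.
run qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy "$GG2_TREE_TAXONOMY" \
  --i-table "$FILTERED_TABLE" \
  --o-classification "$CLASSIFICATION"
```

Comparing the feature counts of the table before and after filter-features gives a quick sense of how much of the data is retained.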

Weighted Naive Bayes Classification using Greengenes2

You can use the full-length or V4 databases that were used to train the NB classifiers on the data resources page to train weighted NB classifiers.

The classifiers on the data resources page were created using this script.

So you could obtain full-length 16S sequences and the corresponding taxonomy from the current Greengenes2 release by running

wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.tax.qza

At this point, you could proceed from this step in the clawback tutorial to create weighted Greengenes2 classifiers for both full-length 16S and V4 sequences.

(I would probably choose the Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b context to use with qiime clawback assemble-weights-from-Qiita at this point in time.)
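
As a hedged sketch of how that could look end to end, loosely following the clawback tutorial: the primer sequences, the empo_3 key and its "Soil (non-saline)" value, and all output file names below are assumptions and not taken from this thread, and the option names should be checked against each command's --help. The commands are only printed unless RUN=1 is set.

```shell
set -e

# Artifacts from the wget commands above.
REF_SEQS="2022.10.backbone.full-length.fna.qza"
REF_TAX="2022.10.backbone.tax.qza"

# Assumptions: 515F/806R primer sequences, Qiita metadata value, output names.
F_PRIMER="GTGYCAGCMGCCGCGGTAA"
R_PRIMER="GGACTACNVGGGTWTCTAAT"
EMPO3_VALUE="Soil (non-saline)"
CONTEXT="Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b"

# Print each command (without re-quoting); execute it only when RUN=1 is set.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# 1. Extract the V4 region from the full-length reference sequences.
run qiime feature-classifier extract-reads \
  --i-sequences "$REF_SEQS" \
  --p-f-primer "$F_PRIMER" \
  --p-r-primer "$R_PRIMER" \
  --o-reads gg2-v4-seqs.qza

# 2. Train an unweighted V4 classifier, used to classify the Qiita reads.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg2-v4-seqs.qza \
  --i-reference-taxonomy "$REF_TAX" \
  --o-classifier gg2-v4-uniform-classifier.qza

# 3. Assemble soil-specific taxonomic weights from Qiita.
run qiime clawback assemble-weights-from-Qiita \
  --i-classifier gg2-v4-uniform-classifier.qza \
  --i-reference-taxonomy "$REF_TAX" \
  --i-reference-sequences gg2-v4-seqs.qza \
  --p-metadata-key empo_3 \
  --p-metadata-value "$EMPO3_VALUE" \
  --p-context "$CONTEXT" \
  --o-class-weight gg2-soil-weights.qza

# 4. Train the weighted full-length classifier.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads "$REF_SEQS" \
  --i-reference-taxonomy "$REF_TAX" \
  --i-class-weight gg2-soil-weights.qza \
  --o-classifier gg2-full-length-soil-classifier.qza
```

For a weighted V4 classifier, step 4 would take gg2-v4-seqs.qza as the reference reads instead of the full-length sequences.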

I hope that helps, please don't hesitate to get back to us if you'd like more help.

Ben

hdeel · March 14, 2024, 8:21pm

Hi @wasade and @BenKaehler,

Thank you both so much for your thorough and helpful answers! I'll let you know if I have any issues. You're all doing a great job with the transition to Greengenes2.

Heather