Introducing Greengenes2 2022.10

wasade · January 23, 2023, 6:12pm

Introduction

We are excited to announce a new release of the Greengenes database, full redesigned from the ground up, backed by whole genomes, with a focus on harmonizing 16S rRNA and shotgun metagenomic datasets. We're calling the new database Greengenes2 due to substantial changes in the design. However, like the original Greengenes, it relies on a de novo phylogeny, and expresses a taxonomy that is derived from the phylogeny.

The focus of this forum post is on how to use the new Greengenes2 database. Detail on the design and evaluation of the resource can be found in the associated Nature Biotechnology publication. We also provide an interactive website to explore the database. The raw files are released under the BSD-3 license and hosted on a public anonymous FTP. A QIIME 2 plugin for interacting with the database is also available.

Background

How is Greengenes2 constructed? Briefly, Greengenes2 relies version 2 of the Web of Life as a backbone phylogeny. The Web of Life is a phylogenomic tree based on 381 marker genes. The phylogeny is then updated, using uDance, with 330k unique high quality full length 16S rRNA from the Living Tree Project, 16S rRNA derived from full length ribosomal operons (data from both Earth Microbiome Project and human fecal samples), and the Genome Taxonomy Database. We then place >20M 16S rRNA V4 Deblur amplicon sequence variants into the tree, using DEPP, derived from over 300k public and private microbiome samples in Qiita. Finally, we construct a taxonomy by integrating the Genome Taxonomy Database with the Living Tree Project, and decorate this onto the phylogeny using tax2tree.

The result is a single phylogeny and taxonomy that is suitable for both short read shotgun metagenomics and 16S rRNA sequencing studies. It allows for directly integrating 16S rRNA and shotgun metagenomic datasets. In addition, we preserve the taxonomic curation from GTDB including its polyphyletic labelings. And, for studies focused on the V4 region of the 16S rRNA gene, taxonomy can be derived directly from the phylogeny without requiring use of Naive Bayes, which appears to yield a higher resolution result relative to Naive Bayes.

How can I use Greengenes2?

The easiest way to use Greengenes2 is through q2-greengenes2. The plugin provides operations for comparing your data against the resource. How you compare your data depends on the type of data that you have. We'll cover a few different scenarios below and provide specific examples of the interaction.

Before we move on, let's install the Greengenes2 plugin!

$ pip install q2-greengenes2

...the next time you run the "qiime" command, it will need to recache its environment which will take a few seconds.

For our examples below, we're going to obtain data from Qiita using redbiom as a way to demonstrate a few ways to use Greengenes2. Neither Qiita nor redbiom are required, they are merely a data source. On that note, let's go ahead and install redbiom:

$ pip install redbiom

If you have V4 data

Greengenes2 contains over 20,000,000 16S rRNA V4 amplicon sequencing fragments, derived from a dizzying collection of public and private microbiome samples in Qiita, representing a very large cross section of environment types. We chose to place V4 fragments, as opposed to other variable regions, as V4 is the predominant region within Qiita. (n.b., in the future we are considering how to place other variable regions).

If you have V4 data, then there is a very good chance the majority of your sequencing data are already represented by Greengenes2 -- this greatly simplifies interaction with the database: we just need to perform a set intersection between what's in your FeatureTable[Frequency] and what exists in the database!

Let's take a look at an example. To source data, we're going to rely on redbiom, which acts as a caching layer to Qiita and provides a command-line driven way to obtain FeatureTables and sample metadata. You can learn more about redbiom in our QIIME 2 tutorial.

First, we're going to set a "context" for redbiom, which represents a group of consistently processed samples (hint: you can find contexts with redbiom summarize contexts). Then, we're going to download the "extreme" microbiome study from Qiita, which is under study 2136. Last, we're going to import the data into QIIME 2:

$ ctx=Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b
$ redbiom fetch qiita-study \
>    --context $ctx \
>    --study-id 2136 \
>    --remove-blanks \
>    --output-basename icu
$ qiime tools import \
>    --input-path icu.biom \
>    --output-path icu.biom.qza \
>    --type FeatureTable[Frequency]
Imported icu.biom as BIOMV210DirFmt to icu.biom.qza

Next, we're going to filter our FeatureTable[Frequency] against Greengenes2. To do so, we're going to download a Phylogeny[Rooted] representation of the taxonomy. In Greengenes2, we provide both classic tab delimited taxonomy files (the ".tsv" file variant), and a tree representation. The tree representation is substantially compressed and simplifies some downstream computations.

The name of the file we're going to download is "2022.10.taxonomy.asv.nwk.qza", which means it is "taxonomy" data with the feature IDs represented as the actual amplicon sequence variants; the "nwk" indicates it is a NewickFormat internally, which is a way of representing tree structures. Additionally, we're going to use the variation of the taxonomy that encodes amplicon sequence variants as the ASVs themselves (as opposed to MD5 hashes). The reason we're using the "asv" representation is that, by default, redbiom outputs amplicon sequence variants as the sequences themselves.

What if my ASVs are hashed? Simple! Just use the "2022.10.taxonomy.md5.nwk.qza" artifact, and you'll be good to go.

NOTE: The filter-features command right now requires around 8-10GB of memory. While conceptually, were simply taking a set intersection of features, the dataset itself is quite large -- this is something we're looking at optimizing in the future.

$ wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.asv.nwk.qza
$ qiime greengenes2 filter-features \
>     --i-feature-table icu.biom.qza \
>     --i-reference 2022.10.taxonomy.asv.nwk.qza \
>     --o-filtered-feature-table icu_gg2.biom.qza
Saved FeatureTable[Frequency] to: icu_gg2.biom.qza

Now that we've filtered our table, we can gather taxonomy information for the amplicon sequence variants represented.

NOTE: Just like filter-features, this command right now will require around 8-10GB of memory.

$ qiime greengenes2 taxonomy-from-table \
>     --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
>     --i-table icu_gg2.biom.qza \
>     --o-classification icu_gg2.taxonomy.qza
Saved FeatureData[Taxonomy] to: icu_gg2.taxonomy.qza

And just like that you've classified your sequence data with Greengenes2!

If you have non-V4 data

Many QIIME 2 users generate amplicons from other variable regions, for very good reasons! Some users may have full length 16S rRNA sequences, or otherwise long fragments which may not be represented in the set of fragments we've already placed. For these situations, what we recommend is to use the non-v4-16s action, which will perform a closed reference OTU picking, using q2-vsearch, against the full length 16S sequences in Greengenes2.

Let's go ahead and try that. As we did in the "V4" scenario, we're going to use redbiom to obtain a dataset. As of writing this post, redbiom does not automatically write out the FeatureData[Sequence], so we're going to extract that information using the biom command and then reformat it into a FASTA file with awk. We'll then import the sequences into QIIME 2. NOTE: The commands below assume the redbiom fetch qiita-study command block from above was run.

$ biom table-ids \
>     -i icu.biom \
>     --observations | \
>         awk '{ print ">" $1 "\n" $1 }' > icu.fna
$ qiime tools import \
>     --input-path icu.fna \
>     --output-path icu.fna.qza \
>     --type FeatureData[Sequence]
Imported icu.fna as DNASequencesDirectoryFormat to icu.fna.qza

To recap, we now have our FeatureTable[Frequency] from before (the icu.biom.qza file). And we just generated our FeatureData[Sequence] artifact. We now have almost everything we need to recruit our data to Greengenes2. The last missing piece is we need to download the backbone 16S rRNA sequences. The backbone represents all of the unique full length 16S rRNA sequences in Greengenes2:

$ wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza

Now that we all of our inputs, we can kick off the Greengenes2 action. You can specify using multiple threads with this action as well to make it run faster:

$ qiime greengenes2 non-v4-16s \
>    --i-table icu.biom.qza \
>    --i-sequences icu.fna.qza \
>    --i-backbone 2022.10.backbone.full-length.fna.qza \
>    --o-mapped-table icu.gg2.biom.qza \
>    --o-representatives icu.gg2.fna.qza
Saved FeatureTable[Frequency] to: icu.gg2.biom.qza
Saved FeatureData[Sequence] to: icu.gg2.fna.qza

Now that we've mapped our data to Greengenes2, let's classify the taxonomy of the sequences! As you may notice, this command is the same as what we used with our V4 data short of some (slightly) different input filenames:

$ qiime greengenes2 taxonomy-from-table \
>     --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
>     --i-table icu.gg2.biom.qza \
>     --o-classification icu.gg2.taxonomy.qza
Saved FeatureData[Taxonomy] to: icu.gg2.taxonomy.qza

If you have shotgun metagenomic data

We recommend processing the short read data using the Woltka toolkit, against version 2 of the Web of Life database (WoLr2 can be found here). After processing your short reads with Woltka, you can use the resulting FeatureTable[Frequency] with the filter-features action of q2-greengenes2. A FeatureData[Taxonomy] artifact can then be generated using the taxonomy-from-table action.

Let's try this out using. First, let's obtain a Woltka processed short read metagenomic run from Qiita using redbiom. For this, we'll need to use a different processing context that we used previously as we want shotgun data not 16S rRNA. Let's look at what Woltka contexts are available:

$ redbiom summarize contexts | grep -i woltka | cut -f 1,2,3 | column -t
Woltka-KEGG-Enzyme-WoLr2-4db94e    37367  3403
Woltka-per-genome-WoLr1-422c8e     35364  10052
Woltka-KEGG-Ontology-WoLr2-7dd29a  37367  9678
Woltka-KEGG-Pathway-WoLr2-58cdd3   37367  110
Woltka-per-genome-WoLr2-3ab352     38110  15326
Woltka-per-genome-WoLr1-1f9978     96     4109

As you can see, there are a few possible ones to select. The first column shown is the context name, the second are the number of samples represented, and the third are the number of features observed. As we are performing a taxonomic assessment, we want to use the "per-genome" context. And additionally, we will use the second version of Web of Life ("WoLr2") as that's what backs Greengenes2:

$ ctx=Woltka-per-genome-WoLr2-3ab352

Next, let's pull some data from from the context irrespective of study. For the sake of the tutorial, let's grab only a small handful by using head, and then import the data into QIIME 2:

$ redbiom fetch samples-contained --context $ctx | \
>     head | \
>     redbiom fetch samples \
>         --context $ctx \
>         --resolve-ambiguities merge \
>         --output woltka.example.biom
$ qiime tools import \
>     --input-path woltka.example.biom \
>     --output-path woltka.example.biom.qza \
>     --type FeatureTable[Frequency]
Imported woltka.example.biom as BIOMV210DirFmt to woltka.example.biom.qza

Great! Now we have a handful of shotgun samples. How can we get the taxonomy out? First, we will filter-features like we did above, as a small set of WoLr2 records are not in Greengenes2. Then, we will use taxonomy-from-table. These commands below assume the wgets for the Greengenes2 data, noted in the above command blocks, have been run:

$ qiime greengenes2 filter-features \
>     --i-feature-table woltka.example.biom.qza \
>     --i-reference 2022.10.taxonomy.asv.nwk.qza \
>     --o-filtered-feature-table woltka_gg2.example.biom.qza
Saved FeatureTable[Frequency] to: woltka_gg2.example.biom.qza
$ qiime greengenes2 taxonomy-from-table \
>     --i-reference-taxonomy 2022.10.taxon
>     --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
>     --i-table woltka_gg2.example.biom.qza \
>     --o-classification woltka_gg2.example.taxonomy.qza
Saved FeatureData[Taxonomy] to: woltka_gg2.example.taxonomy.qza

...and that's it!

If you have paired end data

There are two options. The fragments placed in Greengenes2 are only the forward read, so if you have V4 data, the first option is to simply disregard the reverse read and don't stitch the reads. In which case, you would follow the If you have V4 data steps above. The second option, which will work if you have V4 or non-V4 data, is to use the non-v4-16s action. In which case, you would follow the If you have non-V4 data steps above.

Can I still use Naive Bayes for taxonomy classification?

Yes! In the release files for Greengenes2, and similar to the QIIME 2 data resource, we provide pre-constructed classifiers for the V4 region, as well as for full length.

While we do provide these classifiers, we in general recommend the use of the phylogenetic taxonomy for V4 data (e.g., the taxonomy-from-table actions above). We have observed an improvement in species level per-sample correlation with whole genome shotgun data, relative to Naive Bayes (see figure 2).

Phylogenetic analyses

Once your data are filtered against Greengenes2, through the various scenarios noted above, you can leverage the Phylogeny[Rooted] for calculating UniFrac, Faith's PD or whatever other phylogenetic analysis your heart desires. The key step is to obtain the phylogeny artifact. In the above examples, we represented our ASVs as the sequences themselves, as opposed to MD5 hashed sequences, so we would grab the following tree:

$ wget http://ftp.microbio.me/greengenes_release/current/2022.10.phylogeny.asv.nwk.qza

This tree is compatible with q2-diversity.

Does it matter if I use Deblur or DADA2?

It does not matter if your data are processed with Deblur or DADA2 when running the non-v4-16s action. It may matter when using the filter-features action depending on what type of upstream processing was performed. The majority of the data we inserted from Qiita do not have the primers. If the primers are still on your sequences, it may be necessary to perform a "left" trim for filter-features to work well. Similarly, the majority of the ASVs placed into Greengenes2 are 90nt, 100nt, or 150nt in length so your best bet for overlap is to "right" trim your fragments to one of those lengths.

Interpreting taxon labels

Labels in the taxonomy, and as decorated on the phylogeny, may contain polyphyletic notations. These notations can come from both GTDB and Greengenes2.

First, we utilize the labels that arise from GTDB, such as p__Firmicutes_A, p__Firmicutes_B, etc. Please see the the GTDB FAQ for additional information.

Second, some taxa from the LTP / GTDB hybrid taxonomy we decorate exhibit polyphyly with respect to the Greengenes2 phylogeny. In those instances, we append on the ID of the respective node in the phylogeny. This can result in labels like g__Bifidobacterium_388775 or g__Pirellula_B_754535. The former indicates g__Bifidobacterium is polyphyletic in Greengenes2 (but not GTDB) and `g__Bifidobacterium_388775 is the exact phylogenetic node. The latter describes a taxon that is polyphyletic in GTDB, which is additionally split further within Greengenes2.

We can differentiate the origin of the labels by the character set. GTDB uses capital letters (e.g., "[A-Z]+"), while Greengenes2 uses numbers (e.g., "[0-9]+").

The specific motivation for these notions is to ensure that the expressed taxonomy has an exact relationship to the phylogeny. In contrast, if these labels were omitted, taxa like p__Firmicutes would be non-specific that label would correspond to multiple clades in the phylogeny.

Thank you to user Dengwei Zhang for the inquiry about the labels!

Will Greengenes2 continue to be updated?

Absolutely. We will be refreshing the taxonomy as new versions of GTDB and the Living Tree Project are released. And, we are examining options for incorporating additional sequence, and what a reasonable schedule for doing so will be.

How can I get help?

For right now, if you have questions, concerns or comment with Greengenes2 generally, please email Daniel McDonald at d3mcdonald@eng.ucsd.edu. If you have questions, concerns or comment with the use of Greengenes2 and QIIME 2, please reply to this thread.

colinbrislawn · February 17, 2023, 1:06am

Hello @wasade,

Long time no see!

Just to clarify, can we use the typical taxonomy classification plugins with GreenGenes2? Would these work:

qiime feature-classifier classify-sklearn
qiime feature-classifier classify-consensus-vsearch

If so, are there any caveats to be aware of when using these plugins instead of the pipelines included in the qiime greengenes2 plugin?

wasade · February 17, 2023, 5:23pm

Hey @colinbrislawn

That's a great question. We provide two Naive Bayes trained classifiers, one specific to 515F-806R from the backbone and one based on the full length 16S in the backbone. These are compatible with classify-sklearn. The data resources page for QIIME 2 will reflect this in the upcoming release (corresponding PR).

One thing that is important to consider, which we demonstrated in the preprint, was that we observe an increased Pearson correlation on per-sample taxonomic profiles to paired shotgun data, if we use the phylogenetic taxonomy (e.g., taxonomy-from-table). Briefly, the phylogenetic taxonomy uses the placement information on the phylogeny (based on DEPP), in conjunction with the taxonomic labels placed on the same phylogeny (based on tax2tree). However, this approach only works with V4 ASVs right now.

I'm not aware of a reason why classify-consensus-vsearch would not work. However, I haven't specifically tried it. We do provide compatible artifacts, I'm linking the FeatureData[Sequence] FeatureData[Taxonomy] for the backbone.

Best,
Daniel

sarah872 · February 25, 2023, 1:09pm

May I ask what classifier is the most recent one? The one provided on the qiime2 site (https://data.qiime2.org/2022.11/common/gg-13-8-99-nb-classifier.qza) or the one on the greengenes2 site (http://greengenes.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.nb.qza)?

wasade · February 27, 2023, 4:38pm

Hi @sarah872, great question! The latest Greengenes Naive Bayes classifier for 515f-806r is here, and the latest one for full length is the one that you linked this one specifically.

All the best,
Daniel

iptz1 · March 8, 2023, 2:42pm

Hi,
I have a question on how to train a classifier on V3-V4 regions using Greengenes2 and a specified primer pair. In detail, in the past I usually trained the classifiers using the following commands starting from Greengenes 99% OTU sequences:

qiime feature-classifier extract-reads
--i-sequences 99_otus.qza
--p-f-primer CCTACGGGNBGCASCAG
--p-r-primer GACTACNVGGGTATCTAATCC
--o-reads ref-seqs-classifier.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads ref-seqs-classifier.qza
--i-reference-taxonomy ref-taxonomy.qza
--o-classifier v3v4classifier.qza

How can I proceed to obtain the analogous classifier with this new Greengenes2 database?

Thank you for your precious work and kind help!

Ilaria

wasade · March 8, 2023, 4:44pm

Hi @iptz1,

Thank you for the kind words!

The precise commands we use right now to fit the V4 classifier can be found here.

I think the missing piece for V3-V4 is what --i-sequences to specify with extract-reads, and that would be the backbone full-length sequences.

Best,
Daniel

Xiaojing_Liu · March 9, 2023, 12:10am

hello @wasade !

Thank you for posting this, it is very helpful!
I'm new learner of qiime and I have a very basic question:
I am trying to train my own classifier using greengenes, however, I'm having a hard time finding the files I need for this step (I believe one sequence file and one taxonomy file) from the wensite: Index of /greengenes_release/2022.10.
Is it: 2022.10.backbone.full-length.fna.qza and 2022.10.backbone.tax.qza

Thank you for your help!

wasade · March 9, 2023, 12:29am

Hi @Xiaojing_Liu,

Yes, I believe those are the files that you need

Best,
Daniel

Melisa_Olivelli · March 17, 2023, 8:03pm

Hi!
I also have problems with training the classifier using the Greengenes2. I downloaded the same files than @Xiaojing_Liu but I am completely lost on how to start with these new files. I am trying to train a classifier for my paired-end data. I have a (V3-V4) region (341F:805R). If you could guide me a little I would be very grateful!

wasade · March 17, 2023, 8:19pm

Hi @Melisa_Olivelli,

To train a region specific classifier, it is necessary to use extract-reads from q2-feature-classifier, followed by fit-classifier-naive-bayes. The exact commands used for the V4 classifier with Greengenes2 can be found here.

As an alternative, you could use the classifier trained on the full length records, and should work "out of the box".

All the best,
Daniel

jwdebelius · March 22, 2023, 2:03pm

An off-topic reply has been split into a new topic: Feature classifers in python

Please keep replies on-topic in the future.

Johanndb · March 23, 2023, 1:56pm

Hi @wasade

This is excellent! Thank you kindly for your efforts!

I was just wondering if the command "greengenes2 non-v4-16s" is a separate package I need to download. As I receive the error "QIIME 2 has no plugin/command named 'greengenes2'". Any feedback would be greatly appreciated!

Kind regards,

Johann

wasade · March 23, 2023, 3:07pm

Thanks @Johanndb!

To run qiime greengenes2 non-v4-16s, it is necessary to install the plugin. That can be done with pip install q2-greengenes2.

All the best,
Daniel

iptz1 · March 27, 2023, 5:03pm

Hello @wasade ,
thank you for your kind advice! Can I kindly ask you one more question? I now obtai several taxa names including numbers or capital letters, such as "g__Blautia_A_141781;s__Blautia_A_141781 faecis" or "p__Firmicutes_A;c__Clostridia_258483;o__Lachnospirales;f__Lachnospiraceae;g__Mediterraneibacter_A_155507;s__Mediterraneibacter_A_155507 faecis". So the questions are:

what do the numbers (141781, 258483,155507,..) mean?
what do the letters (Firmicutes _A, Mediterraneibacter _A,...) mean?

Thank you again!
Ilaria

wasade · March 27, 2023, 5:21pm

Hi @iptz1,

Good questions! The _A labels are directly from GTDB (see here for why). The _<number> is used to represent a distinct node in the phylogeny. In this case, "g__Blautia_A" is supported by more than one node, so we have to differentiate them to ensure the taxonomy label is unique. You can find the three Blautia clades in the Greengenes2 website if you'd like to explore the taxonomy directly.

All the best,
Daniel

XIAOXI · April 8, 2023, 3:58pm

Hello, I am also using qiime2 to analyze microbiome data for the first time, I am using V5-V7 region training classifier, I want to ask you if this problem is solved? Can you share your process at this step, thank you very much！

wasade · April 10, 2023, 5:21pm

Hi @XIAOXI,

For V5-V7 data, I recommend using the non-v4-16s action which will perform closed reference OTU picking against the backbone. Or, you could perform naive Bayes classification using the full length model.

All the best,
Daniel

wasade · April 20, 2023, 3:54pm

8 off-topic replies have been split into a new topic: Command not found with redbiom

Please keep replies on-topic in the future.

jwdebelius · April 20, 2023, 2:49pm

An off-topic reply has been split into a new topic: Greengenes Taxonomic Naming Schema: Letters and Numbers

Please keep replies on-topic in the future.