Introduction
We are excited to announce a new release of the Greengenes database, full redesigned from the ground up, backed by whole genomes, with a focus on harmonizing 16S rRNA and shotgun metagenomic datasets. We're calling the new database Greengenes2 due to substantial changes in the design. However, like the original Greengenes, it relies on a de novo phylogeny, and expresses a taxonomy that is derived from the phylogeny.
The focus of this forum post is on how to use the new Greengenes2 database. Detail on the design and evaluation of the resource can be found in the associated Nature Biotechnology publication. We also provide an interactive website to explore the database. The raw files are released under the BSD-3 license and hosted on a public anonymous FTP. A QIIME 2 plugin for interacting with the database is also available.
Background
How is Greengenes2 constructed? Briefly, Greengenes2 relies version 2 of the Web of Life as a backbone phylogeny. The Web of Life is a phylogenomic tree based on 381 marker genes. The phylogeny is then updated, using uDance, with 330k unique high quality full length 16S rRNA from the Living Tree Project, 16S rRNA derived from full length ribosomal operons (data from both Earth Microbiome Project and human fecal samples), and the Genome Taxonomy Database. We then place >20M 16S rRNA V4 Deblur amplicon sequence variants into the tree, using DEPP, derived from over 300k public and private microbiome samples in Qiita. Finally, we construct a taxonomy by integrating the Genome Taxonomy Database with the Living Tree Project, and decorate this onto the phylogeny using tax2tree.
The result is a single phylogeny and taxonomy that is suitable for both short read shotgun metagenomics and 16S rRNA sequencing studies. It allows for directly integrating 16S rRNA and shotgun metagenomic datasets. In addition, we preserve the taxonomic curation from GTDB including its polyphyletic labelings. And, for studies focused on the V4 region of the 16S rRNA gene, taxonomy can be derived directly from the phylogeny without requiring use of Naive Bayes, which appears to yield a higher resolution result relative to Naive Bayes.
How can I use Greengenes2?
The easiest way to use Greengenes2 is through q2-greengenes2
. The plugin provides operations for comparing your data against the resource. How you compare your data depends on the type of data that you have. We'll cover a few different scenarios below and provide specific examples of the interaction.
Before we move on, let's install the Greengenes2 plugin!
$ pip install q2-greengenes2
...the next time you run the "qiime" command, it will need to recache its environment which will take a few seconds.
For our examples below, we're going to obtain data from Qiita using redbiom as a way to demonstrate a few ways to use Greengenes2. Neither Qiita nor redbiom are required, they are merely a data source. On that note, let's go ahead and install redbiom:
$ pip install redbiom
If you have V4 data
Greengenes2 contains over 20,000,000 16S rRNA V4 amplicon sequencing fragments, derived from a dizzying collection of public and private microbiome samples in Qiita, representing a very large cross section of environment types. We chose to place V4 fragments, as opposed to other variable regions, as V4 is the predominant region within Qiita. (n.b., in the future we are considering how to place other variable regions).
If you have V4 data, then there is a very good chance the majority of your sequencing data are already represented by Greengenes2 -- this greatly simplifies interaction with the database: we just need to perform a set intersection between what's in your FeatureTable[Frequency]
and what exists in the database!
Let's take a look at an example. To source data, we're going to rely on redbiom, which acts as a caching layer to Qiita and provides a command-line driven way to obtain FeatureTable
s and sample metadata. You can learn more about redbiom
in our QIIME 2 tutorial.
First, we're going to set a "context" for redbiom, which represents a group of consistently processed samples (hint: you can find contexts with redbiom summarize contexts
). Then, we're going to download the "extreme" microbiome study from Qiita, which is under study 2136. Last, we're going to import the data into QIIME 2:
$ ctx=Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b
$ redbiom fetch qiita-study \
> --context $ctx \
> --study-id 2136 \
> --remove-blanks \
> --output-basename icu
$ qiime tools import \
> --input-path icu.biom \
> --output-path icu.biom.qza \
> --type FeatureTable[Frequency]
Imported icu.biom as BIOMV210DirFmt to icu.biom.qza
Next, we're going to filter our FeatureTable[Frequency]
against Greengenes2. To do so, we're going to download a Phylogeny[Rooted]
representation of the taxonomy. In Greengenes2, we provide both classic tab delimited taxonomy files (the ".tsv" file variant), and a tree representation. The tree representation is substantially compressed and simplifies some downstream computations.
The name of the file we're going to download is "2022.10.taxonomy.asv.nwk.qza", which means it is "taxonomy" data with the feature IDs represented as the actual amplicon sequence variants; the "nwk" indicates it is a NewickFormat
internally, which is a way of representing tree structures. Additionally, we're going to use the variation of the taxonomy that encodes amplicon sequence variants as the ASVs themselves (as opposed to MD5 hashes). The reason we're using the "asv" representation is that, by default, redbiom outputs amplicon sequence variants as the sequences themselves.
What if my ASVs are hashed? Simple! Just use the "2022.10.taxonomy.md5.nwk.qza" artifact, and you'll be good to go.
NOTE: The filter-features
command right now requires around 8-10GB of memory. While conceptually, were simply taking a set intersection of features, the dataset itself is quite large -- this is something we're looking at optimizing in the future.
$ wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.taxonomy.asv.nwk.qza
$ qiime greengenes2 filter-features \
> --i-feature-table icu.biom.qza \
> --i-reference 2022.10.taxonomy.asv.nwk.qza \
> --o-filtered-feature-table icu_gg2.biom.qza
Saved FeatureTable[Frequency] to: icu_gg2.biom.qza
Now that we've filtered our table, we can gather taxonomy information for the amplicon sequence variants represented.
NOTE: Just like filter-features
, this command right now will require around 8-10GB of memory.
$ qiime greengenes2 taxonomy-from-table \
> --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
> --i-table icu_gg2.biom.qza \
> --o-classification icu_gg2.taxonomy.qza
Saved FeatureData[Taxonomy] to: icu_gg2.taxonomy.qza
And just like that you've classified your sequence data with Greengenes2!
If you have non-V4 data
Many QIIME 2 users generate amplicons from other variable regions, for very good reasons! Some users may have full length 16S rRNA sequences, or otherwise long fragments which may not be represented in the set of fragments we've already placed. For these situations, what we recommend is to use the non-v4-16s
action, which will perform a closed reference OTU picking, using q2-vsearch
, against the full length 16S sequences in Greengenes2.
Let's go ahead and try that. As we did in the "V4" scenario, we're going to use redbiom to obtain a dataset. As of writing this post, redbiom does not automatically write out the FeatureData[Sequence]
, so we're going to extract that information using the biom
command and then reformat it into a FASTA file with awk
. We'll then import the sequences into QIIME 2. NOTE: The commands below assume the redbiom fetch qiita-study
command block from above was run.
$ biom table-ids \
> -i icu.biom \
> --observations | \
> awk '{ print ">" $1 "\n" $1 }' > icu.fna
$ qiime tools import \
> --input-path icu.fna \
> --output-path icu.fna.qza \
> --type FeatureData[Sequence]
Imported icu.fna as DNASequencesDirectoryFormat to icu.fna.qza
To recap, we now have our FeatureTable[Frequency]
from before (the icu.biom.qza
file). And we just generated our FeatureData[Sequence]
artifact. We now have almost everything we need to recruit our data to Greengenes2. The last missing piece is we need to download the backbone 16S rRNA sequences. The backbone represents all of the unique full length 16S rRNA sequences in Greengenes2:
$ wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
Now that we all of our inputs, we can kick off the Greengenes2 action. You can specify using multiple threads with this action as well to make it run faster:
$ qiime greengenes2 non-v4-16s \
> --i-table icu.biom.qza \
> --i-sequences icu.fna.qza \
> --i-backbone 2022.10.backbone.full-length.fna.qza \
> --o-mapped-table icu.gg2.biom.qza \
> --o-representatives icu.gg2.fna.qza
Saved FeatureTable[Frequency] to: icu.gg2.biom.qza
Saved FeatureData[Sequence] to: icu.gg2.fna.qza
Now that we've mapped our data to Greengenes2, let's classify the taxonomy of the sequences! As you may notice, this command is the same as what we used with our V4 data short of some (slightly) different input filenames:
$ qiime greengenes2 taxonomy-from-table \
> --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
> --i-table icu.gg2.biom.qza \
> --o-classification icu.gg2.taxonomy.qza
Saved FeatureData[Taxonomy] to: icu.gg2.taxonomy.qza
If you have shotgun metagenomic data
We recommend processing the short read data using the Woltka toolkit, against version 2 of the Web of Life database (WoLr2 can be found here). After processing your short reads with Woltka, you can use the resulting FeatureTable[Frequency]
with the filter-features
action of q2-greengenes2
. A FeatureData[Taxonomy]
artifact can then be generated using the taxonomy-from-table
action.
Let's try this out using. First, let's obtain a Woltka
processed short read metagenomic run from Qiita
using redbiom
. For this, we'll need to use a different processing context that we used previously as we want shotgun data not 16S rRNA. Let's look at what Woltka
contexts are available:
$ redbiom summarize contexts | grep -i woltka | cut -f 1,2,3 | column -t
Woltka-KEGG-Enzyme-WoLr2-4db94e 37367 3403
Woltka-per-genome-WoLr1-422c8e 35364 10052
Woltka-KEGG-Ontology-WoLr2-7dd29a 37367 9678
Woltka-KEGG-Pathway-WoLr2-58cdd3 37367 110
Woltka-per-genome-WoLr2-3ab352 38110 15326
Woltka-per-genome-WoLr1-1f9978 96 4109
As you can see, there are a few possible ones to select. The first column shown is the context name, the second are the number of samples represented, and the third are the number of features observed. As we are performing a taxonomic assessment, we want to use the "per-genome" context. And additionally, we will use the second version of Web of Life ("WoLr2") as that's what backs Greengenes2:
$ ctx=Woltka-per-genome-WoLr2-3ab352
Next, let's pull some data from from the context irrespective of study. For the sake of the tutorial, let's grab only a small handful by using head
, and then import the data into QIIME 2:
$ redbiom fetch samples-contained --context $ctx | \
> head | \
> redbiom fetch samples \
> --context $ctx \
> --resolve-ambiguities merge \
> --output woltka.example.biom
$ qiime tools import \
> --input-path woltka.example.biom \
> --output-path woltka.example.biom.qza \
> --type FeatureTable[Frequency]
Imported woltka.example.biom as BIOMV210DirFmt to woltka.example.biom.qza
Great! Now we have a handful of shotgun samples. How can we get the taxonomy out? First, we will filter-features
like we did above, as a small set of WoLr2 records are not in Greengenes2. Then, we will use taxonomy-from-table
. These commands below assume the wget
s for the Greengenes2 data, noted in the above command blocks, have been run:
$ qiime greengenes2 filter-features \
> --i-feature-table woltka.example.biom.qza \
> --i-reference 2022.10.taxonomy.asv.nwk.qza \
> --o-filtered-feature-table woltka_gg2.example.biom.qza
Saved FeatureTable[Frequency] to: woltka_gg2.example.biom.qza
$ qiime greengenes2 taxonomy-from-table \
> --i-reference-taxonomy 2022.10.taxon
> --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
> --i-table woltka_gg2.example.biom.qza \
> --o-classification woltka_gg2.example.taxonomy.qza
Saved FeatureData[Taxonomy] to: woltka_gg2.example.taxonomy.qza
...and that's it!
If you have paired end data
There are two options. The fragments placed in Greengenes2 are only the forward read, so if you have V4 data, the first option is to simply disregard the reverse read and don't stitch the reads. In which case, you would follow the If you have V4 data steps above. The second option, which will work if you have V4 or non-V4 data, is to use the non-v4-16s
action. In which case, you would follow the If you have non-V4 data steps above.
Can I still use Naive Bayes for taxonomy classification?
Yes! In the release files for Greengenes2, and similar to the QIIME 2 data resource, we provide pre-constructed classifiers for the V4 region, as well as for full length.
While we do provide these classifiers, we in general recommend the use of the phylogenetic taxonomy for V4 data (e.g., the taxonomy-from-table
actions above). We have observed an improvement in species level per-sample correlation with whole genome shotgun data, relative to Naive Bayes (see figure 2).
Phylogenetic analyses
Once your data are filtered against Greengenes2, through the various scenarios noted above, you can leverage the Phylogeny[Rooted]
for calculating UniFrac, Faith's PD or whatever other phylogenetic analysis your heart desires. The key step is to obtain the phylogeny artifact. In the above examples, we represented our ASVs as the sequences themselves, as opposed to MD5 hashed sequences, so we would grab the following tree:
$ wget http://ftp.microbio.me/greengenes_release/current/2022.10.phylogeny.asv.nwk.qza
This tree is compatible with q2-diversity
.
Does it matter if I use Deblur or DADA2?
It does not matter if your data are processed with Deblur or DADA2 when running the non-v4-16s
action. It may matter when using the filter-features
action depending on what type of upstream processing was performed. The majority of the data we inserted from Qiita do not have the primers. If the primers are still on your sequences, it may be necessary to perform a "left" trim for filter-features
to work well. Similarly, the majority of the ASVs placed into Greengenes2 are 90nt, 100nt, or 150nt in length so your best bet for overlap is to "right" trim your fragments to one of those lengths.
Interpreting taxon labels
Labels in the taxonomy, and as decorated on the phylogeny, may contain polyphyletic notations. These notations can come from both GTDB and Greengenes2.
First, we utilize the labels that arise from GTDB, such as p__Firmicutes_A
, p__Firmicutes_B
, etc. Please see the the GTDB FAQ for additional information.
Second, some taxa from the LTP / GTDB hybrid taxonomy we decorate exhibit polyphyly with respect to the Greengenes2 phylogeny. In those instances, we append on the ID of the respective node in the phylogeny. This can result in labels like g__Bifidobacterium_388775
or g__Pirellula_B_754535
. The former indicates g__Bifidobacterium
is polyphyletic in Greengenes2 (but not GTDB) and `g__Bifidobacterium_388775 is the exact phylogenetic node. The latter describes a taxon that is polyphyletic in GTDB, which is additionally split further within Greengenes2.
We can differentiate the origin of the labels by the character set. GTDB uses capital letters (e.g., "[A-Z]+"
), while Greengenes2 uses numbers (e.g., "[0-9]+"
).
The specific motivation for these notions is to ensure that the expressed taxonomy has an exact relationship to the phylogeny. In contrast, if these labels were omitted, taxa like p__Firmicutes
would be non-specific that label would correspond to multiple clades in the phylogeny.
Thank you to user Dengwei Zhang for the inquiry about the labels!
Will Greengenes2 continue to be updated?
Absolutely. We will be refreshing the taxonomy as new versions of GTDB and the Living Tree Project are released. And, we are examining options for incorporating additional sequence, and what a reasonable schedule for doing so will be.
How can I get help?
For right now, if you have questions, concerns or comment with Greengenes2 generally, please email Daniel McDonald at d3mcdonald@eng.ucsd.edu
. If you have questions, concerns or comment with the use of Greengenes2 and QIIME 2, please reply to this thread.