Greengenes2 2022.10 Does Not Have Correct Taxonomic Classifications for Mitochondria and Chloroplast

Roland_Wilhelm · November 2, 2024, 1:48pm

Hi QIIME2 Community,

My research group and I recently ran into problems on a few projects related to host and plant microbiomes because we were using the Greengenes2 database. It turns out that the maintainers have not yet fixed a known issue with how the mitochondrial and chloroplast 16S rRNA genes are annotated in the database's taxonomic classifications.

They were notified of this issue in November of 2023 (link) and again in March of 2024 (link). As of this posting, they have still not fixed this issue.

Our group has returned to using SILVA (current version: 138.2), which reliably provides this information along with current taxonomic names.

This should be a clear warning about the ability of the Greengenes2 maintainers to keep up with the basic needs of this community. I am assuming that in the past year, countless studies of host and plant microbiomes have been contaminated with host sequences for lack of what should ostensibly be a simple fix.

I hope the Greengenes2 maintainers can make this fix happen ASAP and also explain to the community how they will deal with such critical fixes in a more timely manner in future.

colinbrislawn · November 2, 2024, 1:49pm

Hello Roland,

Please refer to the release notes for Greengenes2 2024.09.

Please consider editing your post, at the very least for tone.
We are all trying to help each other, here.

Roland_Wilhelm · November 4, 2024, 1:59pm

Hi Colin,

Thanks for drawing the recent fix (October 2024) to my attention. My post was meant to light a fire to draw attention to a substantive problem. I'm sorry if you found the tone off. I wish the Greengenes2 team well, and am grateful for their work. My criticism is not intended to be personal.

But let's not skirt around accountability. The issue was flagged 11 months ago, and it was obviously a major concern given the number of host-associated microbiome projects and the profile of your article (published in Nature Biotechnology). This error will require many projects to be reanalyzed (it has in my research group), and this will only occur when the microbiome community becomes aware of the issue. I hope the Greengenes2 team might speak to their efforts to notify users of the flaws in the first version of their database (Greengenes2 2022.10). This would go a long way to rebuild trust in their work. My group and I missed the news, which has left us scratching our heads about how something like this could have gone on for so long!

gregcaporaso · November 7, 2024, 4:28pm

Hi @Roland_Wilhelm,
Thanks for your messages here. I understand your frustration and want to weigh in. To be clear, I'm not involved directly with the Greengenes2 project (nor is @colinbrislawn, who replied on this thread, as far as I know), so I can't comment on the workflow/timeline/etc for that project. I do want to comment from the perspective of how QIIME 2 can better serve users and developers running into issues like this.

First, I will say that it is challenging to alert users to these types of issues in a way that they notice because forum posts/tweets/etc are easy to miss. This is a general challenge - not something specific to QIIME 2 or QIIME 2 plugins. We are working on some new functionality that I hope will help alert users to this type of thing in the future. Specifically, we're building functionality that will scan the data provenance associated with a QIIME 2 Result that is being viewed (e.g., with QIIME 2 View, qiime tools view, or Provenance Replay), and cross reference that information against a database of known issues. Users will then be alerted to an issue that is known to impact their Result when they're viewing the Result, rather than having to monitor the forum for issue reports that may or may not impact them.

In a situation where a taxonomic classifier is producing problematic results, we would tag the UUID of the classifier and associate it with a warning and recommendations for addressing the issue in our database, and then present that information very prominently on QIIME 2 View (e.g.) when it's relevant. Because provenance is recorded for all steps of an analysis, any Result that was downstream of that classifier (e.g., taxonomy bar plots, an ANCOM-BC visualization, etc) would trigger the warning and recommendation to be presented to the user.
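As a rough illustration of the idea described above (a minimal sketch, not the QIIME 2 implementation, which is still being built), the check amounts to walking upstream through a Result's provenance and looking up each ancestor in a table of flagged UUIDs. The provenance dictionary, the UUID strings, and the `find_known_issues` function below are all hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical provenance: each Result UUID maps to the UUIDs of its direct inputs.
PROVENANCE = {
    "barplot-uuid": ["table-uuid", "taxonomy-uuid"],
    "taxonomy-uuid": ["rep-seqs-uuid", "classifier-uuid"],
    "table-uuid": [],
    "rep-seqs-uuid": [],
    "classifier-uuid": [],
}

# Hypothetical known-issues database keyed by the UUID of the affected artifact.
KNOWN_ISSUES = {
    "classifier-uuid": "Classifier does not label mitochondria/chloroplast; "
                       "consider re-classifying with an updated classifier.",
}


def find_known_issues(result_uuid, provenance, known_issues):
    """Walk upstream from result_uuid and collect warnings for flagged ancestors."""
    warnings = {}
    seen = {result_uuid}
    queue = deque([result_uuid])
    while queue:
        current = queue.popleft()
        if current in known_issues:
            warnings[current] = known_issues[current]
        # Visit every direct input of the current Result exactly once.
        for parent in provenance.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return warnings


if __name__ == "__main__":
    for uuid, message in find_known_issues(
        "barplot-uuid", PROVENANCE, KNOWN_ISSUES
    ).items():
        print(f"Warning for upstream artifact {uuid}: {message}")
```

In this toy example the bar plot triggers the warning because the flagged classifier is two steps upstream, while the feature table, which does not depend on the classifier, triggers nothing.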

I realize this doesn't address your current issue, but I wanted to share this to let you know that the QIIME 2 team takes this type of thing very seriously, and we're actively working on functionality to help.

wasade · November 13, 2024, 10:27pm

Hi @Roland_Wilhelm,

Thank you for expressing concern about the taxonomic classification of mitochondria and chloroplasts.

Greengenes2 2022.10 includes the mitochondria and chloroplast sequence records from SILVA 138.1. For its taxonomy, Greengenes2 relies on expert curated systems coupled with an objective automated taxonomy decoration. Specifically, we use LTP (a resource of SILVA) and GTDB as the taxonomy source. GTDB follows naming priority from LPSN.

As we learned after the release of 2022.10, neither LTP nor GTDB annotates mitochondria and chloroplast. LPSN also does not appear to have associated entries. Two factors complicate resolving the taxonomy for these taxa; neither applies to the other taxa in Greengenes2, and both require special-case handling.

First, we need some level of taxonomic detail for the records. Those labels need to be biologically meaningful, and to use the same namespace as the rest of the database. To address this, we considered both SILVA and the NCBI Taxonomy.

The NCBI Taxonomy correctly describes the mitochondria and chloroplast records in SILVA as Eukaryotes (see e.g. the SILVA mitochondria record for A. taxiformis). Directly using that taxonomy information would break automated name placement of Bacteria. We further cannot use most of the lineage information for these records, because we would contradict the NCBI Taxonomy if we, for example, described Opisthokonta as being part of Bacteria.

SILVA accounts for this by retaining the species name, without naming intermediate ranks, which creates a gapped taxonomy between the species and family (for mitochondria) or order (for chloroplast). However, Greengenes2 does not use a gapped taxonomy. In the absence of a clear denotation within the species name, describing well-known eukaryotic species as Bacteria could be misleading and would likely draw negative feedback from users.

To resolve these issues for 2024.09, we adopted the lineage information from SILVA, dropped the species labels, and remapped the higher order taxa into the Greengenes2 namespace.
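To make the "gapped taxonomy" point concrete, here is a minimal sketch, assuming a seven-rank `d__...; s__...` lineage string. It is not the Greengenes2 pipeline: the helper names are hypothetical, the example lineage is a made-up SILVA-style record, and the remapping dictionary is purely illustrative rather than the mapping actually used for 2024.09.

```python
RANKS = ["d", "p", "c", "o", "f", "g", "s"]


def parse_lineage(lineage):
    """Split 'd__X; p__Y; ...' into {rank prefix: name}, with '' for unnamed ranks."""
    parsed = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        parsed[prefix] = name
    return parsed


def is_gapped(parsed):
    """Return True if an unnamed rank sits between two named ranks."""
    names = [parsed.get(rank, "") for rank in RANKS]
    named = [i for i, name in enumerate(names) if name]
    if not named:
        return False
    return any(not names[i] for i in range(named[0], named[-1] + 1))


def drop_species_and_remap(parsed, remap):
    """Blank the species label and rename any taxon found in remap."""
    out = dict(parsed)
    out["s"] = ""
    return {rank: remap.get(out.get(rank, ""), out.get(rank, "")) for rank in RANKS}


def format_lineage(parsed):
    """Serialize back to the 'd__X; p__Y; ...' form."""
    return "; ".join(f"{rank}__{parsed.get(rank, '')}" for rank in RANKS)


# Hypothetical SILVA-style chloroplast record: named to order, then species only.
silva_like = "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__; g__; s__Zea mays"

# Illustrative remapping only; NOT the actual Greengenes2 namespace mapping.
example_remap = {"Cyanobacteria": "Cyanobacteriota"}

if __name__ == "__main__":
    parsed = parse_lineage(silva_like)
    cleaned = drop_species_and_remap(parsed, example_remap)
    print(is_gapped(parsed), is_gapped(cleaned))
    print(format_lineage(cleaned))
```

Dropping the species label removes the gap between order and species, so the remaining lineage is contiguous and no eukaryotic species name appears under Bacteria.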

Second, during development of 2022.10, we were concerned about the impact that mitochondria and chloroplast 16S rRNA records would have on topology updates with uDance. Out of caution, we did not include these records in the topology update, and instead placed them using DEPP. Fragment placement does not affect the backbone topology.

Taxonomy decoration occurs on the backbone, and is propagated to the DEPP insertions. Because mitochondria and chloroplasts were not in the 2022.10 backbone, we could not take the derived taxonomy from SILVA and factor it in at the time of decoration alongside the rest of the taxonomy data for Greengenes2. We then considered whether we could identify specific phylogenetic nodes, and manually place labels. However, neither mitochondria nor chloroplasts fall within a single clade in the phylogeny, and labeling their respective lowest common ancestors would be incorrect for other taxa. We suspect the lack of monophyly is due to a combination of (1) the exclusion of these records in the topology update and (2) a limited set of full-length 16S rRNA for these organelles in the backbone (if any are present, they would be unlabeled in the set of ribosomal operons sequenced from American Gut and Earth Microbiome Project samples used in the resource).
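The problem with labeling a lowest common ancestor when a group is not monophyletic can be seen on a toy tree. The sketch below is only an illustration with a made-up four-tip topology and hypothetical helper names; it does not reflect the Greengenes2 phylogeny or tooling.

```python
# Hypothetical toy tree stored as child -> parent; "root" has no parent.
PARENT = {
    "n1": "root",
    "n2": "root",
    "mito_A": "n1",
    "bact_X": "n1",
    "mito_B": "n2",
    "bact_Y": "n2",
}


def path_to_root(node, parent):
    """Return the list of nodes from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(tips, parent):
    """Return the deepest node that is an ancestor of (or equal to) every tip."""
    paths = [path_to_root(tip, parent) for tip in tips]
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Paths run tip -> root, so the first shared node is the deepest one.
    return next(node for node in paths[0] if node in common)


def tips_under(node, parent):
    """Return the sorted tips (nodes with no children) descending from node."""
    internal = set(parent.values())
    return sorted(
        n for n in parent
        if n not in internal and node in path_to_root(n, parent)
    )


if __name__ == "__main__":
    organelle_tips = ["mito_A", "mito_B"]
    lca = lowest_common_ancestor(organelle_tips, PARENT)
    mislabeled = [t for t in tips_under(lca, PARENT) if t not in organelle_tips]
    print(f"LCA: {lca}; tips that would be wrongly labeled: {mislabeled}")
```

Because the two organelle tips sit in different clades, their lowest common ancestor is the root of the toy tree, and a label placed there would also be inherited by the unrelated bacterial tips.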

We struggled with how to appropriately represent these important taxa for 2024.09. The taxonomy for Greengenes and Greengenes2 is derived from the phylogeny, but the absence of appropriate clades for the taxa meant we had to deviate from this convention. The path of least resistance was to express the sequences and taxonomy during training of the Naive Bayes classifier models made available with the release, which is possible because the Naive Bayes models do not directly use the phylogeny.

Special-casing these taxa is not ideal, but constraining the taxonomy data to the Naive Bayes classifiers is pragmatic. Evaluating the inclusion of mitochondria and chloroplast in the backbone itself is part of ongoing work on the backbone, which was initiated about 18 months ago. We anticipate they will be included in the next backbone update, but it is a surprisingly challenging technical problem.

We appreciate your patience and support as we work to improve this resource.

Sincerely,
The Greengenes2 development team