Hi @Roland_Wilhelm,
Thank you for expressing concern about the taxonomic classification of mitochondria and chloroplasts.
Greengenes2 2022.10 includes the mitochondria and chloroplast sequence records from SILVA 138.1. For its taxonomy, Greengenes2 relies on expert-curated systems coupled with objective, automated taxonomy decoration. Specifically, we use LTP (a resource of SILVA) and GTDB as the taxonomy sources. GTDB follows naming priority from LPSN.
As we learned after the release of 2022.10, neither LTP nor GTDB annotates mitochondria or chloroplasts, and LPSN does not appear to have associated entries either. Two factors complicate resolving the taxonomy for these taxa. Neither applies to the other taxa in Greengenes2, and both require special-case handling.
First, we need some level of taxonomic detail for the records. Those labels need to be biologically meaningful and to use the same namespace as the rest of the database. To address this, we considered both SILVA and the NCBI Taxonomy.
The NCBI Taxonomy correctly describes the mitochondria and chloroplast records in SILVA as Eukaryotes (see e.g. the SILVA mitochondria record for A. taxiformis). Directly using that taxonomy information would break automated name placement of Bacteria. We further cannot use most of the lineage information for these records, because we would contradict the NCBI Taxonomy if we, for example, described Opisthokonta as being part of Bacteria.
SILVA accounts for this by retaining the species name without naming intermediate ranks, which creates a gapped taxonomy between the species and the family (for mitochondria) or order (for chloroplast). However, Greengenes2 does not use a gapped taxonomy. Without a clear denotation within the species name, describing well-known eukaryote species as Bacteria could mislead users and draw negative feedback.
To resolve these issues for 2024.09, we adopted the lineage information from SILVA, dropped the species labels, and remapped the higher order taxa into the Greengenes2 namespace.
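As a rough illustration of that remapping step (not our actual pipeline code), the sketch below truncates a SILVA-style lineage at the organelle rank, which drops the species label, optionally renames higher taxa, and pads the remaining ranks so the result is not gapped. The function name, the example lineage strings, the `d__; p__; ...` output layout, and the contents of `NAME_MAP` are all assumptions made for the example.

```python
# Illustrative sketch only; names, lineages, and the rename table are hypothetical.
RANKS = ["d", "p", "c", "o", "f", "g", "s"]

# Hypothetical rename table standing in for the real namespace mapping.
NAME_MAP = {"Proteobacteria": "Pseudomonadota"}


def remap_lineage(silva_lineage, organelle_rank):
    """Keep the lineage down to the organelle rank, rename, and pad empty ranks."""
    names = [n.strip() for n in silva_lineage.split(";") if n.strip()]
    depth = RANKS.index(organelle_rank) + 1
    # Truncating here drops the species label and anything below the organelle rank.
    kept = [NAME_MAP.get(n, n) for n in names[:depth]]
    # Pad with empty names so every rank is present in the output string.
    padded = kept + [""] * (len(RANKS) - len(kept))
    return "; ".join(f"{rank}__{name}" for rank, name in zip(RANKS, padded))


mito = remap_lineage(
    "Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Mitochondria;"
    "Asparagopsis taxiformis",
    organelle_rank="f",
)
chloro = remap_lineage(
    "Bacteria;Cyanobacteria;Cyanobacteriia;Chloroplast;Example plant species",
    organelle_rank="o",
)
print(mito)
print(chloro)
```

The organelle rank is passed in explicitly because SILVA places mitochondria at family and chloroplast at order, so a single fixed cut-off would not work for both.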
Second, during development of 2022.10, we were concerned about the impact that mitochondria and chloroplast 16S rRNA records would have on topology updates with uDance. Out of caution, we did not include these records in the topology update, and instead placed them using DEPP. Fragment placement does not affect the backbone topology.
Taxonomy decoration occurs on the backbone, and is propagated to the DEPP insertions. Because mitochondria and chloroplasts were not in the 2022.10 backbone, we could not take the derived taxonomy from SILVA and factor it in at the time of decoration alongside the rest of the taxonomy data for Greengenes2. We then considered whether we could identify specific phylogenetic nodes and place labels on them manually. However, neither mitochondria nor chloroplasts fall within a single clade in the phylogeny, and labeling their respective lowest common ancestors would be incorrect for the other taxa descending from those nodes. We suspect the lack of monophyly is due to a combination of (1) the exclusion of these records from the topology update and (2) a limited set of full-length 16S rRNA sequences for these organelles in the backbone (if any are present, they would be unlabeled in the set of ribosomal operons sequenced from American Gut and Earth Microbiome Project samples used in the resource).
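To make the lowest common ancestor problem concrete, here is a toy sketch, assuming a tiny made-up tree stored as a dictionary of child lists. It is not how we inspect the real phylogeny, and all node and tip names are hypothetical. It finds the lowest common ancestor of a set of tips and checks whether that node contains only those tips.

```python
# Toy example; the tree and all names are hypothetical.
def tips_under(children, node):
    """Return the set of tips (leaves) descending from a node."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= tips_under(children, kid)
    return found


def lowest_common_ancestor(children, tips):
    """Return the deepest node that is an ancestor of (or equal to) every tip."""
    parent = {c: p for p, kids in children.items() for c in kids}

    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    tips = list(tips)
    common = path_to_root(tips[0])  # ordered from the tip up to the root
    for tip in tips[1:]:
        ancestors = set(path_to_root(tip))
        common = [n for n in common if n in ancestors]
    return common[0]


def is_monophyletic(children, tips):
    """True if the LCA of the tips has no descendants outside the given set."""
    lca = lowest_common_ancestor(children, tips)
    return tips_under(children, lca) == set(tips)


toy_tree = {
    "root": ["A", "B"],
    "A": ["mito1", "bact1"],
    "B": ["mito2", "bact2"],
}
mito_tips = {"mito1", "mito2"}
print(lowest_common_ancestor(toy_tree, mito_tips))
print(is_monophyletic(toy_tree, mito_tips))
```

In this toy tree the two mitochondrial tips only meet at the root, so a label placed on their lowest common ancestor would also be applied to `bact1` and `bact2`. That is the situation we want to avoid.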
We struggled with how to appropriately represent these important taxa for 2024.09. The taxonomy for Greengenes and Greengenes2 is derived from the phylogeny, but the absence of appropriate clades for these taxa meant we had to deviate from this convention. The path of least resistance was to include the sequences and their taxonomy when training the Naive Bayes classifier models made available with the release, which is possible because the Naive Bayes models do not directly use the phylogeny.
Special-casing these taxa is not ideal, but constraining the taxonomy data to the Naive Bayes classifiers is a pragmatic compromise. Evaluating the inclusion of mitochondria and chloroplasts in the backbone itself is part of ongoing backbone work that began about 18 months ago. We anticipate they will be included in the next backbone update, but it is a surprisingly challenging technical problem.
We appreciate your patience and support as we work to improve this resource.
Sincerely,
The Greengenes2 development team