Where is the best discussion forum for the GreenGenes 2 bioinformatics database? I tried to run a sample through GreenGenes 2, because I heard it has more accurate genera classification. Unfortunately, it also has lots of genera that are completely unrecognizable to me. Examples:
SFMI01
CAJLXD01
CAG-269
I cannot find some of these even in PubMed full text. Having major parts of the biome assigned to genera that appear to have almost no documentation is troubling. Is there any way to get some semantics for these genera? I know their taxonomy, and that is not what I am asking for. I want to know where they were isolated, what their prior assignment was, and whether there is any known effect in humans or animals. Having a distribution for human biomes would also be nice.
Is there any place that would document these new GreenGenes 2 genera better?
cc'ing @wasade as he is the primary developer for the greengenes2 database and the q2-greengenes2 plugin - he should be able to provide you with recommendations here!
The Greengenes2 taxonomy is based on GTDB and the Living Tree Project. Greengenes2 does not introduce new names, but it uses unique suffixes to differentiate labels that are polyphyletic. As an example, you can find more information about SFMI01 on GTDB's website.
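If it helps when looking names up on GTDB, here is a minimal sketch of stripping a disambiguating suffix from a label before searching. It assumes the suffix is an underscore followed by digits at the end of the name; that pattern, the function name `gtdb_lookup_name`, and the second example label are illustrative assumptions, not something taken from the Greengenes2 documentation.

```python
import re

# Assumption: a disambiguating suffix looks like "_<digits>" at the end of
# the label. Adjust the pattern if your labels differ.
_SUFFIX = re.compile(r"_\d+$")


def gtdb_lookup_name(label: str) -> str:
    """Drop a rank prefix such as 'g__' and a trailing numeric suffix."""
    name = label.split("__", 1)[-1]
    return _SUFFIX.sub("", name)


print(gtdb_lookup_name("g__SFMI01"))           # no suffix, name is unchanged
print(gtdb_lookup_name("g__Example_A_12345"))  # made-up label, for illustration
```

Names without such a suffix (SFMI01, CAG-269) pass through unchanged, so the function can be applied to a whole taxonomy column.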
The problem there is that these SFMI01 references on GTDB just give clues to taxonomy, but they say nothing about the semantics of the genus. Is it a pathogen or a beneficial species? Greengenes 2 is saying a human sample has 9.5% of this genus, so it's substantial. Yet I cannot find any research where it is studied for its effects/behavior.
I would guess that is a candidate taxon, which a significant amount of the (true) tree of life is.
Remember that you cannot reliably determine whether a genus, or even a species, is a pathogen or not. Take for example E. coli, which is a pathogen, commensal, or probiotic depending on the strain. And to make it more difficult, most organisms have not been successfully grown in a lab, which limits the potential for experiments.
If this is 16S V4 data, what I recommend is taking the ASV, querying redbiom, and seeing where else it has been found. Or, pulling out the AGP/Microsetta data and seeing whether the presence of that feature correlates with any interesting phenotypes in those independent data. If the sequences are not V4, then you can do the same type of thing, but the scope of well-characterized datasets available is reduced.
A similar approach could be taken if these are shotgun data, whereby you pull existing samples based on the OGU and refine an analysis from there.
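As a rough illustration of the "does presence of the feature correlate with a phenotype" check, whether the feature is an ASV or an OGU, here is a minimal sketch using a 2x2 presence/absence table and a two-sided Fisher's exact test. The choice of test, the function names, and the toy data are all assumptions made for illustration; real AGP/Microsetta data would also need confounders and multiple testing handled.

```python
from math import comb


def contingency(feature_counts, phenotype):
    """Return (a, b, c, d): present+case, present+control,
    absent+case, absent+control."""
    a = b = c = d = 0
    for sample, is_case in phenotype.items():
        present = feature_counts.get(sample, 0) > 0
        if present and is_case:
            a += 1
        elif present:
            b += 1
        elif is_case:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x samples in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the tables that are at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Toy data (hypothetical): counts of the feature per sample and a
# binary phenotype.
counts = {"s1": 12, "s2": 3, "s3": 7, "s4": 5,
          "s5": 0, "s6": 0, "s7": 0, "s8": 0}
pheno = {"s1": True, "s2": True, "s3": True, "s4": False,
         "s5": True, "s6": False, "s7": False, "s8": False}

table = contingency(counts, pheno)
print(table, round(fisher_exact_two_sided(*table), 4))
```

With only eight toy samples the p-value is large, which is expected; the point is the shape of the analysis, not the result.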
Maybe this is an oversimplified / wrong understanding, but is there a simple one-to-one map from each "new" GG2 genus to the older genus name? In other words, either SFMI01 is something totally new and previously undiscovered, or it was previously known by a different name. Shouldn't there be a reference that tells us which of these two is the case and, if there was a name remapping, what the old name was?
The taxon history in GTDB suggests it's been around for a while, see e.g. this record. Why not inquire with GTDB and see what they can indicate about its naming?
...so actually, on closer look, I would guess GTDB named the clade after that record. The GTDB record describes the accession, which is "SFMI00000000.1" and corresponds to this GenBank record.