If you have an Illumina FASTQ corresponding to a human stool sample, which genomics databases and bioinformatics pathway(s) will give you the best accuracy at the genus level? Is there any research that ranks the databases / pipelines by sensitivity and specificity for genera identification?
Here are the two papers I would start with. They are pretty different, so the contrast should provide a good overview of the field.
- Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin | Microbiome
- SINTAX https://www.biorxiv.org/content/10.1101/074161
They agree in one very significant way: over-classification, where novel taxa are incorrectly given existing names, is a huge problem.
bioinformatics pathway(s)
You may have already found the QIIME 2 tutorials, but if not, check out how PD-Mice does it.
Hey @colinbrislawn and @pone,
My test code isn't working, so can I have taxonomy thoughts?
I think @colinbrislawn's references are FANTASTIC! I also think the RESCRIPt paper is a great addition to the list, if you have time for an additional manuscript.
...I've got a few other recs, if you want more papers.
But I think there's a deeper issue, which is the whole question of "accuracy". Accuracy requires having a stable ground truth about what something is and how it's named. And the reality is that we don't have a stable ground truth for taxonomy. The hypothesis around what an organism is, what it should be called, and how that should reflect current and historical understanding shifts all the time. There are several papers, including the RESCRIPt paper, that show taxonomic databases don't agree on assignments. This reflects a lot of trends in the field, including the introduction of new organisms, the constant cycle of renaming things (including phyla), the lack of formal names for uncultured organisms that are nevertheless present, the fact that our reference collections aren't always characterized fully, and the fact that everyone has a slightly different model of what makes "good" taxonomy. The gut microbiome kind of exemplifies this issue because, while it's commonly analyzed, it has a lot of uncultured or difficult-to-culture organisms and the names are constantly changing.
So, I think maybe a better question to ask here is "what is the most useful database" for genus level identification, where "useful" may reflect things like:
- Do you want the genus level assignments to compare to the existing literature?
- How closely does the taxonomy need to align with the phylogeny?
- How tolerant are you of new names that haven't been fully mapped to their older labels?
- How new do you need the database to be?
There are also some convenience considerations, like the existence of pre-trained classifiers or even pre-trained bespoke classifiers, that I think may factor in.
I'm sure if you asked the mods and users here, we'd have a range of opinions on what our most "useful" database is for gut studies, based on our analytical needs.
Best,
Justine
I can think of one way to determine the accuracy of different databases with a particular pipeline:
1. Have a computer program arbitrarily create a very complex biome. That becomes "ground truth" for purposes of what follows.
2. Create a random FASTQ that maps to the virtual biome in step 1). That means you create DNA fragments that resemble what is seen in the real world for a given species/taxa.
3. Run your pipeline/database against that FASTQ.
4. Compare the result to "ground truth".
Now repeat the above thousands of times, with step 1) always creating a new random biome. What this might do in effect is a kind of Monte Carlo simulation, and over the course of many iterations important differences in the pipeline should emerge.
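The loop described above can be sketched in a few lines of Python. This is only a rough illustration under loud assumptions: `random_community`, `simulate_read_labels`, `toy_pipeline`, `score_genera` and `benchmark` are hypothetical names, the "reads" are just true genus labels rather than real FASTQ records, and `toy_pipeline` is a stand-in that mislabels reads at random instead of running a real classifier against a real database. It reports genus-level sensitivity and precision; precision is used instead of specificity because specificity needs a defined set of true negatives.

```python
import random
from statistics import mean

def random_community(genus_pool, n_genera, rng):
    """Step 1: draw a random set of genera with random relative abundances."""
    genera = rng.sample(genus_pool, n_genera)
    weights = [rng.random() for _ in genera]
    total = sum(weights)
    return {g: w / total for g, w in zip(genera, weights)}

def simulate_read_labels(community, n_reads, rng):
    """Step 2 (stand-in): draw the true genus of each simulated read."""
    genera = list(community)
    weights = [community[g] for g in genera]
    return rng.choices(genera, weights=weights, k=n_reads)

def toy_pipeline(read_labels, genus_pool, error_rate, rng):
    """Step 3 (hypothetical): replace with a real classifier + database."""
    return [rng.choice(genus_pool) if rng.random() < error_rate else g
            for g in read_labels]

def score_genera(true_genera, predicted_genera):
    """Step 4: genus-level sensitivity and precision (presence/absence)."""
    truth, pred = set(true_genera), set(predicted_genera)
    tp = len(truth & pred)
    sensitivity = tp / len(truth) if truth else 0.0
    precision = tp / len(pred) if pred else 0.0
    return sensitivity, precision

def benchmark(genus_pool, n_iterations=200, n_genera=10, n_reads=500,
              error_rate=0.05, seed=0):
    """Repeat steps 1-4 and average the scores over all iterations."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        community = random_community(genus_pool, n_genera, rng)
        reads = simulate_read_labels(community, n_reads, rng)
        calls = toy_pipeline(reads, genus_pool, error_rate, rng)
        scores.append(score_genera(community, calls))
    return mean(s for s, _ in scores), mean(p for _, p in scores)

if __name__ == "__main__":
    pool = [f"genus_{i}" for i in range(100)]  # made-up genus names
    print(benchmark(pool))
```

In a real comparison, steps 2 and 3 would be replaced by a read simulator and the actual pipeline under test, with the scoring left unchanged.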
To answer your questions, what I want is:
- Genera should compare to the literature, because otherwise deriving any semantics on whether the genus is "good/bad" becomes difficult.
- The taxonomy does not need to align closely to phylogeny.
- New names are problematic. I already saw, running a 16S sample through Greengenes 2, that there were genera with 5%+ abundance that are not referenced in a single PubMed study. That makes it impossible to generalize about the biome.
- I do not need the database to be the newest, especially since that seems to create the problem with "new names". I need to have the greatest accuracy for the largest number of genera seen in human biomes that map to literature.
- Convenience with things like pre-trained classifiers is probably important, because I am not doing a PhD or commercial project to create new pipelines. Rather, I am trying to improve quality in looking at a few human biomes, because most of what is being sold commercially is only accurate for about 60% of the genera that are being identified.
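As a small illustration of the "abundant but absent from the literature" check described above, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `unreferenced_abundant_genera` is hypothetical, the read counts are invented, `UBA1234` is a made-up placeholder label, and the `known_genera` set stands in for whatever list of literature-referenced genera you build yourself (it does not query PubMed).

```python
def unreferenced_abundant_genera(counts, known_genera, min_fraction=0.05):
    """Return genera at or above min_fraction relative abundance
    that are missing from a user-supplied set of known genera."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        genus: count / total
        for genus, count in counts.items()
        if count / total >= min_fraction and genus not in known_genera
    }

if __name__ == "__main__":
    # Invented read counts per genus for one sample.
    counts = {"Bacteroides": 600, "Blautia": 300, "SFMI01": 80, "UBA1234": 20}
    # Hypothetical list of genera you have found in the literature.
    known = {"Bacteroides", "Blautia"}
    print(unreferenced_abundant_genera(counts, known))
```

Here only `SFMI01` is flagged, since the made-up `UBA1234` falls below the 5% threshold.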
Hi @pone,
I appreciate your computational solution, and I think that's a great way to work out which classifier is most accurate. In fact, that's what they did in the "beating naive bayes" paper.
Whether or not a classifier is accurate is an entirely different problem than whether or not a taxonomic database is accurate.
The problem for taxonomy is that we're modeling the biological world. When I say there's no ground truth, it's not like organisms walk up to us with name tags in Latin [1] that say "Hello, I'm Homo sapiens sapiens" or "Hello, I'm Capra hircus" or "Hello, I'm Pseudomonas aeruginosa". Some of this is because taxonomy is a system we're trying to impose on nature because we need a way to classify it to study it, not because there's some inherent natural benefit in that classification.
Because it's a human-developed system and field of study, you get different, sometimes competing, models. Again, there is no simple ground truth for what we're trying to model, because life [2] doesn't care about pesky things like organism classification; it just wants to, as Jurassic Park puts it, "find a way". Different models are going to reflect different ways of classifying the organisms (different models of taxonomy) and different states of the field, based on varying value propositions, parties, and needs. It's a little bit easier in environments where you're trying to classify cultured organisms whose names are pretty well accepted (the vagina, the mouth). It's a massive problem when you're trying to classify organisms in an environment where many of the organisms are uncultured, and therefore cannot have an official name based on the Official Rules of How We Name Organisms (aka the Prokaryotic Naming Committee).
Based on your value propositions, I'd recommend working with SILVA 138.1 as a reference database. I think it's going to suit your needs best, as it's widely used, well established, and doesn't rely on the taxonomy-phylogeny relationship. You could look into HOMD or Optivag as environment-specific databases, but those may not work as well with Naive Bayesian classification.
That suggestion made, remember that taxonomic classification is not just a function of the database and the classifier; it's also constrained by the biology of your 16S sequence. I can have an accurate database and a perfect classifier, and if the organisms I'm trying to classify are identical over the amplicon region I'm working with, then I'm not going to be able to disambiguate them.
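To make the amplicon point concrete, here is a toy Python sketch. The function name `ambiguous_groups`, the genus labels, and the short sequences are all invented for illustration (real 16S references are far longer); it simply groups reference taxa whose sequences are identical over a chosen region, which no classifier could tell apart from that region alone.

```python
from collections import defaultdict

def ambiguous_groups(references, start, end):
    """Group taxa whose reference sequences are identical over [start:end)."""
    by_amplicon = defaultdict(set)
    for taxon, seq in references.items():
        by_amplicon[seq[start:end]].add(taxon)
    # Only groups with more than one taxon are indistinguishable.
    return [sorted(taxa) for taxa in by_amplicon.values() if len(taxa) > 1]

if __name__ == "__main__":
    # Made-up toy sequences, not real 16S data.
    refs = {
        "Genus_A": "ACGTACGTTTGACCA",
        "Genus_B": "ACGTACGTTTGACGA",  # differs from Genus_A at index 13 only
        "Genus_C": "TTGTACCTTTGAGGA",
    }
    print(ambiguous_groups(refs, 0, 10))  # short region: A and B collapse
    print(ambiguous_groups(refs, 0, 15))  # full length: all distinguishable
```

With the short region, Genus_A and Genus_B fall into one group; over the full toy sequence, no groups remain.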
Best,
Justine
Snarky Footnotes
[1] Unless you go to a mixer in the classics department. Which is great for learning your colleagues' names, but less useful in figuring out the shape of the tree of life.
[2] Okay, the vast majority of life. Sans a handful of pedantic humans.
That's a wonderful paper, Justine! Good find!
Beating Naive Bayes at Taxonomic Classification of 16S rRNA Gene Sequences (PMC8249850)
Our “perfect” classifier tests underline the fact that evolutionary conservation in most genetic targets for microbiome profiling limits the degree of taxonomic resolution that is possible, particularly when sequencing short marker-gene reads. Hence, mature, existing methods for classification (NBC and some alignment-based classifiers) have already neared the upper limits of classification accuracy.
@pone, are you finding this conversation as helpful as I am?
I often arrive at a tipping point on a project where I ask myself
'Why hasn't this been done before'
which is a good reminder for me that someone probably has, and if I go looking for their work I can save myself a ton of time.
I wanted to follow up on a few of the wanted items here. There are I think some important misunderstandings about what is possible and easy, which are not necessarily classifier issues.
Critically, you cannot determine whether a genus (or species) is good or bad by name alone. Many organisms that are "good", like those in yogurt, can be pathogens in unusual circumstances.
Formally named genera will be noted in LPSN. However, much of the tree of life hasn't been formally named.
And, DNA sequencing data are not sufficient to determine whether the observed taxa are living or dead (though that may not always matter for a phenotypic effect, this paper I think).
There are many labels where disregarding phylogeny is misleading. Take "Clostridia" for example which doesn't mean anything from an evolutionary perspective.
In Greengenes, our position is that a taxonomy must correspond to phylogeny to be interpretable. Phylogeny offers a (more) objective basis for the relationships of taxa, whereas taxonomy alone is inherently subjective as it is a human derived entity.
We see this even with full-length 16S, and it's worse when non-Western populations are considered. Unfortunately, reference databases are lacking representation of a large portion of life on the planet. And of the life included, much of it has not been formally named.
The labels, their interpretation, and reliability in any existing paper are relative to the reference used in that paper as well as the methods applied to the data. This want is a very hard one.
Best,
Daniel
LPSN contains no reference to the GG2 genus SFMI01. I don't understand how you are proposing to use it. I would like to understand if SFMI01 is something newly discovered, or whether it is a remapping of something old.
LPSN describes formally named taxa. This one isn't formally named and therefore most likely hasn't been cultured. Curators, like those at GTDB, assign candidate names to phylogenetic clades so that there is at least some type of name for these portions of the tree of life. But remember, most of life on the planet is unnamed at high levels of specificity, much is unnamed below phylum (and there may be undiscovered phyla still), most life has not been cultured, and most life probably is not represented yet in reference databases.