Best sequence database for classifying 16S V4 sequences of human stool samples

rosew · November 17, 2023, 2:40pm

Hello,
Now that Greengenes2 is available, can I ask what others consider to be the best sequence database for classifying 16S V4 sequences of human stool samples?
Is there any community opinion on whether it's better than SILVA or RDP? What about NCBI?
Also, separate from my first question: is there a particular sequence database that is better at classifying Bifidobacterium and Bacteroides down to species level?
Thanks in advance for your input.

jwdebelius · November 17, 2023, 7:30pm

Hi @rosew,

Welcome to the QIIME 2 forum!

I'm going to give my opinion here; I'm sure others have insight too. As usual, the answer is "It Depends", which is basically the unofficial QIIME 2 forum motto.

I haven't played with Greengenes2 yet (unfortunately, all my current projects are past that decision stage), but I'm excited about its ability to integrate metagenomics and 16S data, as well as its phylogenetically coherent taxonomy. This will mean your taxa names are a little bit different, but pretty much none of the taxonomic databases agree, so I think that's kind of a moot point.

If you want species-level classification, I would not use SILVA. They don't curate their species labels, and it's generally bad practice to rely on them. Of course, I'm not a big fan of the whole "species" hypothesis, and I recommend using ASVs as your externally valid labeled unit instead of a definition based on either 95% ANI (meaning mammals all belong to the same "species"-equivalent bucket) or a species definition based on sexual reproduction, which doesn't make sense for a domain of life that picks up genetic material because they saw it and it was cool.

I think the question(s) you want to ask are the following:

- Will your discussion include a section where you feel pressured to compare your taxonomic results to other papers, regardless of the database they've used? If so, pick a database that aligns with the papers you're comparing to. Greengenes2 probably isn't the best answer for this, unfortunately, since it's new.
- Are you willing to gamble on how well Greengenes2 will be adopted vs. existing databases? (I tend to err on the side of newer technologies.)
- Can you deal with the fact that Greengenes2's phylogenetically coherent taxonomy is going to be slightly different from the more standard labels? Are you willing to accept that what was classically "Firmicutes" is in fact a polyphyletic phylum and "Firmicutes_A" a different set of wee beasties?

Best,
Justine

SoilRotifer · November 17, 2023, 10:37pm

I agree with @jwdebelius: it depends on your questions and on the resolution at which you analyze your data. I also prefer to analyze via ASVs, as they will not change... but the taxonomy assigned to them might change at a later date, given the ever-changing world of microbial taxonomy. The taxonomy will also vary between databases, depending on the taxonomic schema and nomenclatural rules each one decided to follow.

Some other thoughts ...

- I became aware that RDP is no longer funded and is likely not going to be maintained much longer... at least not regularly. I am sure someone else closer to the matter can comment on this.
- GTDB also follows the philosophy of providing a phylogenetically consistent taxonomy.
  - Fundamentally, I really like the approach that Greengenes2 and GTDB are taking, even if it means that taxonomic labels will be in flux for a while.
- Some databases might require further curation.
- I do not necessarily trust species-level classifications for such short SSU reads. Many are likely mis- or over-classifications. But your mileage may vary.
- Keep in mind that choosing the proper primer pair / variable region can be just as important for obtaining accurate taxonomy. There are many papers out there that discuss which primer pairs are ideal for disambiguating taxa that are common in given sample types.
- You can classify your reads against several classifiers / reference databases and use tools like RESCRIPt to compare the results.
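
To make the "compare across reference databases" idea concrete, here is a minimal sketch in plain Python. It is not RESCRIPt and does not use any QIIME 2 API. It assumes you have already exported two taxonomy assignments for the same ASVs as dictionaries of feature ID to a semicolon-separated lineage with rank prefixes such as `g__` and `s__`. The function names (`rank_label`, `compare_assignments`) and the toy lineages are made up for illustration and are not real classification results.

```python
def rank_label(taxonomy, rank_prefix):
    """Return the name at one rank (e.g. 'g__'), or None if missing or empty."""
    for field in taxonomy.split(";"):
        field = field.strip()
        if field.startswith(rank_prefix):
            name = field[len(rank_prefix):].strip()
            return name or None
    return None


def compare_assignments(tax_a, tax_b, rank_prefix="g__"):
    """Count how often two databases agree at one rank for shared ASVs."""
    summary = {"agree": 0, "disagree": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for feature_id in sorted(set(tax_a) & set(tax_b)):
        a = rank_label(tax_a[feature_id], rank_prefix)
        b = rank_label(tax_b[feature_id], rank_prefix)
        if a is None and b is None:
            summary["neither"] += 1
        elif b is None:
            summary["only_a"] += 1
        elif a is None:
            summary["only_b"] += 1
        elif a == b:
            summary["agree"] += 1
        else:
            summary["disagree"] += 1
    return summary


# Toy, hand-written lineages (hypothetical ASV IDs, not real output).
database_a = {
    "asv1": "d__Bacteria; g__Bifidobacterium; s__",
    "asv2": "d__Bacteria; g__Bacteroides; s__Bacteroides vulgatus",
    "asv3": "d__Bacteria; g__",
}
database_b = {
    "asv1": "d__Bacteria; g__Bifidobacterium; s__Bifidobacterium longum",
    "asv2": "d__Bacteria; g__Phocaeicola; s__Phocaeicola vulgatus",
    "asv3": "d__Bacteria",
}

for prefix in ("g__", "s__"):
    print(prefix, compare_assignments(database_a, database_b, prefix))
```

Note that a "disagree" here only means the strings differ. As discussed above, that can reflect a renamed or re-circumscribed taxon rather than a genuinely different classification, so disagreements are best inspected by hand.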

Mehrbod_Estaki · November 25, 2023, 2:16am

I think the most important aspects have already been nicely described by @SoilRotifer and @jwdebelius. So here are my quick two cents from a user's perspective in favor of Greengenes2, again with the caveat that it really does depend on your needs.

I've now switched all my analyses to GG2 because I am giving my full support to the GTDB/GG2 philosophy of phylogenetically coherent taxonomy, even if it means some growing pains with regard to misaligning with previous names. I also find GG2 to be much faster than my regular workflow with short reads (16S), which was to use SEPP to insert my reads into a phylogenetic backbone. The GG2 tree is huge, so they have already done most of the heavy lifting during its development. I also think GG2's design lends itself nicely to more frequent updates, so I'm hopeful I won't have to abandon it for the next shiny tool anytime soon.
And lastly, GG2 lends itself nicely to combining 16S datasets with metagenomic ones, and I see this as a big next step in the field when we're doing large meta-analyses.