Reconciling 16S Databases in a Single Pipeline

pone · August 27, 2024, 7:25pm

Is it possible to create a pipeline for 16S sequences that reconciles sequence assignments across multiple databases? The point would be to expose ambiguities and conflicting assignments explicitly in any CSV file produced by the analysis.

If you make NCBI the "master" database, the idea is that each taxon assigned in NCBI would then be broken down into the corresponding assignments made in the other databases. This might be particularly helpful when one of the databases makes an unusual assignment. I had mentioned in other posts that I ran a human biome through Greengenes 2 and it assigned about 9.5% to a genus "SFMI01" that there is very little information about. Imagine instead that you had an assignment (one or more) to NCBI taxa, and then the CSV would show you where the assignments to SFMI01 happened in a CSV column for GG2. In NCBI the taxon might be "Unclassified" but seeing a mapping from unclassified to SFMI01 would still be useful.

Has anyone already written such a pipeline?

SoilRotifer · August 27, 2024, 8:03pm

Hi @pone,

I do not see why not, but it would not be an easy task...

That being said, this is one of the many reasons why we built RESCRIPt, see the paper here. In fact, Figure 5 highlights the differences in taxonomic annotation that you refer to.

Databases can differ because each may use a different taxonomic schema and may make different decisions about curating the data and taxonomy. I'd highly recommend reading these articles (there are more, but these are a good start):

pone · August 27, 2024, 8:27pm

How does RESCRIPt produce its output? Is there an option for tabular-format data that clearly shows mapping from one database to another? The bar breakdown as in Figure 5 is visually interesting but lacks precision.

SoilRotifer · August 28, 2024, 1:25pm

There are quite a few different outputs. You can read through the tutorials.

There are no tools explicitly available within RESCRIPt for the task of mapping between databases. You can use a mix of commands from RESCRIPt, QIIME 2, and other code to work your way there. Not ideal, I know... There is still much to be done in this regard. RESCRIPt is an active project, so we're always looking to add more tools.

jwdebelius · August 28, 2024, 2:19pm

Hi @pone and @SoilRotifer,

A slightly different perspective, but I recently ran into an issue where I had to ensure my names in GTDB matched a list of names from a clinical colleague. You can use the search function on the GTDB website and get a list of genomes that match that string.

I put in your example genus:

When I look this up with full context, I see it's a member of the order Christensenellales and the family Aristaeellaceae; these might be good places to start looking. Remember that microbial taxonomy is a hypothesis, as I mentioned in my other post. We're constructing it based on sequences, not characteristics. Most of the organisms you're dealing with in a gut community, for example, are uncultured. So the expectation that a genus will have complete information, and that you wouldn't need to rely on higher-level information, confuses me.

P.S. You might look for commentary by the ATCC folks about NCBI as a reference standard.

Best,
Justine

SoilRotifer · August 28, 2024, 3:36pm

Hi @pone,

We've been having some discussion on this topic and here are some thoughts:

One thing that might be helpful for a comparison like this is that qiime feature-table tabulate-seqs now takes one or more optional FeatureData[Taxonomy] artifacts, which makes it straightforward to compare how different classifiers assigned taxonomy to the same set of sequences.
Providing the per-sample and total frequencies per feature generated by qiime feature-table summarize-plus as metadata to qiime feature-table tabulate-seqs can facilitate exploration, for example by allowing you to sort the table by the number of samples or the total frequency of the features.
Basically, you can follow something similar to this approach to append multiple taxonomies to a table, or to whatever other output you need.
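
To make that concrete, here is a minimal sketch in plain Python. It is not a QIIME 2 or RESCRIPt feature, just one way the idea could look. It assumes each classifier's taxonomy has been exported to a TSV with "Feature ID" and "Taxon" columns; the function names (read_taxonomy, reconcile, write_csv), the output column names, and the toy data are all made up for illustration. For each taxon in the "master" taxonomy, it counts which taxa the other taxonomy gave to the same features:

```python
import csv
import io
from collections import Counter, defaultdict


def read_taxonomy(handle):
    """Read a TSV with 'Feature ID' and 'Taxon' headers into {feature_id: taxon}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Feature ID"]: row["Taxon"] for row in reader}


def reconcile(primary, secondary, missing="Unassigned"):
    """For each primary taxon, count the secondary taxa given to the same features."""
    mapping = defaultdict(Counter)
    for feature_id, primary_taxon in primary.items():
        # Features absent from the secondary taxonomy are counted under `missing`.
        mapping[primary_taxon][secondary.get(feature_id, missing)] += 1
    return mapping


def write_csv(mapping, handle):
    """Write one row per (primary taxon, secondary taxon) pair with a feature count."""
    writer = csv.writer(handle)
    writer.writerow(["ncbi_taxon", "gg2_taxon", "n_features"])
    for primary_taxon in sorted(mapping):
        for secondary_taxon, count in mapping[primary_taxon].most_common():
            writer.writerow([primary_taxon, secondary_taxon, count])


# Toy data, invented for illustration only (not real classifier output).
ncbi_tsv = "Feature ID\tTaxon\nf1\tUnclassified\nf2\tUnclassified\nf3\tBacteroides\n"
gg2_tsv = "Feature ID\tTaxon\nf1\tg__SFMI01\nf2\tg__SFMI01\nf3\tg__Bacteroides\n"

ncbi = read_taxonomy(io.StringIO(ncbi_tsv))
gg2 = read_taxonomy(io.StringIO(gg2_tsv))
mapping = reconcile(ncbi, gg2)

out = io.StringIO()
write_csv(mapping, out)
print(out.getvalue())
```

Note that this counts features, not reads. To weight by abundance you would add each feature's total frequency instead of 1.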

EDIT: Obviously this deals with amplicon data, but you can use similar approaches on the reference data. The only issue would be dealing with mismatched sequence IDs, or with sequence sets that only partially overlap between the databases.
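
For the mismatching-ID problem on reference data, one possible workaround is to key records by the sequence itself rather than by its ID. A rough sketch in plain Python, assuming you have already loaded each database into an {id: sequence} dict and an {id: taxon} dict; the helper names (seq_key, index_by_sequence, compare) and the toy records are hypothetical. Only exactly identical sequences match after normalization, so databases trimmed to different regions or lengths would need alignment or clustering instead:

```python
import hashlib


def seq_key(sequence):
    """Hash a sequence after normalizing case, RNA/DNA alphabet, and gap characters."""
    normalized = sequence.upper().replace("U", "T").replace("-", "").replace(".", "")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def index_by_sequence(sequences, taxonomy):
    """Map sequence hash -> set of taxa, skipping IDs that have no taxonomy entry."""
    index = {}
    for seq_id, sequence in sequences.items():
        if seq_id in taxonomy:
            index.setdefault(seq_key(sequence), set()).add(taxonomy[seq_id])
    return index


def compare(index_a, index_b):
    """Return taxa pairs for shared sequences, plus the keys unique to each database."""
    shared_keys = index_a.keys() & index_b.keys()
    shared = {key: (index_a[key], index_b[key]) for key in shared_keys}
    return shared, index_a.keys() - index_b.keys(), index_b.keys() - index_a.keys()


# Toy records, invented for illustration only.
seqs_a = {"A1": "ACGTACGT", "A2": "GGGGCCCC"}
tax_a = {"A1": "taxon_from_db_a", "A2": "other_taxon_a"}
seqs_b = {"B9": "acguacgu", "B7": "TTTTAAAA"}  # same first sequence, written as RNA
tax_b = {"B9": "taxon_from_db_b", "B7": "other_taxon_b"}

shared, only_a, only_b = compare(
    index_by_sequence(seqs_a, tax_a), index_by_sequence(seqs_b, tax_b)
)
for key, (taxa_a, taxa_b) in shared.items():
    print(key, sorted(taxa_a), sorted(taxa_b))
print(len(only_a), "sequences only in A;", len(only_b), "only in B")
```

The sets of taxa per sequence also expose cases where a single database assigns the same sequence to more than one taxon.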

Anyone else, please feel free to jump in!

-Cheers!
-Mike