Taxonomic completeness, missingness in BOLD COI database

devonorourke · January 1, 2019, 7:52pm

This post is motivated by an earlier thread with the fantastic @Nicholas_Bokulich. In short, I noticed that quite a few of my representative sequences were being assigned very incomplete taxonomies, and I was frustrated that I couldn't work out why so many contained only Order-level information instead of including at least Family- or Genus-level info.

I work with arthropod COI data, and as far as I know, everyone trying to resolve what kinds of insects are in a sample uses the same database: the Barcode of Life Database (BOLD). I've seen a few QIIME users (and plenty of others outside of QIIME) talk about using BOLD. There are even a few programs to mine this dataset programmatically, for example in R (ex. bold) and Python (bold-identification).

I got to wondering just what was in there... and more to the point, what wasn't in there. Specifically, I wondered just how much taxonomic information was missing for a given sequence. In a perfect world, every sequence would have a complete taxonomic description from Kingdom through Species; we do not live in that perfect world. As you'll see in the plots below, I've come to realize the distribution of taxonomic completeness is not equal across the largest groups of arthropods...

I decided to pull all arthropod records in BOLD that contained a geo description "United States": 161,714 sequences (with records) in all. I focused on the top 9 insect Orders (among all arthropods, the Insecta Class is vastly more represented than any other) and performed the following calculations:

For a given insect Order, how many times does a sequence record contain complete information: annotations including Order, Family, Genus, and Species?
Likewise, how many times does a record contain Order, Family, and Genus, but not Species?
How many times does a record contain Order and Family, but not Genus and Species?
How many times does a record contain only Order information, but not Family, Genus, and Species?
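For anyone wanting to repeat this kind of tally, the four questions above could be sketched roughly as follows. This is only a minimal illustration, not the code I actually ran: it assumes each record has already been parsed into a dict with hypothetical keys `order`, `family`, `genus`, and `species` (empty or missing when BOLD has no annotation), and it bins a record at the deepest rank reached without a gap, which is a simplification for records that skip a rank.

```python
from collections import Counter

# Ranks checked from most to least inclusive (hypothetical field names).
RANKS = ("order", "family", "genus", "species")


def completeness_level(record):
    """Return the deepest rank filled without a gap, or None if Order is missing."""
    deepest = None
    for rank in RANKS:
        value = (record.get(rank) or "").strip()
        if not value:
            break  # stop at the first missing rank
        deepest = rank
    return deepest


def tally_by_order(records):
    """Count records per Order at each completeness level."""
    counts = {}
    for rec in records:
        level = completeness_level(rec)
        if level is None:
            continue  # no Order annotation, so it cannot be grouped
        order = rec["order"].strip()
        counts.setdefault(order, Counter())[level] += 1
    return counts
```

A record counted under `species` is complete, one under `genus` has Order, Family, and Genus but no Species, and so on down to `order`.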

I was shocked to realize the extent to which a supposedly curated database contains sequence records with so little taxonomic information. These aren't bacteria - we don't have a culture-methods problem here... these are bugs. I would think entomologists submitting these types of data can do better than telling a beetle from a moth...

It's actually easier to see the extent of these differences if you normalize the numbers relative to their respective Orders. This is the same plot, just expressing those 4 levels of taxonomic completeness as percentages:
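The normalization itself is just each level's count divided by the total for that Order. A minimal sketch, assuming the per-Order counts are held in a plain dict keyed by completeness level (the function name and the numbers in the usage line are made up for illustration, not taken from BOLD):

```python
def to_percentages(level_counts):
    """Convert raw counts for one Order into percentages of that Order's total."""
    total = sum(level_counts.values())
    if total == 0:
        return {}  # nothing to normalize
    return {level: 100.0 * n / total for level, n in level_counts.items()}


# Illustrative counts only (not real BOLD numbers).
example = to_percentages({"species": 3, "order": 1})
```

Because each Order is scaled by its own total, a small Order and a large Order become directly comparable on the same 0 to 100 axis.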

This isn't a small sample size thing either - the smallest Order posted here (Ephemeroptera) has over 4800 unique sequence entries!

I then wondered if maybe some taxa were being overrepresented because BOLD mines data from GenBank in addition to direct submissions (a seemingly dirty, not-so-secret detail that molecular ecologists tend to neglect in their work). Maybe a bunch of entries among those poorly annotated Dipterans and Lepidopterans were a result of someone submitting multiple samples of the same specimen. This plot is not using dereplicated data - it's all sequence entries that match "United States" and "COI" in BOLD.

It turns out that if we dereplicate the data, the relative proportions work out nearly identically (the absolute number of sequences shrinks a bit, but it doesn't look like there is any disproportionate number of redundant sequences by taxonomic Order).
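One simple way to dereplicate is by exact sequence identity, keeping the first record seen for each distinct sequence. The sketch below takes that approach as an assumption rather than as a description of the exact procedure behind the comparison above; the `sequence` key is hypothetical, and stripping alignment gaps and ignoring case are illustrative choices.

```python
def dereplicate(records):
    """Keep the first record for each distinct sequence (exact match only)."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize case and drop gap characters before comparing.
        key = rec["sequence"].upper().replace("-", "")
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique
```

Keeping only the first record means that identical sequences carrying different annotations collapse to whichever appeared first, so the input order matters.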

So, long post to get to the point of the question:
Is there any sense in the microbial community, among the ITS and 16S folks, of what their database representation looks like at relatively more inclusive groupings like Phylum or Class? Is that too broad a task?

Thanks for your comments

Nicholas_Bokulich · January 4, 2019, 8:08pm

Thank you for sharing @devonorourke ! I have not seen a formal assessment of 16S/ITS databases like this, but problems like this are common, though I believe to a much lesser degree. Sequences are incompletely annotated at all levels, but mostly just species and genus are missing (and in the case of Greengenes this is intentional if that OTU maps to multiple species/genera). Most sequences are annotated at phylum/class level, though there are some stragglers (e.g., SILVA and UNITE databases will have lots of stuff like "unknown fungus").

In my own work I have found that removing things like "unknown fungus" improves classification for ITS sequences (see here and here I think), so I suspect that a similar filter would help remove some rubbish here; I have not seen the same benefits for 16S databases, though (mostly because those missing annotations are predominantly at genus/species levels, so have less of a discombobulating effect on the classifiers).
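A filter along those lines could be sketched as below. This is a hypothetical illustration rather than the procedure used in the linked work: it assumes the reference taxonomy is available as a mapping from sequence ID to a taxonomy string, and the list of uninformative markers is an assumption that would need tuning for any given database.

```python
# Hypothetical markers of uninformative labels; adjust per database.
UNINFORMATIVE = ("unknown", "unidentified", "unclassified")


def is_informative(taxonomy):
    """Return False for empty labels or labels containing an uninformative marker."""
    label = taxonomy.strip().lower()
    if not label:
        return False
    return not any(marker in label for marker in UNINFORMATIVE)


def filter_reference(taxonomy_by_id):
    """Keep only reference entries whose taxonomy label looks informative."""
    return {
        seq_id: tax
        for seq_id, tax in taxonomy_by_id.items()
        if is_informative(tax)
    }
```

The IDs that survive can then be used to subset the matching reference sequences before a classifier is trained on them.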

Unfortunately issues like this are just part of the headaches with compiling, curating, and using sequence reference databases!