dereplicating "error" ... or, how to fix a typo.

devonorourke · August 31, 2020, 9:56pm

I've been working on building a COI database from NCBI references and, separately, a COI database from BOLD references. I wanted to combine the two and evaluate the merged data set. Part of that merging requires a dereplication step after combining the sequence and taxonomy records. The output I was getting wasn't quite right: the dereplicated output was created without error, but a large number of sequences contained nearly empty taxonomy strings. Something like 1/3 of all my sequences (tens of thousands!) had a string that looked like this:

derep_001        Animalia;Chordata; c__; o__; f__; g__; s__

Weird, right? Why would so many taxa get squashed that way? I started with a bit of detective work by running VSEARCH --derep_fulllength and generating a .uc file to see if I could identify cases where BOLD and NCBI sequences were good hits, but one was chosen over the other. Here's one of those hits:
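A minimal sketch of that detective step, assuming the usual 10-column tab-separated .uc layout (record type in column 1, query label in column 9, centroid label in column 10) and assuming sequence IDs start with a `BOLD_` or `NCBI_` prefix like the placeholder IDs in the example below. The function name and the sample records are hypothetical; check the column layout against the VSEARCH manual before relying on it.

```python
import csv
import io

def mixed_source_hits(uc_text, sources=("BOLD", "NCBI")):
    """Return (query, centroid) pairs from H records that span two sources."""
    pairs = []
    for row in csv.reader(io.StringIO(uc_text), delimiter="\t"):
        # Only "H" (hit) records link a query to the centroid it collapsed into.
        if len(row) < 10 or row[0] != "H":
            continue
        query, centroid = row[8], row[9]
        # Assumption: the source database is the ID prefix before the first "_".
        q_src = query.split("_", 1)[0]
        c_src = centroid.split("_", 1)[0]
        if q_src != c_src and q_src in sources and c_src in sources:
            pairs.append((query, centroid))
    return pairs

# Hypothetical records, built with explicit tabs for clarity.
SAMPLE_UC = "\n".join([
    "\t".join(["S", "0", "658", "*", "*", "*", "*", "*", "NCBI_abc", "*"]),
    "\t".join(["H", "0", "658", "100.0", "+", "0", "0", "*", "BOLD_xyz", "NCBI_abc"]),
    "\t".join(["C", "0", "2", "*", "*", "*", "*", "*", "NCBI_abc", "*"]),
])

print(mixed_source_hits(SAMPLE_UC))
```

Each returned pair is a candidate for a side-by-side look at the two taxonomy strings.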

NCBI_abc	Animalia;Chordata;Actinopteri;Tetraodontiformes;Monacanthidae;Meuschenia;Meuschenia scaber
BOLD_xyz	Animalia;Chordata;Actinopterygii;Tetraodontiformes;Monacanthidae;Meuschenia;Meuschenia scaber

Did you spot it? Those darn fish ...

@BenKaehler - when I run get-ncbi-data, I'm sure Actinopteri is the Class you end up pulling, but I'm wondering why, and how often, BOLD is using these alternative naming schemes across the entire taxonomy string (example here: BOLD's Actinopterygii vs. NCBI's Actinopteri).

@SoilRotifer - There must be a way I can try to automate this, right? Something like making a list of all the strings from Kingdom through Family, maybe, and getting total counts of each, and then comparing how often the two names are identical... ugh. This seems like it's destined to be a manual task.
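One way the counting idea above could be sketched: for each pair of identical sequences, split both taxonomy strings by rank and tally every position where the labels disagree. This assumes seven semicolon-delimited ranks in Kingdom-to-Species order; `RANKS`, `count_mismatches`, and the input dictionaries are illustrative names, not part of RESCRIPt.

```python
from collections import Counter

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def count_mismatches(pairs, taxonomy):
    """Tally (rank, label_a, label_b) for every rank where a pair disagrees."""
    counts = Counter()
    for id_a, id_b in pairs:
        labels_a = [x.strip() for x in taxonomy[id_a].split(";")]
        labels_b = [x.strip() for x in taxonomy[id_b].split(";")]
        for rank, a, b in zip(RANKS, labels_a, labels_b):
            if a != b:
                counts[(rank, a, b)] += 1
    return counts

taxonomy = {
    "NCBI_abc": "Animalia;Chordata;Actinopteri;Tetraodontiformes;"
                "Monacanthidae;Meuschenia;Meuschenia scaber",
    "BOLD_xyz": "Animalia;Chordata;Actinopterygii;Tetraodontiformes;"
                "Monacanthidae;Meuschenia;Meuschenia scaber",
}
mismatches = count_mismatches([("BOLD_xyz", "NCBI_abc")], taxonomy)
for (rank, a, b), n in mismatches.most_common():
    print(rank, a, b, n)
```

Sorting the tally by count puts the handful of renames that affect the most sequences at the top, which is where manual review pays off first.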

Thanks for any strategies you can think of that might help me account for instances when BOLD and NCBI are going to end up using different naming strategies.

SoilRotifer · September 1, 2020, 12:01am

Hi @devonorourke,

I've encountered this quite often. Welcome to the world of ever-changing and inconsistent taxonomies and taxonomy curation. Back in the day when I used UTAX / SINTAX (usearch), there was a step where it would warn you about "multiple parents" for lower-level taxonomies, which is what would be highlighted in this case. There was another output that outlined exactly how the dereplicated sequence taxonomy was produced. In short... it was a very manual process, as I would just remove records that were obviously wrong.

You've probably seen many arthropod and mollusc taxonomies that have the same exact sequence! Basically, researchers submit a sequence for one organism when it is really a contaminant sequence from another. This combination gave me many empty taxonomy strings.

Also, this is where other LCA taxonomy options may help. Majority-rule LCA might alleviate this problem, as it will just take the taxonomic annotation that occurs more often for that sequence... then again, that may not completely work either.
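To make the difference concrete, here is a simplified sketch of strict versus majority-rule LCA over the taxonomy strings attached to one dereplicated sequence. It is not RESCRIPt's implementation; the function names and the 0.5 threshold are assumptions for illustration.

```python
from collections import Counter

def _split(taxonomies):
    return [[rank.strip() for rank in t.split(";")] for t in taxonomies]

def strict_lca(taxonomies):
    """Keep ranks only while every record agrees."""
    consensus = []
    for labels in zip(*_split(taxonomies)):
        if len(set(labels)) != 1:
            break
        consensus.append(labels[0])
    return consensus

def majority_lca(taxonomies, threshold=0.5):
    """Keep ranks while one lineage is carried by more than `threshold` of records."""
    records = _split(taxonomies)
    if not records:
        return []
    consensus, candidates = [], records
    for depth in range(min(len(r) for r in records)):
        label, n = Counter(r[depth] for r in candidates).most_common(1)[0]
        if n / len(records) <= threshold:
            break
        consensus.append(label)
        # Follow only the records that still agree with the consensus so far.
        candidates = [r for r in candidates if r[depth] == label]
    return consensus

ncbi = "Animalia;Chordata;Actinopteri;Tetraodontiformes;Monacanthidae;Meuschenia;Meuschenia scaber"
bold = "Animalia;Chordata;Actinopterygii;Tetraodontiformes;Monacanthidae;Meuschenia;Meuschenia scaber"

print(strict_lca([ncbi, bold]))          # stops after Chordata
print(majority_lca([ncbi, ncbi, bold]))  # 2 of 3 records carry the NCBI lineage
print(majority_lca([ncbi, bold]))        # a 1:1 tie has no majority
```

The last call shows why majority rule may not fully help here: with one BOLD record and one NCBI record per sequence, neither label wins, and the result collapses at Class just as it does under strict LCA.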

Not sure if this is possible, but if there is a way to construct a mapping of BOLD IDs to NCBI IDs, you can just replace one with the other, to use the taxonomy string of choice. For each case you'll have to decide if you want the BOLD or the NCBI string. Either, or both, could be wrong...

-Mike

devonorourke · September 1, 2020, 12:19am

Thanks for the validation in knowing I'm not crazy, and there isn't a single and simple step to resolve this.

I'm going to amend the title at the moment as:
"Resolving taxonomic identities in COI, or "...

On the plus side, as I've started to make manual edits, I'm learning about the wide world of animal taxonomy... so many weird and amazing critters...

Turns out that I can account for a lot of the major discrepancies between BOLD and NCBI at the Class level. After a bit of digging, it appears that it's frequently the case that a BOLD class name is a subclass in NCBI, and making that switch to the NCBI name is sufficient to resolve the two completely. Fortunately, just a handful of these Class names resolve tens of thousands of sequences:

BOLD taxa --> NCBI taxa   (arrow shows direction of switch)
-----------------------
c__Actinopterygii --> c__Actinopteri
c__Reptilia --> c__Lepidosauria
c__Elasmobranchii --> c__Chondrichthyes
c__Copepoda --> c__Hexanauplia
c__Thecostraca --> c__Hexanauplia
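A minimal sketch of applying those five switches to a taxonomy string, assuming semicolon-delimited ranks with the Class label in the third position and the `c__` prefix shown in the table. `CLASS_RENAMES` and `harmonize_class` are hypothetical names.

```python
# BOLD class label -> NCBI class label, copied from the table above.
CLASS_RENAMES = {
    "c__Actinopterygii": "c__Actinopteri",
    "c__Reptilia": "c__Lepidosauria",
    "c__Elasmobranchii": "c__Chondrichthyes",
    "c__Copepoda": "c__Hexanauplia",
    "c__Thecostraca": "c__Hexanauplia",
}

def harmonize_class(taxonomy, renames=CLASS_RENAMES, class_index=2):
    """Swap the Class label if it is in the rename table; leave other ranks alone."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    if len(ranks) > class_index:
        ranks[class_index] = renames.get(ranks[class_index], ranks[class_index])
    return ";".join(ranks)

print(harmonize_class("k__Animalia;p__Chordata;c__Actinopterygii;o__Tetraodontiformes"))
```

Restricting the swap to the Class position avoids touching a matching name that happens to appear at another rank. Note that the output is rejoined with a bare `;`, so any spaces after the delimiters are dropped.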

Of course, I'm not proposing these are the correct names. Just that they should represent the same kind of thing with the same name, insofar as for my comparison of shared labels. If you spot anything here that seems like the switch should be in the reverse, feel free to let me know.

BenKaehler · September 2, 2020, 6:11am

Hi @devonorourke, along the same lines, I'm not sure whether this is useful, but you could ask for

--p-ranks kingdom phylum superclass order family genus species

and that would solve the problem for Meuschenia scaber. It would probably cause a bunch of other problems, though.

SoilRotifer · September 2, 2020, 1:12pm

@devonorourke You're giving me horrible flashbacks!
I was looking through some of my old parsing code from several years ago. Depending on the marker gene and the taxonomy available for it, I would do what @BenKaehler suggested: look at the overall accounting of available ranks for the sequence data I was trying to taxonomically curate. Often this required me to use a super- or sub- rank. I wish I had remembered to suggest this earlier. But it looks like you've got it!

devonorourke · September 2, 2020, 1:51pm

Great - thanks for the idea @BenKaehler.

The problems, as you might expect, run deep, and aren't as simple as a single taxonomic level change. In fact, in the example I gave above with the fish, NCBI uses both terms ("Actinopterygii" and "Actinopteri") as a Class label, so... I guess I need to just do the best I can, manually, for now.

Makes me think there will never be a simple way to programmatically merge BOLD and NCBI. In my first foray into cleaning up BOLD and NCBI records, I haven't found an obvious case that "X references are less messy", but perhaps others could articulate a reason to favor one over another.

Thanks for the help!

devonorourke · September 2, 2020, 7:47pm

In case anyone is interested in this thread (other than the RESCRIPt crew of @SoilRotifer, @Nicholas_Bokulich, @BenKaehler, @thermokarst), I thought I'd share the first draft of the results among the BOLD and NCBI labels. Happy to get feedback to improve clarity.

The goal was to depict how often a label might be similar or different between NCBI and BOLD taxonomies. Notably, I'm only talking about animal taxa for the COI marker gene.

Here's how to read the figure:

I broke down this analysis among labels at Phylum, Class, Order, Family, and Genus levels. Each of these levels is represented as a horizontal facet in the figure (level labels indicated at right).
Researchers who submit sequences and taxonomic info for a given reference can (and do!) submit information to BOLD and NCBI simultaneously. To account for this, I split up my data into three sets:

References collected from BOLD's site ("BOLD")
References from NCBI with a keyword that matches those cross-referenced with BOLD ("NCBIob", aka "NCBI-Only-Bold")
References from NCBI that lack the keyword that matches those cross-referenced with BOLD ("NCBInb", aka "NCBI-Not-Bold")

With three distinct sets of information, there are 7 ways they might interact. Each of these interactions is depicted as a discrete label along the x-axis. For example, the first x-axis label shows the number of taxonomic labels that are shared among all 3 sets - it's an "inner join" for all 3 groups. The second through fourth columns are 2-way intersections. These represent labels that are present in two groups, but not in the third group. I've labeled these as "exclusive" sets. The final three columns represent those taxonomic labels unique to a given set.
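The seven interactions can be written out with plain set operations. This is a sketch with toy labels; the function name, the dictionary keys, and their ordering are illustrative and not taken from the figure.

```python
def exclusive_regions(bold, ncbi_ob, ncbi_nb):
    """Split three label sets into the 7 non-overlapping regions of a 3-way Venn."""
    return {
        "all three": bold & ncbi_ob & ncbi_nb,
        "exclusive BOLD + NCBIob": (bold & ncbi_ob) - ncbi_nb,
        "exclusive NCBIob + NCBInb": (ncbi_ob & ncbi_nb) - bold,
        "exclusive BOLD + NCBInb": (bold & ncbi_nb) - ncbi_ob,
        "unique BOLD": bold - ncbi_ob - ncbi_nb,
        "unique NCBIob": ncbi_ob - bold - ncbi_nb,
        "unique NCBInb": ncbi_nb - bold - ncbi_ob,
    }

# Toy label sets (hypothetical).
bold = {"A", "B", "C", "D"}
ncbi_ob = {"A", "B", "E"}
ncbi_nb = {"A", "C", "F", "G"}

regions = exclusive_regions(bold, ncbi_ob, ncbi_nb)
for name, labels in regions.items():
    print(name, sorted(labels))
```

Because the regions never overlap, their sizes add up to the size of the union of the three sets.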

So what are the values? There are actually two sets of numbers represented in each column:

The y axis values scale from 0 to 1, and represent the fraction of taxonomic labels in the given set being depicted.
- For the first four columns, this is the number of shared labels, relative to the "universe" of labels - that is, the sum of all possible unique labels among the three groups (BOLD, NCBIob, NCBInb).
- For the final three columns, however, the number of unique labels is relative to the specific dataset itself. This was to avoid the weirdness of a situation where a single group may have entirely unique taxonomic labels, but if it was plotted relative to the "universe" of labels, it might still only appear to have 1/3 of its labels listed as unique.
The integer values plotted above each bar represent the number of distinct labels in that group. For instance, the first value shown in the first column of the Phylum facet is 17 (see the top-left-most column). This means that there were 17 unique Phylum names shared among all three datasets. If you move across that Phylum facet to the fourth column ("Exclusive BOLD + NCBInb"), the value is 6. That means that there were six unique Phylum names found in both of those datasets that weren't present in the NCBIob dataset. One other thing to keep in mind - if you sum these integers across a taxonomic level (i.e. horizontal facet), you'll get the "universe" of all possible taxonomic labels. Thus, there are 31 unique Phylum labels (17+6+3+5) across all of the BOLD and NCBI records I gathered.
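As a worked example of the two denominators, here is the arithmetic for the Phylum counts quoted above, plus a hypothetical dataset to show why the unique columns are scaled differently. The function names and the 10-of-30 numbers are illustrative assumptions.

```python
def shared_fraction(n_labels, n_universe):
    """Bar height for the first four columns: relative to all unique labels."""
    return n_labels / n_universe

def unique_fraction(n_unique, n_in_dataset):
    """Bar height for the last three columns: relative to that dataset only."""
    return n_unique / n_in_dataset

# Phylum counts quoted in the post.
phylum_universe = 17 + 6 + 3 + 5

print(round(shared_fraction(17, phylum_universe), 3))  # 0.548
print(round(shared_fraction(6, phylum_universe), 3))   # 0.194

# Hypothetical: a dataset with 10 labels, all of them unique, in a universe of 30.
print(unique_fraction(10, 10))            # 1.0
print(round(shared_fraction(10, 30), 3))  # 0.333
```

The last two lines show the "weirdness" being avoided: a dataset whose labels are all unique plots at 1.0 against its own size, but would plot at only about 1/3 against the universe.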
