How to create a dereplicated sequence reference database for taxonomy classification: case of COI

Sorry for the delayed response @Nicholas_Bokulich - was traveling to and from a conference. It was a great opportunity to talk a bit about the new tools in QIIME2 with other users, and now I’m back to thinking about my reference database construction. I really appreciate your comments above and have already employed the exclude-seqs argument with a bat-reference database (to filter out bat sequences). However, I’m now thinking about the arthropod-only database. This comes from BOLD, but I’ve been applying a few other filtering criteria for BOLD.

Perhaps this is a question that QIIME already has an answer: I’m wondering about how to preserve as much taxonomic information for a sequence which contains redundant records. For instance, if I run qiime vsearch dereplicate-sequences (info here), it’s not clear to me how to retain the record (sequence + taxonomic info in header) with the most complete taxonomic information. Or, what would happen if there were two identical sequences which contained equally complete but distinct taxonomies.
A post back in April on the developers forum here suggested that the --p-derep-prefix can resolve issues for pseduo-replication of sequence variants, but it’s my understanding this doesn’t address anything to do with taxonomic information.
Consider the information in the following two files:




seq1    k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
seq2    k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
seq3   k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis
seq4    k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus

We can see from refseqs.fasta that there are a pair of identical sequences (seq1 and seq2; as well as seq3 and seq4). If I was to dereplicate these data, in a perfect world, the outcome would be:

  • seq1 and seq2 are concatenated into a single representitive sequence, however, because seq2 contains complete taxonomic information and no other identical sequence contains any opposing information at equivalent levels, I’d preserve the full taxonomic identities of seq2.
    • My concern is that if the program simply retains the first taxonomic entry, then it will lose out on the potential information contained in redundant sequences
  • seq3 and seq4 contain identical sequences, and both contain full taxonomic records through to a species level. However they disagree at both the species and genus ranks. In this case, I’d like a derplicated record to agree where they share a least common ancestor - the Family in this case.

Thus the desired output for a dereplicated fasta would be:


and the associated dereplicated taxonomic file would be:

derep-seq1    k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
derep-seq2    k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__;s__

Hopefully that makes sense. I’m sure you smart microbiologists have been thinking about these things already. Unfortunately the QIIME docs I’ve come across are sparse on considerations when building your own database - I can see why, after starting to have to deal with the web of complications I’m hitting now!

Thanks for your help,

1 Like

Great questions! but unfortunately out of scope for QIIME 2 — this is a question specifically for creating reference databases, and QIIME 2 was not designed for that use case. I am also not sure there is a good answer. Check out what SILVA does for generating the QIIME 2-compatible database — there may even be scripts somewhere that you can use to replicate. They solve that issue by providing several taxonomies — both majority and consensus taxonomies to cover the cases you describe.

Correct, that just takes care of sequences.

Yep, it is a difficult problem! QIIME 2 is not made to solve that problem, so we defer to the database experts — e.g., SILVA, greengenes, UNITE. I recommend following in their footsteps, unless if you want to put together some custom code to do this.

Incidentally, I have been planning on putting together something that would do precisely what you ask, but I have not gotten around to it.

Sorry I have not really answered your question! Only confirmed that it does not have a trivial answer.

Hello Devon,

I agree with Nick that this is outside the scope of Qiime 2, but I also saw your comment on the vsearch GitHub issues and I wanted to ‘qiime in’ about your desire to, as you say,

preserve as much taxonomic information

Are you sure that method won’t over-classify your reads? :scream_cat:

Robert Edgar, the author of muscle and usearch, makes the argument that that over-classification is one of the largest pitfalls of many classifiers, which produce genus level identifications that are essentially worthless and misleading to researchers.

As you focus on minimizing Type II errors in classifications, I’m equally worried about increasing your Type I error rate.

This is a tricky problem and I’m happy you brought it to the qiime forums!



Hi @colinbrislawn,

I think what I’m finding is that, unlike working with 16S or ITS data, I’m supposedly starting with known voucher specimens. This isn’t always true, but the database is formatted in a way in which we could confirm it to be true (you know if the data is mined from Genbank vs. approved by BOLD).

So my logic was that if the sequence really is a voucher specimen, and there are instances in which you have varying levels of completeness of taxonomic information for an identical sequence, then it seemed appropriate to fill in the missing data and risk the possibility of introducing an error. You could imagine the scenario where a bug has been collected, and one biologist has only a leg to work with. They know the bug is within some taxonomic family, but that’s it. So they sequence the leg, and generate the sequence. Then a year goes by, another biologist get a whole bug and can declare the species of the insect. They sequence it, and it’s the same as the other bug which was only described to family.

What I’d love to know is the degree with which that situation actually happens - how often do we have identical sequences with different levels of taxonomic completeness? And, what is the distribution of incomplete records vs. complete records for the same sequence? If there were, say, 10 complete records and 1 incomplete one, then you’d feel better about going with the complete information. How often does that happen?
I get the point about not introducing incorrect information. But as I understand it with Vsearch, I’m leaving the taxonomic assignment to a sequence up to chance - if there were two records, one complete, one incomplete, then we’re leaving Vsearch to randomly pick which one is the representative taxonomy. That seems avoidable to me if you could quantify the distribution of completeness of records.

There’s a secondary problem too, but at least I’ve found one solution to that so far. Having used Jon Palmer’s amptk program for a while to do most of my amplicon work, he wrote a python script that includes the option of incorporating the reverse of what I’m describing: taking two non-unique but equally complete taxonomic strings with identical sequences, and reverting back taxa levels until they match - the Lowest Common Ancestor approach. That program has that feature built in, and in about 4 million BOLD records, the identical sequence but competing taxonomy thing happens about 10,000 times. 0.25% seems trivial, but who knows when you’re using those 10,000 samples…

I just don’t have enough programming experience. I got throughly slapped at stackoverflow though when I tried posing this problem to the folks with the wisdom. It would be awesome if someone could just help me figure out how to test it; I’m not saying it’s the right approach, but I’d like to know how frequently the situation of incomplete/complete records with redundant sequences actually happens. I can write out what needs to happen in English, but when I tried my hand in R and Python all day today I quickly realized how inept I was with the required syntax.

@colinbrislawn you raise some very good points.

The consensus portion of classify-consensus-vsearch is intended to prevent this type of overclassification by finding consensus taxonomies across sequences. How this parameter influences overclassification (and underclassification) is covered in this paper. We only benchmarked 16S and ITS data there, but I would expect similar trends for other marker genes. To state it briefly: setting maxaccepts=1 will almost certainly overclassify, but higher settings of maxaccepts and min-consensus will reduce this risk (and instead tend toward underclassification).

If anything, I think that @devonorourke is at risk of underclassification, since he is setting very high maxaccepts values. It also sounds like he is actually benchmarking these results using a mock community, so should be able to answer this question for us. :smile:

I agree with Robert, overclassification is a big and overlooked issue. We optimized the methods in QIIME 2 to minimize overclassification, but these will need to be re-tested for COI and other marker genes.

@devonorourke’s problem is a little different, though — as I understand it, he is still at the stage of database construction, which is a formidable but non-novel issue, hence my suggestions to follow in the footsteps of SILVA/greengenes developers. Issues of overclassification will come up later when he reaches the classification step.

The general solution

SILVA taxonomies are formatted so that seq replicates with divergent taxonomies use either the consensus or the majority. Greengenes uses consensus (I believe). So

>seq3;Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis
>seq4;k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus
>seq4;k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus

Would become:

AATTCCGG   k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
AATTCCGG   k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus


AATTCCGG   k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
AATTCCGG   k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__;s__

Thanks @Nicholas_Bokulich.
This still leaves me wondering what to do in the event of a tie…

1 Like

Check out this workflow from the amazing @SoilRotifer. I expect he will have advice to enrich this discussion.

I see this script from @William on step 9 of that workflow, but it looks like that just slices the levels to create a uniform 7-level taxonomy and does not find the consensus/majority taxonomy. @SoilRotifer @William could you please point us in the direction of the relevant code?

Great question. Once we find the relevant code, we will learn what SILVA does…

With step 6 in that workflow, you pick OTUs with different clustering levels. I think that would eliminate the opportunity to quantify when identical sequences have different taxonomic strings, no?

Thanks for the great stuff though!

in a sense dereplication == 100% OTUs… so you can’t follow that workflow to a T, but same idea…

  1. dereplicate
  2. map your unique seqs to the original taxonomies
  3. find the consensus/majority taxonomies

Here are the scripts for creating consensus/majority taxonomy strings:


of course. not enough coffee yet…

I think what I’m after is a 2-pronged approach with regards to consensus. If only I spent more time in college learning Python and less time taking slapshots…

In my own head, its:

  • for any record with non unique sequence,
  • collect all possible tax strings associated with that sequence
  • break it up into it’s 7-level taxa set
  • quantify the number of occurrences at each taxa set and choose the majority sequence

I’d like to also tweak the system so that the lack of information isn’t necessarily counted in the calculation. That is, empty information is worth a value of 0. Thus our majority calculation is only influenced if there is actual competing information.

What I’d love to know if this could be implemented, is how often are the results different? How often are we possibly neglecting the majority of “unknown” with the minority instances of known?

Make sense?

1 Like

Thank you @William!

That must be good for something too! :ice_hockey: :trophy: :

This should be fairly easy — you do not need to tweak your code, you can just drop those sequences from the reference database. It looks like you already have taxonomic rank handles for each level (e.g., p__, c__, etc), so you can just use grep to drop anything with an empty level. I suppose that gets complicated if you want to exclude those in the consensus specifically at levels where they are empty, but take them into account at more basal levels. But just pointing out that a quick and dirty alternative exists.

So many great questions, so few answers. Another one that illustrates that database curation is a tricky task and an automated solution may not be optimal. The incomplete entries may be valid entries where experts could not positively identify the species but know the genus… a large number of complete entries may well be incorrectly annotated! And they certainly are for 16S databases. So assuming complete == good may make too many assumptions.

This gets back to @colinbrislawn’s points about overclassification. Your suggested tweak will certainly make an impact, but we just don’t know whether it’s the good kind of impact! Hopefully your benchmarks will answer this for us for COI data.

1 Like

@devonorourke et al., I have split this topic, since we have diverged into new territory (database building) from the original topic (a probable formatting issue causing a cryptic error).

1 Like

Hello all. This may be a bit off topic for this thread but I think I’ve come up with a way to provide clean Greengenes-like taxonomy formatting for the SILVA reference sequences. For example, here is some on-screen output from some prototype code I’ve written to parse the d,k,p,c,o,f,g ranks:

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Bumilleriopsis AM491616.1.1802

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Chlorellidium FJ030892.1.1781

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Mischococcus AF083400.1.1806

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Pleurochloris AF109728.1.1788

I’ve been working on this a little each day and I still have some minor details to work out. But hopefully it is a start. Once I have the prototype code complete I’ll upload and link it via the forum. The solution appears to be embarrassingly easy.


1 Like

Thank you @SoilRotifer! Sounds like it may be pertinent to @devonorourke’s goals to create clean taxonomy strings (though from a COI database, not SILVA).

Definitely share your solution as a “community resource” or tutorial on the forum when it is ready!

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.