How to create a dereplicated sequence reference database for taxonomy classification: case of COI

Hello all. This may be a bit off topic for this thread, but I think I've come up with a way to provide clean Greengenes-like taxonomy formatting for the SILVA reference sequences. For example, here is some on-screen output from prototype code I've written to parse the d, k, p, c, o, f, g ranks:

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Bumilleriopsis AM491616.1.1802

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Chlorellidium FJ030892.1.1781

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Mischococcus AF083400.1.1806

d__Eukaryota; k__Stramenopiles; p__Ochrophyta; c__Xanthophyceae; o__Mischococcales; g__Pleurochloris AF109728.1.1788

I've been working on this a little each day and still have some minor details to work out, but hopefully it is a start. Once the prototype code is complete, I'll upload it and link it via the forum. The solution appears to be embarrassingly easy.
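To give a rough idea of the formatting step only, here is a minimal Python sketch. It is not the prototype code itself: it assumes the ranks for each sequence have already been worked out and are available as a mapping from rank letter to taxon name, and it assumes that missing ranks are simply skipped (as with the absent f__ entry in the output above). The names `RANK_ORDER` and `format_taxonomy` are hypothetical.

```python
# Illustrative sketch, not the actual prototype.
# Rank prefixes in the order listed in the post: d, k, p, c, o, f, g.
RANK_ORDER = ["d", "k", "p", "c", "o", "f", "g"]


def format_taxonomy(ranks, seq_id):
    """Return a Greengenes-like taxonomy string followed by the sequence ID.

    `ranks` maps a rank letter (e.g. "g") to a taxon name. Ranks that are
    missing or empty are skipped; this is an assumption made so the result
    matches the example output, which has no f__ entry.
    """
    parts = [f"{rank}__{ranks[rank]}" for rank in RANK_ORDER if ranks.get(rank)]
    return "; ".join(parts) + " " + seq_id


# Hypothetical input built from the first example line in the post.
example = {
    "d": "Eukaryota",
    "k": "Stramenopiles",
    "p": "Ochrophyta",
    "c": "Xanthophyceae",
    "o": "Mischococcales",
    "g": "Bumilleriopsis",
}

print(format_taxonomy(example, "AM491616.1.1802"))
```

Assigning each SILVA taxon name to the right rank is the harder part and is not covered by this sketch.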

-Best
-Mike
