Converting a USEARCH UNITE ITS database to QIIME 2 format

Williams · March 18, 2021, 4:03pm

I'm attempting to cluster amplicons with a UNITE ITS database. My lab has an amended USEARCH database from UNITE containing known orchid mycorrhizal fungi (OMF), mostly from the families Serendipitaceae and Sebacinaceae. Is there a way, perhaps using a Python script, to convert this modified database for use with QIIME 2?

The USEARCH database is formatted as follows:

>AB096870|AB096870;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Sebacinales,f:Sebacinaceae,g:f__Sebacinaceae,s:g__f__Sebacinaceae;
TCCGTAGGTGAACCTGCGGAAGGATCATTATTGATTTTGATTTGTTGCCTTCTAGT

I would like it to be formatted for use with QIIME 2, whereby there should be two separate files, a reference sequence file and a taxonomy file:

reference sequence file

>AB096870
TCCGTAGGTGAACCTGCGGAAGGATCATTATTGATTTTGATTTGTTGCCTTCTAGT

taxonomy file
AB096870
k__Fungi;p__Basidiomycota;c__Agaricomycetes;f__Sebacinaceae;g__unidentified;s__unidentified

Thank you very much to anyone who can help!!

colinbrislawn · March 18, 2021, 7:53pm

Hello @Williams,

Welcome to the forums!

I might be able to help out with this.

For testing, could you give me the first few lines of your USEARCH database as an attachment?

If you would be willing to run something like
head -n 20 database.uc > database_sample.uc
and then attach that file to a new post, I could use it to test some commands and try to find a format that works well for you.

Colin

Williams · March 18, 2021, 9:17pm

usearch_unite_its_sample.txt (10.4 KB) usearch_unite_its_sample_2.txt (6.6 KB)

Thank you for the welcome, and thanks for offering to help!!

As requested, I've attached some sample files. I've included two files because it seems some of the added sequences and taxonomy are in a slightly different format (just to complicate things a little more). Some have a fasta header followed by | and then an identifier with no spaces, while others have the fasta header followed by | and then continue on the next line.

colinbrislawn · March 19, 2021, 10:48pm

I know that feeling!

The good news is that I'm seeing consistent spaces in my editor. Maybe it's a word-wrap thing.

The bad news is that usearch_unite_its_sample.txt has two different names in each header. I assume we want the first name like KF410664 instead of the second name like SH1140861.08FU, is that correct?

Here's my code so far. I will update this post once I get feedback from you about how well it works:

Export fasta file: This should work, let me know!

cat usearch_*.txt | sed 's/|.*//g' > combined_usearch.fasta
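
Since the original question asked about Python, here is a minimal sketch of the same header trimming. The function name `strip_headers` is made up for illustration, and the sample record is the one from the first post, assuming header lines start with `>`. Unlike the sed command, it only touches header lines, which should not matter as long as sequence lines never contain `|`.

```python
import io


def strip_headers(lines):
    """Yield FASTA lines, keeping only the text before the first '|' on header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # ">AB096870|AB096870;tax=..." becomes ">AB096870"
            line = line.split("|", 1)[0]
        yield line


# Sample record from the first post, read from memory instead of a file.
sample = io.StringIO(
    ">AB096870|AB096870;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,"
    "o:Sebacinales,f:Sebacinaceae,g:f__Sebacinaceae,s:g__f__Sebacinaceae;\n"
    "TCCGTAGGTGAACCTGCGGAAGGATCATTATTGATTTTGATTTGTTGCCTTCTAGT\n"
)
fasta_lines = list(strip_headers(sample))
print("\n".join(fasta_lines))
```

To run it on real files, pass an open file handle instead of `sample` and write the yielded lines to `combined_usearch.fasta`.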

Export tax file: Work in progress!

cat usearch_*.txt | grep '^>' |  # take only header lines
  sed 's/>//; s/|.*=/\t/; s/:/__/g; s/,/; /g; s/;$//' |  # find-and-replace (\t needs GNU sed)
  sed 's/\(.__\).__[^;]*/\1/g' > combined_usearch.txt  # clean unknown tax levels

The taxonomy one is extra tricky.

It seems that when a taxonomy level is unknown, the last known level is listed.
f:Sebacinaceae,g:f__Sebacinaceae,s:g__f__Sebacinaceae;

The convention in QIIME 2 is to leave these levels blank.
f__Sebacinaceae; g__; s__;

That last sed command should create QIIME 2-style tax levels, even for odd taxa like
f__Thelephoraceae; g__; s__Thelephoraceae_sp;
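
For anyone who would rather do this step in Python, the blanking rule could be sketched roughly like this. The function name `usearch_header_to_taxonomy` is hypothetical, and the sketch assumes every header contains `tax=` followed by comma-separated `rank:name` pairs, as in the sample from the first post. A name that itself starts with a rank prefix such as `f__` is treated as a placeholder and left blank.

```python
import re


def usearch_header_to_taxonomy(header):
    """Convert one USEARCH-style header into (feature_id, taxonomy_string)."""
    # The feature ID is the first name, before the '|'.
    feature_id = header.lstrip(">").split("|", 1)[0]
    # Assumption: the taxonomy follows "tax=" and ends with a ';'.
    tax = header.split("tax=", 1)[1].rstrip(";")
    ranks = []
    for field in tax.split(","):
        rank, _, name = field.partition(":")
        # Names like "f__Sebacinaceae" or "g__f__Sebacinaceae" are copied
        # from a higher rank, so the level is unknown: leave it blank.
        if re.match(r"[a-z]__", name):
            name = ""
        ranks.append(f"{rank}__{name}")
    return feature_id, "; ".join(ranks)


header = (
    ">AB096870|AB096870;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,"
    "o:Sebacinales,f:Sebacinaceae,g:f__Sebacinaceae,s:g__f__Sebacinaceae;"
)
feature_id, taxonomy = usearch_header_to_taxonomy(header)
# One tab-separated line per sequence, as in the taxonomy file.
print(f"{feature_id}\t{taxonomy}")
```

Like the sed version, this keeps the `d__` prefix for the top rank; if your workflow expects `k__Fungi` as in the first post, that prefix would need to be renamed separately.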

I've enjoyed working on my unix commands!
Try these out and let me know how they work,
Colin

P.S. And because I'm posting code, I might as well post a license to use it:
LICENSE.txt (1.5 KB)

Williams · March 22, 2021, 2:13pm

Hey Colin,

Thank you very much for all of this!!!

I should be able to work through it today, then I'll report back on how it works.
