I'm attempting to cluster amplicons with a UNITE ITS database. My lab has an amended USEARCH database from UNITE containing known orchid mycorrhizal fungi (OMF), mostly from the families Serendipitaceae and Sebacinaceae. Is there a way, perhaps using a Python script, to convert this modified database for use with QIIME 2?
For testing, could you give me the first few lines of your USEARCH database as an attachment?
If you would be willing to run something like
head -n 20 database.uc > database_sample.uc
and then attach that file to a new post, I could use it to test out some commands and try to find a format that works well for you.
Thank you for the welcome, and thanks for offering to help!!
As requested, I've attached some sample files. I've included two files because it seems some of the added sequences and taxonomy are in a slightly different format (just to complicate things a little more). Some have a FASTA header followed by | and then an identifier with no spaces, while others have the FASTA header followed by | - and then continue on the next line.
The bad news is that usearch_unite_its_sample.txt has two different names in each header. I assume we want the first name like KF410664 instead of the second name like SH1140861.08FU, is that correct?
Here's my code so far. I will update this post once I get feedback from you about how well it works:
Export fasta file: This should work, let me know!
cat usearch_*.txt | sed 's/|.*//g' > combined_usearch.fasta
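Since the original question asked about Python, here is a minimal sketch of the same header clean-up. The function name strip_header_annotations is made up for illustration, and unlike the sed command it only trims lines that start with >. It assumes each header sits on a single line, so it does not handle the wrapped "| -" headers mentioned above.

```python
def strip_header_annotations(lines):
    """Yield FASTA lines with everything from the first '|' removed on header lines.

    Hypothetical helper: assumes one-line headers of the form '>ID|annotations'.
    Sequence lines are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the part before the first '|', e.g. '>KF410664'
            line = line.split("|", 1)[0]
        yield line


# Small in-memory example; with real files you would pass an open file handle instead.
example = [">KF410664|SH1140861.08FU;tax=f:Sebacinaceae;\n", "ACGT\n"]
cleaned = list(strip_header_annotations(example))
print("\n".join(cleaned))
```

QIIME 2 expects every sequence ID to be unique, so it is worth checking the cleaned headers for duplicates before importing.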
Export tax file: Work in progress!
# Note: \t in the sed replacement needs GNU sed
cat usearch_*.txt | grep '^>' |                            # take only header lines
  sed 's/>//; s/|.*=/\t/; s/:/__/g; s/,/; /g; s/;$//' |    # find-and-replace
  sed 's/\(.__\).__[^;]*/\1/g' > combined_usearch.txt      # clean unknown tax levels
The taxonomy one is extra tricky.
It seems that when a taxonomy level is unknown, the last known level is listed in its place: f:Sebacinaceae,g:f__Sebacinaceae,s:g__f__Sebacinaceae;
The convention in QIIME is to leave these levels blank: f__Sebacinaceae; g__; s__;
That last sed command should create QIIME 2 style tax levels, even for odd taxa like f__Thelephoraceae; g__; s__Thelephoraceae_sp;
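For anyone who prefers Python, the same taxonomy clean-up could be sketched roughly as below. This is not a drop-in replacement for the sed pipeline. The function name usearch_header_to_qiime is made up, and the header layout (ID, then |, then annotations ending in =rank:name,rank:name,...;) is an assumption inferred from the sed patterns and the examples above, not from the full sample files.

```python
import re


def usearch_header_to_qiime(header):
    """Convert one USEARCH-style FASTA header into (feature_id, taxonomy).

    Hypothetical helper. Assumed layout: '>ID|...=rank:name,rank:name,...;'
    """
    header = header.strip().lstrip(">")
    feature_id, _, rest = header.partition("|")
    # Everything after the last '=' is treated as the taxonomy string
    tax = rest.rpartition("=")[2].rstrip(";")
    levels = []
    for level in tax.split(","):
        if not level:
            continue
        rank, _, name = level.partition(":")
        # A placeholder repeats an earlier rank (e.g. 'g:f__Sebacinaceae');
        # blank it so the level becomes 'g__'
        if re.match(r"[a-z]__", name):
            name = ""
        levels.append(f"{rank}__{name}")
    return feature_id, "; ".join(levels)


# Example header assembled from the names and taxonomy quoted in this thread
example = ">KF410664|SH1140861.08FU;tax=f:Sebacinaceae,g:f__Sebacinaceae,s:g__f__Sebacinaceae;"
feature_id, taxonomy = usearch_header_to_qiime(example)
print(f"{feature_id}\t{taxonomy}")  # KF410664, a tab, then 'f__Sebacinaceae; g__; s__'
```

Headers without an = sign, or with taxonomy continued on the next line, would need extra handling that this sketch does not attempt.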
I've enjoyed working on my Unix commands!
Try these out and let me know how they work,
Colin
P.S. And because I'm posting code, I might as well post a license to use it: LICENSE.txt (1.5 KB)