Original formatting of BOLD data

Hi, what format is your originally downloaded database in? Because when I run the script:

cat *.seqNtaxa.csv |
grep -v 'sequenceID,taxon,nucleotides' |
awk -F ',' '{OFS="\t"};{print $1";tax="$2, $3}' |
sed 's/^/>/g' |
tr '\t' '\n' > bold_allrawSeqs.fasta

The output looks like this:

Which I don't think is the information that this is trying to generate... I couldn't use the R script to download the database, so I used the BOLD interface which worked fine but I don't know what information this pipeline needs so that I can continue.

Any advice would be helpful!

Thanks,
Kat

Hi @klunn,
I believe the input file (seqNtaxa.csv) contains metadata and sequence data in a comma-separated structure. The script you've highlighted in this question is attempting to convert that original format into a .fasta file, the format typically used to hold sequence information. We'd expect a .fasta file to have lines beginning with > that hold metadata, and lines without it that hold sequence data:

>metadata_information_1
ATATCGCG
>metadata_information_2
CCAATTGG

The output file you are showing follows the .fasta format, but the text that has been converted from .csv to .fasta looks wrong. Lines like L#08BANFF-001 are where we would expect sequence information to be. Instead, it looks like you have metadata where the sequence data should be.
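One rough way to confirm this from the command line is to count the non-header lines that contain anything other than nucleotide characters. This is only a sketch: count_bad_sequence_lines is a hypothetical helper name, and the allowed character set (IUPAC nucleotide codes plus the gap character -) is an assumption about what your sequences may contain.

```shell
# count_bad_sequence_lines is a hypothetical helper, not part of the pipeline.
# It drops the ">" header lines, then counts the remaining lines that contain
# any character outside the assumed nucleotide alphabet (IUPAC codes and "-").
count_bad_sequence_lines() {
  grep -v '^>' "$1" |
    grep -c -v -E '^[ACGTURYSWKMBDHVNacgturyswkmbdhvn-]*$' || true
}

# Usage (file name taken from the script in the question):
# count_bad_sequence_lines bold_allrawSeqs.fasta
```

A count of 0 suggests every sequence line looks like nucleotides. Anything higher means some lines, such as L#08BANFF-001, are metadata sitting where sequence should be.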
Without access to your specific file, my suggestion would be to look at the .csv directly using something like Microsoft Excel, then find out which columns contain the sequence ID, the taxonomic information, and the nucleotide sequences. The awk step in the script reads these from columns 1, 2, and 3, in that order. If you do not see those data, then the problem is not with the script itself but with the input file.
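If you would rather stay on the command line than open Excel, something like the following could list the columns and preview the three fields the script reads. The helper names list_columns and preview_fields are made up for illustration, and the sketch assumes a plain comma-separated file with no quoted commas inside fields.

```shell
# list_columns (hypothetical helper) prints each header field of a
# comma-separated file next to its column number.
list_columns() {
  head -n 1 "$1" | tr ',' '\n' | awk '{print NR": "$0}'
}

# preview_fields (hypothetical helper) shows the first three data rows for
# columns 1-3, which the script treats as sequence ID, taxon, and nucleotides.
preview_fields() {
  awk -F ',' 'NR > 1 && NR <= 4 {print "id=" $1 " | taxon=" $2 " | seq=" $3}' "$1"
}

# Usage (replace the placeholder with one of your own *.seqNtaxa.csv files):
# list_columns your_file.seqNtaxa.csv
# preview_fields your_file.seqNtaxa.csv
```

If the header does not read sequenceID, taxon, nucleotides in columns 1 to 3, the field numbers in the awk command would need to change to match your download.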
Hope that helps

Hi @devonorourke , yes this was very helpful! Thank you :slight_smile:
