12S database using RESCRIPt - Australian taxa only?

Hey all, I'm working on building a 12S database using RESCRIPt, and I appreciate all the help out there. I was wondering if there's a way to download only sequences from Australian taxa from NCBI? That would reduce the size of the initial download and also make my reference library more specific.
I've been looking at this tutorial, which I found on one of the forum posts here. Is there perhaps a way I can modify the first piece of code to include only sequences from Australia?
Thanks!

The tutorial I've been following:


Hey Devon, thanks for the tutorial. When using the BOLD Data Pull script in R, is there a way I can download Australian taxa only? I can do that when I go directly to BOLD's website, but downloading that way hasn't worked out for me because the files end up with heaps of additional data in them. I'd like to try using the script, but I don't want or need to download everything, just Australian taxa. Thanks!


Hi @Tessa_M ,

Great question!

Maybe yes, but probably not in a simple way. The first block of code is specifying which sequences to download, using an entrez query. The txid33208 part is the key part, specifying that all metazoa sequences should be downloaded that fit the following criteria (12S gene, etc). txid33208 is the NCBI taxonomy ID (taxid) for metazoa.

I am not sure if it is possible to use a location keyword with Entrez — probably not — and even if you could, it would require the accessions to have that metadata entered... and species with a broader distribution (e.g., invasive species in Australia!) might not be included.

So I think it would be quite complicated to modify this code block to get only sequences from species found in Australia. You would need to make an exhaustive list of all species (and/or clades) found there, look up the NCBI taxid for each, and use those taxids instead of txid33208.
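To make the taxid-swapping idea concrete, here is a minimal sketch in R of how such a query string could be assembled. `build_entrez_query` is a hypothetical helper, not part of RESCRIPt or the tutorial, and the `12S[All Fields]` gene term is only a placeholder for whatever criteria the tutorial's query actually uses. The two taxids in the example are Mammalia (40674) and Aves (8782), chosen purely for illustration.

```r
# Hypothetical helper: join several NCBI taxids into one Entrez query string.
# Taxids are passed as character strings so large IDs are never printed in
# scientific notation.
build_entrez_query <- function(taxids, gene_terms = "12S[All Fields]") {
  stopifnot(length(taxids) > 0)
  # One "txidNNNN[ORGN]" clause per taxon, joined with OR
  taxon_clause <- paste0("txid", taxids, "[ORGN]", collapse = " OR ")
  # Any of the taxa, AND the (placeholder) gene criteria
  paste0("(", taxon_clause, ") AND ", gene_terms)
}

# Example: Mammalia (40674) and Aves (8782) in place of Metazoa (33208)
query <- build_entrez_query(c("40674", "8782"))
print(query)
```

The resulting string would then replace the txid33208 part of the original query; the hard part remains compiling the list of taxids in the first place.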

Something else you could try is to follow the instructions in that tutorial to download a 12S sequence database, and then use taxonomic weights to tell the classifier which species are more likely to be found in Australia (and downweight the others). This would also be very complicated... you would need to create a feature table with all taxa found in your 12S database and the probability of finding each of them. Here is a tutorial describing how to do this with 16S (for which the process can be automated, because the class weights are based on existing observations), but for your use case you would need to assemble the class weights manually (as I assume that observation frequency data are lacking for 12S in Australia):

So unfortunately creating an Australia-only database will be heaps of work one way or another, as it will require manual curation of either the sequences or the class weights. If you do not have the time for this, just use the complete 12S database and then manually curate the assignments if you see any hits to species that should not be detected in Australia. This would be a faster but less specific approach.

Good luck!


Hi @Tessa_M ,
I think you are looking for two changes to the original example R script:

  1. You want to select 12S sequences, whereas the original script obtained COI
  2. You want to restrict to a geographic range, whereas the original script did no such filtering

Assuming that is the case, I think you would modify a single line in the R script, such that:

  1. We modify the marker gene of interest from the original COI to the desired 12S
  2. We add in a filter for a region of interest.
## filter bold data function:
library(dplyr)  # provides %>%, filter() and select()

gatherBOLDdat_function <- function(theboldlist){
  do.call(rbind.data.frame, theboldlist) %>%
  # keep only 12S records from Australia; comparisons need `==`, not `=`
  filter(markercode == "12S", country == "Australia") %>%
  select(sequenceID, processid, bin_uri, genbank_accession, nucleotides, country, institution_storing,
         phylum_name, class_name, order_name, family_name, genus_name, species_name)
}
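
As a quick sanity check of how the two filter conditions combine, here is a small sketch on a made-up data frame (the `toy` records are invented for illustration and are not BOLD data). It assumes dplyr is installed. Conditions separated by a comma inside `filter()` are combined with AND, each comparison needs `==`, and rows where `country` is missing are dropped.

```r
library(dplyr)

# Made-up records, only to illustrate the filter; not real BOLD data
toy <- data.frame(
  processid  = c("A1", "A2", "A3", "A4"),
  markercode = c("12S", "COI-5P", "12S", "12S"),
  country    = c("Australia", "Australia", "Brazil", NA),
  stringsAsFactors = FALSE
)

# Both conditions must hold; a row with an NA country does not pass
kept <- toy %>% filter(markercode == "12S", country == "Australia")
print(kept$processid)
```

Writing `country = "Australia"` with a single `=` makes dplyr stop with an error about a named input, so it is worth double-checking that both comparisons use `==`.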

I tested on the BOLD website whether the Australia [geo] parameter would yield information, and it certainly does (over 400k records across all markers). However, I came up empty in my initial querying for a 12S record via the website. Not sure what I was missing when trying to apply the [marker] parameter in the BOLD website search field... Nevertheless, following their API command, I was able to download all the sequences directly using this command in my web browser:

http://v4.boldsystems.org/index.php/API_Public/sequence?geo=Australia&marker=12s

... and it returned just 84 sequences :person_shrugging: ? Not sure if that's what you expected.

And actually, it's worse than you think! There are actually only eighteen sequences properly labeled with a 12S marker in the header of the resulting fasta file. As the BOLD API documentation indicates, your search results may need further refinement:

All markers for a specimen matching the search string will be returned.
ie. A record with COI-5P and ITS will return sequence data for both markers even if only COI-5P was specified.

Adding a geographic condition carries a risk: it only works if whoever submitted the record actually applied that label. At least from my cursory investigation in BOLD, the geographic-specificity requirement for Australia will generate loads of COI sequences, but only a paltry few for 12S.

Given there are so few of these sequences, I'm going to paste them below, but you should be able to reproduce this just by pasting the http://v4.bolds... link I posted above, then examining the resulting full set of 84 sequences, and figuring out which sequences you want to retain after the fact.
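
If you would rather not pick through the download by hand, here is a rough base-R sketch of that after-the-fact filtering. It assumes the headers look like the ones pasted below, i.e. `>processid|species|marker` with an optional fourth accession field. `filter_fasta_by_marker` is a hypothetical helper, and the COI-5P record in the example is invented to show a non-matching entry.

```r
# Hypothetical helper: keep only FASTA records whose header carries the
# wanted marker in its third "|"-separated field.
filter_fasta_by_marker <- function(lines, marker = "12S") {
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)  # which record each line belongs to
  fields <- strsplit(sub("^>", "", lines[is_header]), "|", fixed = TRUE)
  keep_record <- vapply(
    fields,
    function(f) length(f) >= 3 && f[3] == marker,
    logical(1)
  )
  # Keep header and sequence lines of matching records; lines before the
  # first header (record_id == 0) are dropped
  lines[record_id > 0 & keep_record[pmax(record_id, 1)]]
}

example_lines <- c(
  ">CONO1221-12|Hastulopsis amoena|12S|JQ808580",
  "ACACGTTTCAGAGCC",
  ">EXAMPLE001-00|Hypothetical species|COI-5P",  # invented record
  "ACGTACGT",
  ">DIQT046-08|Telostylinus lineolatus|12S",
  "ACATATTTTAGAGC"
)
kept <- filter_fasta_by_marker(example_lines)
print(kept)
```

On a real download you would read the file with `readLines()` and write the result back out with `writeLines()`; the file names are up to you.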

Hope that gets you started - good luck!

>CONO1221-12|Hastulopsis amoena|12S|JQ808580
ACACGTTTCAGAGCCTAATTCAAATATTTATATATTCTAATTTACTTCCAAGTCCTCCTTATAACTTAACATACATCCATTATTTATCCGTCATTATAAATTATATAATTGTAACCCATCCTCCCCCCTTTATTAGCTGCACCTTGATTTGACATACTAAATTATCAATATTTTAATTGCTAACTTCTAGTTTCTAAAAAGTTCCCTGACGACAACGGTATACAAACTGAAAACAAAAAGAGGTCAGGTGCAACGTGGATTATCGATTATGAGACAGGTTCCCCTGGGTGGTCTAAAACACCGCCAAGTTCTTTGAGTTTTAAATTTTTAAACATTCAT
>CONO2426-19|Hastula brazieri|12S|MK586899
ACACGTTTCAGAGCCTTATTCAAATTATTTATATAACCTAATTTACTTTTAAGTCCGCCTTATAACTAATATACATTTCATATATTTATCCGTCATTATAACTTACATAATTGTAACCCATCCTCCCCCTTTCATTAGCTGCACCTTGATTTGACGTATTAAATCATTTCTATTTCCTATTGCTAACTCCTATTTTCTAAAAAGTTACCTGACGACAACGGTATACAAACTGAAAACAAGAAAAGGTCAGGTATAACGTGGATTATCGATTATGAGACAGGTTCCCCTAAGTGGTCTAAAACACCGCCAAGTCCTTTGAGTTTTAAATTTTTAGTATTCATAGTACTCNGGTAA
>CONO2427-19|Hastula brazieri|12S|MK586709
ACACGTTTCAGAGCCTTATTCAAATTATTTATATAACCTAATTTACTTTTAAGTCCGCCTTATAACTAATATACATTTCATATATTTATCCGTCATTATAACTTACATAATTGTAACCCATCCTCCCCCTTTCATTAGCTGCACCTTGATTTGACGTATTAAATCATTTCTATTTCCTATTGCTAACTCCTATTTTCTAAAAAGTTACCTGACGACAACGGTATACAAACTGAAAACAAGAAAAGGTCAGGTATAACGTGGATTATCGATTATGAGACAGGTTCCCCTAAGTGGTCTAAAACACCGCCAAGTCCTTTGAGTTTTAAATTTTTAGTATTCATAGTACTCNGGTAA
>CONO2428-19|Hastula brazieri|12S|MK586937
ACACGTTTCAGAGCCTTATTCAAATTATTTATATAACCTAATTTACTTTTAAGTCCGCCTTATAACTAATATACATTTCATATATTTATCCGTCATTATAACTTACATAATTGTAACCCATCCTCCCCCTTTCATTAGCTGCACCTTGATTTGACGTATTAAATCATTTCTATTTCCTATTGCTAACTCCTATTTTCTAAAAAGTTACCTGACGACAACGGTATACAAACTGAAAACAAGAAAAGGTCAGGTATAGCGTGGATTATCGATTATGAGACAGGTTCCCCTAAGTGGTCTAAAACACCGCCAAGTCCTTTGAGTTTTAAATTTTTAGTATTCATAGTACTCNGGTAA
>CONO2429-19|Hastula brazieri|12S|MK586930
ACACGTTTCAGAGCCTTATTCAAATTATTTATATAACCTAATTTACTTTTAAGTCCGCCTTATAACTAATATACATTTCATATATTTATCCGTCATTATAACTTACATAATTGTAACCCATCCTCCCCCTTTCATTAGCTGCACCTTGATTTGACGTATTAAATCATTTCTATTTCCTATTGCTAACTCCTATTTTCTAAAAAGTTACCTGACGACAACGGTATACAAACTGAAAACAAGAAAAGGTCAGGTATAACGTGGATTATCGATTATGAGACAGGTTCCCCTAAGTGGTCTAAAACACCGCCAAGTCCTTTGAGTTTTAAATTTTTAGTATTCATAGTACTCNGGTAA
>CONO2430-19|Hastula brazieri|12S|MK586848
ACACGTTTCAGAGCCTTATTCAAATTATTTATATAACCTAATTTACTTTTAAGTCCGCCTTATAACTAATATACATTTCATATATTTATCCGTCATTATAACTTATATAATTGTAACCCATCCTCCCCCTTTCATTAGCTGCACCTTGATTTGACGTATTAAATCATTTCTATTTCCTATTGCTAACTCCTATTTTCTAAAAAGTTACCTGACGACAACGGTATACAAACTGAAAACAAGAAAAGGTCAGGTATAACGTGGATTATCGATTATGAGACAGGTTCCCCTAAGTGGTCTAAAACACCGCCAAGTCCTTTGAGTTTTAAATTTTTAGTATTCATAGTACTCTGGTAA
>CONO2431-19|Hastula brazieri|12S|MK586861
ACACGTTTCAGAGCCTTATTCAAATTATTTATATAACCTAATTTACTTTTAAGTCCGCCTTATAACTAATATACATTTCATATATTTATCCGTCATTATAACTTATATAATTGTAACCCATCCTCCCCCTTTCATTAGCTGCACCTTGATTTGACGTATTAAATCATTTCTATTTCCTATTGCTAACTCCTATTTTCTAAAAAGTTACCTGACGACAACGGTATACAAACTGAAAACAAGAAAAGGTCAGGTATAACGTGGATTATCGATTATGAGACAGGTTCCCCTAAGTGGTCTAAAACACCGCCAAGTCCTTTGAGTTTTAAATTTTTAGTATTCATAGTACTCTGGTAA
>DIQT046-08|Telostylinus lineolatus|12S
ACATATTTTAGAGCTAAAATCAAAATATTTATCTTTATATTTTTACTATCAAATCCACTTTCAATAAATTTTTCATATTTATATTCATATAAATAATTTTATTGTAACCCATTTTTACTTAAACATAAACTACACCTTGATCTGATATAAAATTAAATATAAATTAACGAAAATTATTATTCTTATAAAATATTCTTATAACGACGGTATATAAATTGAAATACAAATTTAAGTAAGGTCCATCGTGGATTATCGATTAAAAAACAGGTTCCTCTGAATAGACTAAAATACCGCCAAATTTTTTAAGTTTCAAGAACATAACTAATACTACTTATATGTTTAAAAATACATTTTTAATAATAGGGTATCTAATCCTAGTTTTAAATAAAAATTTTTTAACTTCAATTAATAATATAAAAAATTATATTTAATTAAAATTTCACCTAATAATTAAACTTTAATTTTTATAAAAATAAATTTAATTAACATAAAAAAATTTTATTTGTGTTATTCGTATAACCGCG------------------
>DIQT089-08|Telostylinus lineolatus|12S
ACATATTTTAGAGCTAAAATCAAAATATTTATCTTTATATTTTTACTATCAAATCCACTTTCAATAAATTTTTCATATTTATATTCATATAAATAATTTTATTGTAACCCATTTTTACTTAAACATAAACTACACCTTGATCTGATATAAAATTAAATATAAATTAACGAAAATTATTATTCTTATAAAATATTCTTATAACGACGGTATATAAATTGAAATACAAATTTAAGTAAGGTCCATCGTGGATTATCGATTAAAAAACAGGTTCCTCTGAATAGACTAAAATACCGCCAAATTTTTTAAGTTTCAAGAACATAACTAATACTACTTATATGTTTAAAAATACATTTTTAATAATAGGGTATCTAATCCTAGTTTTAAATAAAAATTTTTTAACTTCAATTAATAATATAAAAAATTATATTTAATTAAAATTTCACCTAATAATTAAACTTTAATTTTTATAAAAATAAATTTAATTAACATAAAAAAATTTTATTTGTGTTATTCGTATAACCGCG------------------
>GMSPB615-18|Metopochetus impar|12S
ACATATTTTAGAGCTATAGTCAAATCATTAATCTATATAATTTTACTACCAAATCCATTTTCAATAAATTTTGCATATTTAAATCCACATAAATAATTTTATTGTAACCCATTTACACTTAAACATAAGCTACACCTTGATCTGATATACATTTTAATAAAAATATTAGAAAATTATTATTCTGATAAAATATTCTGATAACGACGGTATATAAACTGAAAACATATTTAAGAAAGGTCCATCGTGGATTATCGATTAAGAAACAGGTTCCTCTGAATAGACTAAAATACCGCCAAATTTTTTAAGTTTCAAGAACATAACTAATACTACCTTAGTAAATTAATACATTTTAAATAATAGGGTATCTAATCCTAGTTTATAATTAAAATTTCCAAGCTTCAATAAATTTAATTAATAAATTAAATAAATTTAAAATTTCACCTAATAAATTTATACTATATTTAAATTTCAATCATTTAACTCTTACCAATAAAATTTATTCGTATTATTCGTCTAACCGCG--------------------
>NEOGA1340-19|Cystiscus sp.|12S|MN322357
TACCAGAGGGTCAAATTATAGATTATAGGTAAGTGAAGTTATTAAGATGATTATTGGGACTTTTTAAGAAAAGGTGAAATTTAATTAAAGGATTASTTCTTAAGGATAASATATATGAATTCACGAAATCTATTGGAAAAACTGGGATTAGATACCCCATTATAGTAGACGTAAATATATTACTAGAGTACTACGAATAGAATTTAAAACTCAAAGAACTTGGCGGTGTCTTAGACTACCTAGGGGAACCTGTTTTGTAAGCGATAATCCACGTTGGATCTTACCTTCCCTGGTAATCAGTATGTATACCGTTGTCGTCAGGCAACTCTTGAGGATTGAAAAGTTGGCGACTTAAAAGTTAACTTATAATGTCAAATCAAGGTGCAGCTTATGGGGGGGTAGAAATGGGTTACATTAATAATATTATAGTGGAATTAGCTCTGAAATAGGCTATAGGAAATAGGACTTGGAAGTAAAGAGGGATATGTGAATGGTTTGAATATAGCTCTGGGACGTGT
>NEOGA1370-19|Hydroginella sp.|12S|MN322393
AAAACTATTTGATCAAGATATATTTGAGGGCAGTTTGTAGAAAAATATAAATAAAAATGTATATATGAATTTAATATATGTAAGTAAAATTGTATATATTAGGGAATTAAATAAAATACATGCTATTAAATCTACGAAAAATAAGGTAGAAACTAGGATTAGATACCCTGTTATTCTTATTCATAAAACTTCATATGCTGGGAGACTACGAGTGTTTAACTTAAAATCTAAAAGACTTGGCGGTATTTAAAACTTCTTAGGGGAGCTTGTTTCGTAATCGATAATCCACGTACTACCTGACTTATTTTATTAGCTTGTATATCGTCATCTTTAGTTAACTTCWWAGAAAAATAAAGTTAACGGAATAATTTACTTAAATTAATATGTTAGATCAAGATGCAGTTTATAAATAAGGGAAAATGAGCTACAATTATTATATTTATATTTCTAATAAATAAATTAAAATTTTTTTGAGGGAGGACTTAACAGTAAAATATTATATAAGAAATATCTTGAATATCTATTTTAAATATGC
>NEOGA1532-20|Dolicholatirus sp.|12S|MW057455
GTTAAACCAAGGGATTAAATTATATAAATACATGGCCTAAAAGACAGTTAGGTTTGTTTTTGATTTCGTGTTCATTCGTAAAAAGGTAAAATTTGAATACGAATTGTAAAATCGAGTGTAGTCAATTTACTGAAGCTGTGACAATCTAGAGGGAAACTGGGATTAGAGACCCCATTATTCTTGATTTTAAAGTTGATATAATGTATGCCAGAGCACTACGAACAAAATAGTTTAAAACTCAAAGGGCTTGGCGGTGTCTTAGACCTTTTAGGGGAACCTGTTTCATAATCGATAATCCACGTTAAACCTGACCTCCTTTTGCACTCAGTCTGTATACCGTCGTCGCCAGGTAACTTTCAAAAAACTAGAAGTTAGCTAGAAAATTATATAGATTAGAACGTCAGATTAAGGTGTAGCTAACAAGGAGGAGAAAATGGGTTACAATTATATATTTATAATTACAGACTATTATTTGAAACAATAAATATGAAGGAGGACTTAAAAGTAAAGCATAAATTATATAAGTAGCCTGAATAAGGCTCTGAGACGTGC
>NEOGA1556-20|Tasmeuthria clarkei|12S|MW057469
GTTAGACCAAGAGATTAAGTTATATTCCTAGGTAAAAAGACAGTTAGGTTTAAAATAATTTAGTTTATTGATCATTTATATAAAAGTAAAATTTATATATAAATAGTTAATTTAATCGTAGCTTTTTACTGAGGCTGTGACAGTCCTGAGGGAAACTGGGATTAGATACCCCATTATTCTTGACTGTAAATCTAATTAAATTTACCAGAGTACTATGAATCTAAATAAAAATTTAAAACTCAAAGAGCTTGGCGGTGTTTTAGACCTATTAGGGGAACCTGTCTCATAATCGACAATCCGCGCTAGACCTAACCCTGTTTTGTAACCAGTTTGTATACCGTCGTCGTCAGGTAACTTTTAAAAATTAAGAAGTTAGCAACAATAATTTTTAAATTTAAACGTCAGATCAAGGTGCAGCTAATAAAAGGGAGAAGATGGGTTACAATTATTTCACTTATAGCTACGAAAAATTTTATGAAAATATAATTAGAAGGAGGACTTGAAAGTAAAATATAATATATAAGCAATTTGAATATGGCTCTGAAACGTGC
>NSWHP4284-19|Pseudopomyzidae|12S
ACATATTTTAGAGCTAAAGTCAAACTATTAATCTTTATAGTTTTACTACCAAATCCACTTTCAGTACATTTTTCATAATTACATCCATTTAAATAATTTTATTGTAATCCATTTCTACTTAAACATAAACTACACCTTGATCTGATATATAATTTAATAAAATTTTTTGAAAATTATTTTTCTTATAAAATATTCTAATAACGACGGTATATAAATTGAAAAACAAATTTAAGTAAGGTCCAACGTGGATTATCGATTACAGAACAGATTCCTCTGAATAGACTAAAATACCGCCAAATTTTTTAAGTTTCAAGAACATATCTATTACTACCTAAGTAACTTGTATTTACATTTTTAATAATAGGGTATCTAATCCTAGTTTTTTATAAAAATTTTTAAGCTTCAATAAATTTAACTATAAAAATTATATAATTTTAAAATTTCACCTAATAAAATTAATTTAATTTTAAAATATACAATTTAACTTTTACTAAAAAAATTTATTTGCATTATTCGTATAACCGCG----------------
>TONO230-18|Akibumia orientalis|12S|MH571233
AGAGATCAAGTTATATTTGTTAAGGTAAAAAGGTAGTTAGATACAAGTGTTTATTAGTTTACTAATTTTTTATATAAAAGTAAAATTTGTATATAAATAAATAACTTAGGGTAAACTAATTATATTGATGCTGCGATAGCTTTAAGGGAAACTGGGATTAGATACCCCATTATTTTTAGTTGTAAATAAATAAGAATTTACCGGAGTACTATGAATTTTTTAAAAATTTAAAACTCAAAGGACTTGGCGGTGTTTTAGACCTCTCAGGGGAACCTGTCTCGTAATCGACAATCCGCGTTAAACCTAACCTTTTTTTGCATCTCAGTTTGTATACCGTCGTCGTCAGGTAACTTTTTAAAAATTAGAAGTTGGCAATAAAATTAATATTAATTTAAACGTCAGATCAAGGTGCAGCTAATATAAAGGTGAGGATGGGTTACAATTAAAATTTATAATTACGGATATAATAATGAAATATTTATTTTAATGAAGGAGGACTTGAAAGTAAGATAATTATATAAAAATAATTTGAATTAGGCTCTGAAACATGC
>ZSMDB056-15|Lancetes lanceolatus|12S|KT607937
-----TTTAAATGT-AAAAAAAAATATCAAATTATTATTAGTTAAGTTCTTTAAATTTAAAAATTTTGGCGGTATTTTAGTCTATTCAGAGGAACCTGTTCTGTAATTGATAATCCACGATTAATTATACTTATTTT----TTTAATTTGTATATCGTTGTTTATAAATAATTTTATAAGAA-AATAAATTTTTAAGATTTTAGATAAAAAAATATATCAAATCAAGGTGCAGTTTATAGATAAGGA--GAAATGGGTTACAAT-AAATTTATTTAAA--CGGATTAATTTTTAAAATA--AGATTATAAAGGTGGATTTGATAGTAAT--TAAATTAATTTTAATTTAATGA-TTTTAGCTCTAAAATATGT
>ZSMDB099-16|Rhantus simulans|12S
TTTAAATGTAAATTATTATACTAAAGTAGTAATAGTTAAGTTCTTTAAATTTAAAGATTTTGGCGGTATTTTAGTCTATTCAGAGGAACCTGTTCTGTAATTGATAGTCCACGATTAATTTTACTTAATTTAATAATTTGTATATCGTCGTTTATAAATGATTTTAAAAGAATTTAAATTTTTAAGATTTTTTATTAAAAAATATATCAGATCAAGGTGCAGTTAATGATTAAGGAGAAATGGGTTACAATAAATTTATTTATATGGATTAATATATAAAATTGTATTATGAAGGAGGATTTGATAGTAATAAAATTAAATTAAATTTTATGATTTTAGCTCTAAAATATGT

Thanks, Devon. Only 84 (18?) sequences seems like very few. But I do want to try this for COI too, and perhaps there'll be more.

A different question: when building a database for COI, is there any detriment to downloading only arthropod sequences in the first place, rather than following your COI database tutorial all the way through, downloading all sequences, and then cutting some out to match ANML?

For context, I'm doing a dietary metabarcoding study. My species is omnivorous, so I've used 12S-V5, ANML, and trnL to amplify vertebrates, invertebrates/arthropods, and plants, respectively. Hence I'm trying to build a few databases.

Thanks for your thoughtful reply, Nicholas. It sounds like it might be beyond my time and coding capabilities to try to tailor the download to Australian taxa. But the thing you mention at the end, "manually curate the assignments"... do you have more information about how to do that? I have made a classifier using all NCBI 12S sequences and apparently my Australian 35-gram omnivorous species is consuming pumas...

Hey Nicholas, I've had a play around and I think I can search for Australian taxa, or at least Australian-based research, in Entrez with Australia[Text Word]. There are plenty of entries that match, and most, if not all, are Australian taxa. I might have to manually add a few in, e.g., Mus musculus, but otherwise this is looking a lot more useful for me.
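
For anyone trying the same thing, here is a small sketch in R of how extra species could be OR-ed onto the Australia[Text Word] term. `add_species_to_query` is a hypothetical helper, and Rattus rattus is added only as a second illustrative species. The gene criteria from the tutorial's query would still need to be AND-ed onto the result.

```r
# Hypothetical helper: OR a few named species onto a base Entrez term.
add_species_to_query <- function(base_query, extra_species) {
  if (length(extra_species) == 0) {
    return(base_query)
  }
  # Quote each binomial and tag it with the Entrez organism field
  species_clause <- paste0('"', extra_species, '"[ORGN]', collapse = " OR ")
  paste0("(", base_query, " OR ", species_clause, ")")
}

query <- add_species_to_query(
  "Australia[Text Word]",
  c("Mus musculus", "Rattus rattus")  # second species is illustrative only
)
print(query)
```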

But I'm still interested in manually curating the assignments, if you can give me some more info. Thanks!


Oh my that's amazing! Maybe you have accidentally collected stool from a drop bear?

Suspicious classifications can be checked by using BLAST and checking alignments etc.
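
One way to decide which classifications to send to BLAST is to compare them against a list of taxa you expect in your study area. Here is a rough base-R sketch of that step. `flag_unexpected`, the column names, and the tiny example table are all made up for illustration, and a real taxonomy export would need its genus column parsed out first.

```r
# Hypothetical helper: mark assignments whose genus is not on the list of
# genera expected in the study area, so they can be checked with BLAST.
flag_unexpected <- function(assignments, expected_genera) {
  assignments$needs_review <- !(assignments$genus %in% expected_genera)
  assignments
}

# Made-up assignments, for illustration only
assignments <- data.frame(
  feature_id = c("f1", "f2", "f3"),
  genus      = c("Puma", "Mus", "Macropus"),
  stringsAsFactors = FALSE
)
expected <- c("Mus", "Macropus")

flagged <- flag_unexpected(assignments, expected)
print(flagged$feature_id[flagged$needs_review])
```

Flagged features are only candidates for a closer look; an unexpected hit could still be real, e.g. an invasive species missing from the expected list.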

Great to hear that this pulls up many relevant hits! Though I worry about what else you might miss when doing this; invasive species, for example, might not be included. But for sure this will speed up the process of compiling a relevant reference database!

Good luck!
