Building a COI database from NCBI references

:construction: :stop_sign::construction: :stop_sign::construction: :stop_sign::construction: :stop_sign::construction: :stop_sign:

A proper software announcement is in progress. For now, if you use these COI resources in your research, please contact me directly so I can give you the appropriate citation: [email protected]

Preamble

I’ve been building various COI databases for my own research for the past few years using reference sequences from the Barcode of Life Data system (BOLD). My earlier attempts used resources outside of a QIIME environment to generate these databases: see one such paper using these methods here and the associated GitHub repo describing how these databases were constructed. Because I used QIIME for everything following that database construction (sequence processing, classification, diversity analyses, etc.), it was wonderful news when I heard about a series of tools being developed to build databases directly within QIIME - thanks RESCRIPt developers!. As a result, I recently wrote a tutorial in this forum describing how to build a BOLD COI database within QIIME. Nevertheless, the initial acquisition of BOLD references, plus several pre-processing steps, required steps outside of QIIME.

Fortunately, those savvy RESCRIPt developers offered another way to gather COI sequences - this time from NCBI GenBank instead of BOLD. Even better, they laid out a comprehensive tutorial on how to do it for a variety of data types. All that was missing on my end was applying their roadmap to get COI-specific data.

This tutorial is therefore a brief guide for users hoping to build a COI database from NCBI references instead of BOLD. As with the BOLD COI tutorial, I’ve also added some info on how you could uses these data to build primer-coordinate-specific databases.

Gathering COI data from NCBI

Gathering NCBI data using RESCRIPt is as easy as typing the get-ncbi-data function. However, the not-so-easy part of gathering NCBI data is that you need to consider what kind of filtering parameters you want to use. Fortunately, much more experienced users have suggested a series of parameters which were used in my data gathering tasks. Notably, I grouped COI data into two file sets because I wanted to see if there were any difference between data from NCBI GenBank that contained a term that indicates they are cross-referenced with BOLD references: "BARCODE"[KYWD]. Thus, when I was gathering data, my two search terms differed using a modified boolean operator (AND vs. NOT) on that search parameter:

## gather BOLD-tagged COI data from NCBI:
qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND "BARCODE"[KYWD])' \
--output-dir NCBIdata_BOLDonly

## gather nonBOLD-tagged COI data fron NCBI:
qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT "BARCODE"[KYWD])' \
--output-dir NCBIdata_notBOLD

It’s worth considering additional filtering parameters for your own research. Additional terms may include things like:

  • AND mitochondrion[Filter]
  • NOT environmental sample[Title]
  • NOT environmental samples[Title]
  • NOT environmental[Title] -
  • NOT uncultured[Title]
  • NOT unclassified[Title]
  • NOT unidentified[Title]
  • NOT unverified[Title]

Short cut worth considering

One last thought in this initial data gathering step - notice I didn’t specify any phylogenetic group? Thus, while the data I ultimately was analyzing contained only particular animal taxa, this initial raw data contains COI references for lots of different organisms - here’s the Kingdom/Domain names listed in just the NCBIdata_notBOLD raw taxa:

k__Archaea;
k__Bacteria;
k__Bamfordvirae;
k__Environmental
k__Eukaryota;
k__Fungi;
k__Metazoa;
k__Orthornavirae;
k__Pararnavirae;
k__Viridiplantae;
k__Viruses;

I gathered all the raw data for a few different reasons - partly just to see what the heck was going to get pulled in from NCBI if I didn’t specify certain taxa. Viral COI… who knew!? Nevertheless, if you have a specific taxonomic group in mind, you can save yourself a lot of time and disk space by specifying the groups of organisms. To do that, just look up the NCBI taxonomy ID for your group in question, and add --p-query txid####[ORGN] to the query search term (where #### represents the ID value in question).

For example, if I wanted to build an arthropod-specific COI database from NCBI sequence, I’d run this command:

qiime rescript get-ncbi-data \
--p-query 'txid6656[ORGN] AND (cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title])' \
--output-dir NCBIdata_ArthOnly

Filtering COI data from NCBI

Filtering for length and junk

I used a pair of RESCRIPt commands to filter each data set: cull-seqs to remove sequences with 5 or more degenerate nucleotides and homopolymer sequences at least 12 characters in length; filter-seqs-length to retain only sequences with (ungapped) lengths between 250 and 1600 bp. These parameters were applied to match those used in my earlier work with building a BOLD COI database. See the BOLD tutorial for an explanation on how you might want to investigate your own dataset to modify these particular parameter values.

The same parameters were applied for both NCBI data sets - I’m showing only one example here for the NCBIdata_notBOLD data.

qiime rescript cull-seqs \
    --i-sequences ./NCBIdata_notBOLD/NCBIdata_notBOLD_seq.qza \
    --p-num-degenerates 5 \
    --p-homopolymer-length 12 \
    --o-clean-sequences NCBIdata_notBOLD_ambi_hpoly_filtd_seqs.qza

qiime rescript filter-seqs-length \
    --i-sequences NCBIdata_notBOLD_ambi_hpoly_filtd_seqs.qza \
    --p-global-min 250 \
    --p-global-max 1600 \
    --o-filtered-seqs NCBIdata_notBOLD_ambi_hpoly_length_filtd_seqs.qza \
    --o-discarded-seqs NCBIdata_notBOLD_ambi_hpoly_length_discarded_seqs.qza

I ran into a challenge when filtering sequences initially because an earlier version of RESCRIPt’s get-ncbi-data had a quirk where there could be a discrepancy between the number of references with sequences and the number of references with valid taxonomy names (i.e. empty taxonomy names) - see this post describing the problem. I was able to resolve this problem by using the taxonomy file itself as the input “metadata” file in a qiime feature-table filter-seqs command; suffice it to say, this is not a problem that will be encountered now, as the developer of the tool has resolved the issue.

Filtering for redundancy and possible ambiguity (dereplicating)

Each filtered data set was then dereplicated using RESCRIPt (showing notBOLD data only here as an example):

qiime rescript dereplicate --verbose \
  --i-sequences NCBIdata_notBOLD_ambi_hpoly_length_filtd_seqs.qza \
  --i-taxa ./NCBIdata_notBOLD/NCBIdata_notBOLD_taxa.qza \
  --p-mode 'super' \
  --p-derep-prefix \
  --p-rank-handles 'silva' \
  --o-dereplicated-sequences NCBIdata_notBOLD_derep1_seqs.qza \
  --o-dereplicated-taxa NCBIdata_notBOLD_derep1_taxa.qza

Further refining the reference sequences: creating primer-coordinate defined databases

In addition to generating these datasets, I wanted to also build databases that were trimmed using primer-coordinates specific to the amplicon region I use in my research. While the ANML primers I used in my research are specific to my amplicon data, these steps can be modified to whatever primer sequences you use in your own research. These steps were completed using functions outside of QIIME/RESCRIPt, and have already been described at length in the earlier BOLD tutorial (starting at Step 4). I show the same steps here, but include far fewer comments - see the BOLD tutorial for all the gory details. One other point - I’m showing the steps taking for an arbitrary reference collection (hence the $PFXNAME prefix) - the same steps are applied to both NCBI data set with similar commands.

  • The primer fasta file for coordinate trimming, anml_primers.fasta, is available here.
  • The Python script to trim the coordinates, , is avaiable here. This file was created by a COI wizard, Mike Robeson, and is just one part of a more comprehensive pipeline he has documented here.
  1. Export the sequence and taxonomy files
## In my particular case, "$PFXNAME" could represent either the "NCBIdata_notBOLD" or "NCBIdata_BOLDonly" groups...
qiime tools export --input-path "$PFXNAME"_derep1_seqs.qza --output-path derep1seqs_"$PFXNAME"
qiime tools export --input-path "$PFXNAME"_derep1_taxa.qza --output-path derep1taxa_"$PFXNAME"
  1. Create a subset of sequences for generating a reference alignment:
  • filter available seqs by length
seqkit seq --min-len 660 --remove-gaps --max-len 1000 -w 0 ./derep1seqs_"$PFXNAME"/dna-sequences.fasta > "$PFXNAME"_lengthFiltd_seqs.fasta
  • generate a list of only particular taxa to make initial reference alignment with fewer gaps (I’ve arbitrarily selected only those with species-named taxa from the order Lepidoptera)
grep 'p__Arthropoda; c__Insecta; o__Lepidoptera;' ./derep1taxa_"$PFXNAME"/taxonomy.tsv | \
grep -v 's__$' | grep -v 's__.*sp.$' | grep -v '-' | cut -f 1 > "$PFXNAME"_species.list
  • filter the sequences to include only those in the .list
seqkit grep --pattern-file "$PFXNAME"_species.list "$PFXNAME"_lengthFiltd_seqs.fasta > "$PFXNAME"_lengthNtaxaFiltd_seqs.fasta
  • keep only those with no ambiguous bases
seqkit fx2tab -n -i -a "$PFXNAME"_lengthNtaxaFiltd_seqs.fasta | awk '{if ($2 == "ACGT") print $1}' > $SEQDIR/"$PFXNAME"_nonAmbig_featureIDs.txt
seqkit grep --pattern-file "$PFXNAME"_nonAmbig_featureIDs.txt "$PFXNAME"_lengthNtaxaFiltd_seqs.fasta -w 0 > "$PFXNAME"_lengthNtaxaNambigFiltd_seqs.fasta
  • subsample to 2,000 sequences to build reference alignment
seqkit sample --rand-seed 101 --number 2000 --two-pass "$PFXNAME"_lengthNtaxaNambigFiltd_seqs.fasta -w 0 > "$PFXNAME"_mafftRefs_subseqs_seqs.fasta
  1. Create a reference alignment:
export MAFFT_TMPDIR=$(pwd)   ## change this path to suit your needs
mafft --auto --thread -1 "$PFXNAME"_mafftRefs_subseqs_seqs.fasta > "$PFXNAME"_reference_MSA
  1. Align the primers to the small alignment and identify the coordinate positions:
mafft --multipair --addfragments anml_primers.fasta --keeplength --thread -1 --mapout --reorder "$PFXNAME"_reference_MSA > "$PFXNAME"_primer_MSA
  1. Align the remaining sequences to the reference alignment file:
  • create a list of the sequences to drop:
grep '^>' "$PFXNAME"_mafftRefs_subseqs_seqs.fasta | sed 's/^>//' > "$PFXNAME"_droplist.txt
  • filter OUT the sequences to include only those NOT in the droplist.txt file
seqkit grep -v --pattern-file "$PFXNAME"_droplist.txt ./derep1seqs_"$PFXNAME"/dna-sequences.fasta > "$PFXNAME"_seqsForFullAlignment.fasta
  • generate the full alignment
export MAFFT_TMPDIR=$(pwd)   ## change this path to suit your needs (in case you forgot the first time!)
mafft --auto --addfull "$PFXNAME"_seqsForFullAlignment.fasta --keeplength --thread -1 "$PFXNAME"_primer_MSA > "$PFXNAME"_full_MSA
  1. Run the Python script to create a fasta file with only the primer-coordinate positions remaining in the alignment. Your values for “-s” and “-e” will differ here if using different input data! Check the .map file!!
python extract_alignment_region.py \
  -i "$PFXNAME"_full_MSA \
  -o "$PFXNAME"_coordinateTrimmed_MSA.fasta \
  -s 690 \
  -e 1049
  1. Remove gap characters, filter for minimum and maximum lengths, and drop the primer sequences. We’re specifying a maximum and minimum sequence length in this filtering step so that the remaining reference sequences from NCBI match the same lengths as available in the BOLD COI reference set. I stumbled across a quirk specific only to those COI sequences gathered from NCBI that did not contain the "BARCODE" keyword - the NCBIdata_notBOLD data. There are a large number (tens of thousdands) of sequences with a length around 250 bp - it’s still not clear why this is the case, but I wrote a few of these observations in a separate QIIME forum post here.
  • make a list of sequence headers to drop:
grep PRIMER "$PFXNAME"_coordinateTrimmed_MSA.fasta | sed 's/>//' | sed 's/[ \t]*$//' > dropPrimers.txt
  • exclude these sequences from the alignment file
seqkit grep -v -f dropPrimers.txt "$PFXNAME"_coordinateTrimmed_MSA.fasta | \
seqkit seq --upper-case -w 0 -g --max-len 207 --min-len 170 > "$PFXNAME"_derep1_anml.fasta
  1. Import the primer-coordinate-trimmed reference sequences back as a QIIME object:
qiime tools import --input-path "$PFXNAME"_derep1_anml.fasta --output-path "$PFXNAME"_derep1_anml_seqs.qza --type 'FeatureData[Sequence]'
  1. Dereplicate these sequences a second time:
qiime rescript dereplicate \
  --i-sequences "$PFXNAME"_derep1_anml_seqs.qza \
  --i-taxa "$PFXNAME"_derep1_taxa.qza \
  --p-mode 'super' --p-derep-prefix --p-rank-handles 'silva' \
  --o-dereplicated-sequences "$PFXNAME"_anml_seqs.qza \
  --o-dereplicated-taxa "$PFXNAME"_anml_taxa.qza

So … how do these NCBI data compare to BOLD COI references?

For that answer, you’re going to have to wait to check out the forthcoming manuscript. More details coming shortly!

3 Likes