Building a COI database from NCBI references

devonorourke · September 14, 2020, 5:47pm

Citation:

If you use the following COI resources or RESCRIPt for COI database preparation, please cite the following:

Michael S Robeson II, Devon R O’Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. RESCRIPt: Reproducible sequence taxonomy reference database management for the masses. bioRxiv 2020.10.05.326504; doi: https://doi.org/10.1101/2020.10.05.326504

Preamble

I've been building various COI databases for my own research for the past few years using reference sequences from the Barcode of Life Data system (BOLD). My earlier attempts used resources outside of a QIIME environment to generate these databases: see one such paper using these methods here and the associated GitHub repo describing how these databases were constructed. Because I used QIIME for everything following that database construction (sequence processing, classification, diversity analyses, etc.), it was wonderful news when I heard about a series of tools being developed to build databases directly within QIIME - thanks RESCRIPt developers!. As a result, I recently wrote a tutorial in this forum describing how to build a BOLD COI database within QIIME. Nevertheless, the initial acquisition of BOLD references, plus several pre-processing steps, required steps outside of QIIME.

Fortunately, those savvy RESCRIPt developers offered another way to gather COI sequences - this time from NCBI GenBank instead of BOLD. Even better, they laid out a comprehensive tutorial on how to do it for a variety of data types. All that was missing on my end was applying their roadmap to get COI-specific data.

This tutorial is therefore a brief guide for users hoping to build a COI database from NCBI references instead of BOLD. As with the BOLD COI tutorial, I've also added some info on how you could uses these data to build primer-coordinate-specific databases.

Gathering COI data from NCBI

Gathering NCBI data using RESCRIPt is as easy as typing the get-ncbi-data function. However, the not-so-easy part of gathering NCBI data is that you need to consider what kind of filtering parameters you want to use. Fortunately, much more experienced users have suggested a series of parameters which were used in my data gathering tasks. Notably, I grouped COI data into two file sets because I wanted to see if there were any difference between data from NCBI GenBank that contained a term that indicates they are cross-referenced with BOLD references: "BARCODE"[KYWD]. Thus, when I was gathering data, my two search terms differed using a modified boolean operator (AND vs. NOT) on that search parameter:

## gather BOLD-tagged COI data from NCBI:
qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND "BARCODE"[KYWD])' \
--output-dir NCBIdata_BOLDonly

## gather nonBOLD-tagged COI data fron NCBI:
qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT "BARCODE"[KYWD])' \
--output-dir NCBIdata_notBOLD

It's worth considering additional filtering parameters for your own research. Additional terms may include things like:

AND mitochondrion[Filter]
NOT environmental sample[Title]
NOT environmental samples[Title]
NOT environmental[Title] -
NOT uncultured[Title]
NOT unclassified[Title]
NOT unidentified[Title]
NOT unverified[Title]

Short cut worth considering

One last thought in this initial data gathering step - notice I didn't specify any phylogenetic group? Thus, while the data I ultimately was analyzing contained only particular animal taxa, this initial raw data contains COI references for lots of different organisms - here's the Kingdom/Domain names listed in just the NCBIdata_notBOLD raw taxa:

k__Archaea;
k__Bacteria;
k__Bamfordvirae;
k__Environmental
k__Eukaryota;
k__Fungi;
k__Metazoa;
k__Orthornavirae;
k__Pararnavirae;
k__Viridiplantae;
k__Viruses;

I gathered all the raw data for a few different reasons - partly just to see what the heck was going to get pulled in from NCBI if I didn't specify certain taxa. Viral COI... who knew!? Nevertheless, if you have a specific taxonomic group in mind, you can save yourself a lot of time and disk space by specifying the groups of organisms. To do that, just look up the NCBI taxonomy ID for your group in question, and add --p-query txid####[ORGN] to the query search term (where #### represents the ID value in question).

For example, if I wanted to build an arthropod-specific COI database from NCBI sequence, I'd run this command:

qiime rescript get-ncbi-data \
--p-query 'txid6656[ORGN] AND (cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title])' \
--output-dir NCBIdata_ArthOnly

Filtering COI data from NCBI

Filtering for length and junk

I used a pair of RESCRIPt commands to filter each data set: cull-seqs to remove sequences with 5 or more degenerate nucleotides and homopolymer sequences at least 12 characters in length; filter-seqs-length to retain only sequences with (ungapped) lengths between 250 and 1600 bp. These parameters were applied to match those used in my earlier work with building a BOLD COI database. See the BOLD tutorial for an explanation on how you might want to investigate your own dataset to modify these particular parameter values.

The same parameters were applied for both NCBI data sets - I'm showing only one example here for the NCBIdata_notBOLD data.

qiime rescript cull-seqs \
    --i-sequences ./NCBIdata_notBOLD/NCBIdata_notBOLD_seq.qza \
    --p-num-degenerates 5 \
    --p-homopolymer-length 12 \
    --o-clean-sequences NCBIdata_notBOLD_ambi_hpoly_filtd_seqs.qza

qiime rescript filter-seqs-length \
    --i-sequences NCBIdata_notBOLD_ambi_hpoly_filtd_seqs.qza \
    --p-global-min 250 \
    --p-global-max 1600 \
    --o-filtered-seqs NCBIdata_notBOLD_ambi_hpoly_length_filtd_seqs.qza \
    --o-discarded-seqs NCBIdata_notBOLD_ambi_hpoly_length_discarded_seqs.qza

I ran into a challenge when filtering sequences initially because an earlier version of RESCRIPt's get-ncbi-data had a quirk where there could be a discrepancy between the number of references with sequences and the number of references with valid taxonomy names (i.e. empty taxonomy names) - see this post describing the problem. I was able to resolve this problem by using the taxonomy file itself as the input "metadata" file in a qiime feature-table filter-seqs command; suffice it to say, this is not a problem that will be encountered now, as the developer of the tool has resolved the issue.

Filtering for redundancy and possible ambiguity (dereplicating)

Each filtered data set was then dereplicated using RESCRIPt (showing notBOLD data only here as an example):

qiime rescript dereplicate --verbose \
  --i-sequences NCBIdata_notBOLD_ambi_hpoly_length_filtd_seqs.qza \
  --i-taxa ./NCBIdata_notBOLD/NCBIdata_notBOLD_taxa.qza \
  --p-mode 'super' \
  --p-derep-prefix \
  --p-rank-handles 'silva' \
  --o-dereplicated-sequences NCBIdata_notBOLD_derep1_seqs.qza \
  --o-dereplicated-taxa NCBIdata_notBOLD_derep1_taxa.qza

Further refining the reference sequences: creating primer-coordinate defined databases

In addition to generating these datasets, I wanted to also build databases that were trimmed using primer-coordinates specific to the amplicon region I use in my research. While the ANML primers I used in my research are specific to my amplicon data, these steps can be modified to whatever primer sequences you use in your own research. These steps were completed using functions outside of QIIME/RESCRIPt, and have already been described at length in the earlier BOLD tutorial (starting at Step 4). I show the same steps here, but include far fewer comments - see the BOLD tutorial for all the gory details. One other point - I'm showing the steps taking for an arbitrary reference collection (hence the $PFXNAME prefix) - the same steps are applied to both NCBI data set with similar commands.

The primer fasta file for coordinate trimming, anml_primers.fasta, is available here.
The Python script to trim the coordinates, , is avaiable here. This file was created by a COI wizard, Mike Robeson, and is just one part of a more comprehensive pipeline he has documented here.

Export the sequence and taxonomy files

## In my particular case, "$PFXNAME" could represent either the "NCBIdata_notBOLD" or "NCBIdata_BOLDonly" groups...
qiime tools export --input-path "$PFXNAME"_derep1_seqs.qza --output-path derep1seqs_"$PFXNAME"
qiime tools export --input-path "$PFXNAME"_derep1_taxa.qza --output-path derep1taxa_"$PFXNAME"

Create a subset of sequences for generating a reference alignment:

filter available seqs by length

seqkit seq --min-len 660 --remove-gaps --max-len 1000 -w 0 ./derep1seqs_"$PFXNAME"/dna-sequences.fasta > "$PFXNAME"_lengthFiltd_seqs.fasta

generate a list of only particular taxa to make initial reference alignment with fewer gaps (I've arbitrarily selected only those with species-named taxa from the order Lepidoptera)

grep 'p__Arthropoda; c__Insecta; o__Lepidoptera;' ./derep1taxa_"$PFXNAME"/taxonomy.tsv | \
grep -v 's__$' | grep -v 's__.*sp.$' | grep -v '-' | cut -f 1 > "$PFXNAME"_species.list

filter the sequences to include only those in the .list

seqkit grep --pattern-file "$PFXNAME"_species.list "$PFXNAME"_lengthFiltd_seqs.fasta > "$PFXNAME"_lengthNtaxaFiltd_seqs.fasta

keep only those with no ambiguous bases

seqkit fx2tab -n -i -a "$PFXNAME"_lengthNtaxaFiltd_seqs.fasta | awk '{if ($2 == "ACGT") print $1}' > $SEQDIR/"$PFXNAME"_nonAmbig_featureIDs.txt
seqkit grep --pattern-file "$PFXNAME"_nonAmbig_featureIDs.txt "$PFXNAME"_lengthNtaxaFiltd_seqs.fasta -w 0 > "$PFXNAME"_lengthNtaxaNambigFiltd_seqs.fasta

subsample to 2,000 sequences to build reference alignment

seqkit sample --rand-seed 101 --number 2000 --two-pass "$PFXNAME"_lengthNtaxaNambigFiltd_seqs.fasta -w 0 > "$PFXNAME"_mafftRefs_subseqs_seqs.fasta

Create a reference alignment:

export MAFFT_TMPDIR=$(pwd)   ## change this path to suit your needs
mafft --auto --thread -1 "$PFXNAME"_mafftRefs_subseqs_seqs.fasta > "$PFXNAME"_reference_MSA

Align the primers to the small alignment and identify the coordinate positions:

mafft --multipair --addfragments anml_primers.fasta --keeplength --thread -1 --mapout --reorder "$PFXNAME"_reference_MSA > "$PFXNAME"_primer_MSA

Align the remaining sequences to the reference alignment file:

create a list of the sequences to drop:

grep '^>' "$PFXNAME"_mafftRefs_subseqs_seqs.fasta | sed 's/^>//' > "$PFXNAME"_droplist.txt

filter OUT the sequences to include only those NOT in the droplist.txt file

seqkit grep -v --pattern-file "$PFXNAME"_droplist.txt ./derep1seqs_"$PFXNAME"/dna-sequences.fasta > "$PFXNAME"_seqsForFullAlignment.fasta

generate the full alignment

export MAFFT_TMPDIR=$(pwd)   ## change this path to suit your needs (in case you forgot the first time!)
mafft --auto --addfull "$PFXNAME"_seqsForFullAlignment.fasta --keeplength --thread -1 "$PFXNAME"_primer_MSA > "$PFXNAME"_full_MSA

Run the Python script to create a fasta file with only the primer-coordinate positions remaining in the alignment. Your values for "-s" and "-e" will differ here if using different input data! Check the .map file!!

python extract_alignment_region.py \
  -i "$PFXNAME"_full_MSA \
  -o "$PFXNAME"_coordinateTrimmed_MSA.fasta \
  -s 690 \
  -e 1049

Remove gap characters, filter for minimum and maximum lengths, and drop the primer sequences. We're specifying a maximum and minimum sequence length in this filtering step so that the remaining reference sequences from NCBI match the same lengths as available in the BOLD COI reference set. I stumbled across a quirk specific only to those COI sequences gathered from NCBI that did not contain the "BARCODE" keyword - the NCBIdata_notBOLD data. There are a large number (tens of thousdands) of sequences with a length around 250 bp - it's still not clear why this is the case, but I wrote a few of these observations in a separate QIIME forum post here.

make a list of sequence headers to drop:

grep PRIMER "$PFXNAME"_coordinateTrimmed_MSA.fasta | sed 's/>//' | sed 's/[ \t]*$//' > dropPrimers.txt

exclude these sequences from the alignment file

seqkit grep -v -f dropPrimers.txt "$PFXNAME"_coordinateTrimmed_MSA.fasta | \
seqkit seq --upper-case -w 0 -g --max-len 207 --min-len 170 > "$PFXNAME"_derep1_anml.fasta

Import the primer-coordinate-trimmed reference sequences back as a QIIME object:

qiime tools import --input-path "$PFXNAME"_derep1_anml.fasta --output-path "$PFXNAME"_derep1_anml_seqs.qza --type 'FeatureData[Sequence]'

Dereplicate these sequences a second time:

qiime rescript dereplicate \
  --i-sequences "$PFXNAME"_derep1_anml_seqs.qza \
  --i-taxa "$PFXNAME"_derep1_taxa.qza \
  --p-mode 'super' --p-derep-prefix --p-rank-handles 'silva' \
  --o-dereplicated-sequences "$PFXNAME"_anml_seqs.qza \
  --o-dereplicated-taxa "$PFXNAME"_anml_taxa.qza

So ... how do these NCBI data compare to BOLD COI references?

For that answer, you're going to have to wait to check out the forthcoming manuscript. More details coming shortly!

Margreet · November 19, 2020, 12:09pm

This sounds great! I am very curious how these NCBI data will be compared to the BOLD COI references.

I just got a small question: up till now I have used taxonomic classifiers created by you (such as the `bold_full_ArthOnly_classifier.qza'). This worked very good! However, I am wondering if the NCBI data will be more extensive.. Is there a reason why you should choose for the COI BOLD references or the COI GenBank (NCBI) references?

Thanks in advance!

Margreet

devonorourke · November 19, 2020, 1:35pm

Hi @Margreet,

I attempted to evaluate these questions in this manuscript:

How these databases compare depends on what kind of taxonomic group you're working with, how you go about filtering/trimming the reference sequences, and whether or not you cluster them. There's not one simple answer, unfortunately. Nevertheless, I don't see any reason to fault a user that uses either NCBI or BOLD to build their databases. In fact, I think the next step would be to try to combine their resources.

colinbrislawn · December 30, 2023, 11:46pm

An off-topic reply has been split into a new topic: building a (newer!) COI database from NCBI

Please keep replies on-topic in the future.