Building a COI database from BOLD references

Feel free to send me a direct message if you want to have a longer discussion on how to (potentially) set up a cloud compute option to get the job done. I’m no longer actively involved in COI research, but am happy to chat if you think there are certain tasks you need more info about - especially if it’s for something you’re tackling out in the Berkshires (unless you’re calling Worcester “western” Mass…) :slight_smile:

1 Like

Alright, here are the results. I did end up figuring out how to train a new version of bold_anml_classifier.qza. It is in the linked folder at the bottom and named bold_anml_classifier2.qza to avoid confusion, though it should be exactly the same as the original. This should be usable on QIIME2v2021.2.0 and feature-classifierv2021.2.0. Can’t make any promises past that. More details on how I trained it using less RAM are below (tl;dr use feature-classifier instead of rescript).

I got a little carried away and did comparisons between the classifiers trained on the full anml dataset (anml, ~740,000 records), the anml dataset filtered for all records listing “United States” or “Canada” as their location (anmlUSCA, ~300,000 records), and two (untrimmed) training datasets I had already compiled. The untrimmed datasets aren’t a perfect comparison since they were downloaded and filtered differently, but I think they’re at least a little informative. They were downloaded directly from the BOLD online database in March 2021 and then filtered with a Python script I’ll link at the bottom. The main filters are: remove duplicate records, remove any records with unallowed characters, remove records that aren’t COI-5P. The two downloaded datasets are all records from a search for “United States” & “Canada” (USCA, ~200,000 records) and all records from a search for a list of eastern states and provinces, which actually is a larger dataset (EastUSCAOnt, ~280,000 records). Exact search terms for each are linked below. I had to download some in chunks, which possibly affects what records were included. All comparisons are made based on a ~1000 sample library of bird fecal samples (and blanks from both DNA extraction & PCR) denoised with dada2 (~3500 ASVs). COI was amplified using ZBJ primers. For each naive-Bayes classifier, I ran them at a confidence threshold of 0.7, 0.5, and 0.3. I also used BLAST to classify the data using each of the reference datasets and a few percent identity thresholds as a kind of reference.

Comparing anml to anmlUSCA:
In general, anmlUSCA identified features to a higher taxonomic level, but it depended a bit on the confidence parameter. At 0.7 they were almost exactly the same (anml = 59% to species, anmlUSCA = 60% to species). At 0.5 it was 71% and 76% and at 0.3 it was 81% and 91% to species for anml and anmlUSCA respectively. There’s a comparison qzv linked at the bottom if you want to look at more details. Same general trend holds for BLAST. Looks to me like adding the training data from outside of the US and Canada reduces the confidence of the classifier by adding a bunch of sequences that actually don’t exist in the study area, but potentially that’s real uncertainty which the filtered dataset doesn’t capture.

Comparing to untrimmed data (also filtered differently):
At the species level, the naive-Bayes classifiers trained on trimmed data classifies more, but at higher taxonomic levels it’s more mixed. BLAST is the other way around though, classifying more to species with the untrimmed data. My guess is that’s mostly incorrect classifications in BLAST (matching sequences outside the target sequence), but it’s hard to know.

Training bold_anml_classifier2.qza:
Using feature-classifier I was able to train a naive Bayes classifier with less computing power. I don’t know the exact stats, but it took about 8.5 hours and maxed out at <65 GB of RAM (the total memory of the server I was on).

qiime feature-classifier fit-classifier-naive-bayes
–i-reference-reads bold_anml_seqs.qza
–i-reference-taxonomy bold_anml_taxa.qza
–p-classify–chunk-size 2000
–o-classifier bold_anml_classifier2.qza

Though I used a chunk size of 2000, a chunk size of 4000 also looked like it was going to peak at <64 GB RAM before I stopped it for unrelated reasons. Chunk size of 5000 got killed because I hit my ~64 GB RAM limit.

Last, here’s a link to all the files, code, etc, but first a note. 1) This probably goes without saying, but the data I’m using to compare classifiers is my thesis data, so anyone can feel free to use it for tinkering with classifiers, but nothing else. If you’re interested in it past that, feel free to reach out. The link: OSF | COI Database Cont.


Thank you very much for this complete tutorial. I really appreciate it!
I have just started following it because we are about to analyze some COI data.

I have an issue with the step 4.
I just run seqkit for the filtering of min/max lengths, but I have retained only 89555 sequences.
I used the bold_drep1_*.qza provided. I checked the sequences and it has 1718762.

The commands I have used are:

seqkit seq --min-len 660 --max-len 1000 -w 0 ./tmpdir_boldFullSeqs/dna-sequences.fasta > boldFull_lengthFiltd_seqs.fasta

grep -c '^>' boldFull_lengthFiltd_seqs.fastaBlockquote
The seqkit version v0.16.1

Do you have any idea of what can be the problem?
Thank you very much!

Hi @laugon ,
Just to verify, can you please print the output from seqkit stats on both the input and output files:

seqkit stats ./tmpdir_boldFullSeqs/dna-sequences.fasta
seqkit stats boldFull_lengthFiltd_seqs.fasta

Sure! thanks!

file                                       format  type   num_seqs        sum_len  min_len  avg_len  max_len
./tmpdir_boldFullSeqs/dna-sequences.fasta  FASTA   DNA   1,718,762  1,069,186,958      250    622.1    1,600

file                             format  type  num_seqs     sum_len  min_len  avg_len  max_len
boldFull_lengthFiltd_seqs.fasta  FASTA   DNA     89,555  66,860,166      660    746.6    1,000

Great, thanks. It certainly seems as if the length-based filtering is working as expected, but the total number of sequences retained is too few. One other quick test to see if every other sequence discarded is outside of those lengths would be to perform two quick summaries:

seqkit seq --max-len 659 | seqkit stats
seqkit seq --min-len 1001 | seqkit stats

As I understand it, the problem is that your currrent output when filtering between 660-1000 bp results in 89,555 from the initial 1,718,762 sequences. That would indicate we are leaving out 1,629,207 sequences. When you run the above two seqkit stats options, check to see if the sum of those outputs match that value (1,629,207). If it's less than that, something is amiss with how the seqkit filtering step is working.

I have obtained exactly 1629207 seq from the sum of the two commands.

file  format  type   num_seqs      sum_len  min_len  avg_len  max_len
-     FASTA   DNA   1,598,172  959,516,705      250    600.4      659

file  format  type  num_seqs     sum_len  min_len  avg_len  max_len
-     FASTA   DNA     31,035  42,810,087    1,001  1,379.4    1,600

It seems that the majority of the seqs is bellow 660 threshold, but it should not right?

Thank you very much,

This is the step I think we're working on:

Its appears that we should expect 669,615 sequences, and you have far fewer passing through the seqkit filter. Because you have the expected number of raw sequences (1,718,762), the only thing I can think of is that something is amiss with how the raw sequences are structured. However, seqkit counted the expected number of sequences correctly, so it's not like the file is corrupted (and you were getting, say, zero sequences).

Without further having access to the ./tmpdir_boldFullSeqs/dna-sequences.fasta file that you have produced, I can't think of anything else to suggest to get the expected number of sequences. Maybe something is amiss with seqkit (though I highly doubt it). Perhaps you could calculate the lengths of your fasta sequences with a different tool, and filter with an alternative method. If you get the same number of sequences, then we can at least rule out seqkit as a suspect in this filtering mystery.

If your fasta sequences aren't line wrapped, you might try something like...

cat ./tmpdir_boldFullSeqs/dna-sequences.fasta | \
paste - - | cut -f 2 | awk '{print length}' > seqlengths.txt get a list of all the sequence lengths. You could then figure out how many sequences you have at some length threshold or another with commands like:

awk '$1 < 660' seqlengths.txt | wc -l
awk '$1 > 1000' seqlengths.txt | wc -l

You might also explore using RESCRIPt to perform a length-based filtering on the .qza file instead. One thing you might try to do is just append one additional operation between the end of Step3 and Step4, performing a length-based filtering before starting in with the giant alignment work. You'd just add another qiime rescript filter-seqs-length argument, just like is performed in Steps 1-2 earlier in the workflow, but with different values. In this case...

qiime rescript filter-seqs-length \
    --i-sequences bold_derep1_seqs.qza \
    --p-global-min 660 \
    --p-global-max 1000 \
    --o-filtered-seqs bold_derep1_lengthfiltKeepers_seqs.qza \
    --o-discarded-seqs bold_derep1_lengthfiltDiscards_seqs.qza

You should then be able to start up with the bold_derep1_lengthfiltKeepers_seqs.qza file to check and see if the expected number of sequences were retained.

1 Like

I have been trying to understand what´s goin on, but I can not see the problem.
I checked the line breaks in the dna-sequences.fasta and I changed them into one line fastas. I run the command cat..... > seqlengths.txt. This is the head of the file:

head seqlengths1.txt

It looks ok, I think. I checked with the awk and wc -l but I am still getting the same numbers:

awk '$1 < 660' seqlengths1.txt | wc -l
 awk '$1 > 1000' seqlengths1.txt | wc -l

It seems that most of the sequences are around 658, so maybe I have to lower the threshold for the filtering? I am attaching the seqlenghts1.txt and a subsample of the dna-sequences1.fasta.

seqlengths1.txt (6.6 MB)
subsample.dna-sequences1.fasta.txt (320.6 KB)

Thank you very much for all your help.

You're exactly right about the 658 bit @laugon:

I'm still not sure what could be going wrong here, but without a doubt, the inclusion of a sequence of length 658bp is going to retain very many more sequences.

Strangely, I can't seem to find a way to arriving at the expected 669,615 example myself. After downloading the seqlengths1.txt file you shared, I tried simply counting how many of each length there were:

sort seqlengths.txt | uniq -c | sort -k2,2n > seqlengths_counts.txt

... and then using awk to count how many were at a certain length or less:

awk '$2 < 601' seqlengths_hist.txt | awk '{sum += $1} END {print sum}'

You can see that by retaining all sequences at least 600bp long, you get many more sequences than what you have retained earlier, by specifying them to be at least 660 bp long.

Perhaps your best next step would be to carry the process forward using a slightly reduced --min-len parameter with Seqkit?

Alternatively, give the RESCRIPt plugin a shot with a compatible QIIME2 version. You can install it all together - easiest way for me is to create a new QIIME2 environment via Conda, then install RESCRIPt dependencies in that same conda environment - see here

I have not, because I do not have the plugin. I am using qiime2 2019.10 and I am not sure if it is compatible. I am going to check but it seems I need 2021.4 or later.
Thank you very much!

An off-topic reply has been split into a new topic: Trimming sequences for classifier training

Please keep replies on-topic in the future.

I am confused about the classifier file bold_full_ArthOnly_classifier.qza. Can I extract or export the fasta sequences from this?
Where would I get the corresponding files maybe named bold_full_ArthOnly_classifier_seqs.qza and bold_full_ArthOnly_classifier_taxa.qza?

Were you ever able to figure out how to do this? It seems as though the arthonly classifier was generated with scikit-learn version 0.21.2. I'm not sure how to downgrade this. So my other option was to re-generate it, but the seq/tax files were not included. Maybe Devon would be willing to share?

Just a note that the R code used for retrieving sequence data from BOLD may sporadically cause records to be dropped; I saw a few instances in the R script where the parser had issues with quotes and had to do a workaround. See this issue, which was recently addressed in the bold R package on GitHub and should be available in an upcoming release.



I was attempting to build a new BOLD database and was wondering if there was a suggested resolution to the error that began appearing after the update to "2020.8"...besides reverting to a version before the update. Here is the error: There was a problem importing bold_rawSeqs_forQiime.fasta:

bold_rawSeqs_forQiime.fasta is not a(n) AlignedDNAFASTAFormat file:

The sequence starting on line 34 was length 597. All previous sequences were length 602. All sequences must be the same length for AlignedFASTAFormat.

@John I got around it by using seqkit directly and removing gaps:

seqkit seq -w0 -g bold_rawSeqs_forQiime.fasta > bold_rawSeqs_forQiime.degapped.fasta


qiime tools import \
  --input-path bold_rawSeqs_forQiime.degapped.fasta \
  --output-path bold_rawSeqs.qza \
  --type 'FeatureData[Sequence]

You have to be a bit careful in steps w/ seqkit and use -w 0, as the default FASTA output will be wrapped at 60 characters so some of the one-liner steps like the following will break:

grep -v '^>' bold_rawSeqs_forQiime.degapped.fasta |  \
  sed 's/^N*//' | rev | sed 's/^N*//' | awk '{print length}' | sort | \
  uniq -c | sort -k2,2n > seqlength_Ntrimmed_hist.txt

2 off-topic replies have been split into a new topic: error while making BOLD reference database

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: extract-reads: why no matches?

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: can BOLD references be used on invertebrates

Please keep replies on-topic in the future.