Building a COI database from BOLD references

:construction: :stop_sign::construction: :stop_sign::construction: :stop_sign::construction: :stop_sign::construction: :stop_sign:

Citations:

If you use the following COI resources or RESCRIPt for COI database preparation, please cite the following:

  • Michael S Robeson II, Devon R O’Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. RESCRIPt: Reproducible sequence taxonomy reference database management for the masses. bioRxiv 2020.10.05.326504; doi: https://doi.org/10.1101/2020.10.05.326504

  • O'Rourke, DR, Bokulich, NA, Jusino, MA, MacManes, MD, Foster, JT. A total crapshoot? Evaluating bioinformatic decisions in animal diet metabarcoding analyses. Ecol Evol. 2020; 00: 1– 19. https://doi.org/10.1002/ece3.6594

Preamble

This tutorial is a spinoff of the amazing work done by the developers behind RESCRIPt, and uses many of the same steps illustrated in that tutorial. Shout out to @BenKaehler, @thermokarst, @Nicholas_Bokulich, and @SoilRotifer (and anyone else I missed - just let me know!)

Much like the RESCRIPt tutorial, the aim of this document is to provide an example of how to build and evaluate a reference database - but unlike the other RESCRIPt tutorial, we'll focus our attention on Cytochrome Oxidase I (COI) sequence data.

Available resources

Don't want to bother with creating your own database? I hear you... Or, maybe you do want to give it a shot, but can't be bothered to download all the raw data? Yep, that's also a pain. Tired of long tutorials!? Too bad - keep reading! For those of you who want to get your hands on some data right away, I've placed a few different files that you can use to either (1) build your own database from raw BOLD sequences, or (2) use directly in your classification steps in a QIIME 2 environment.

  1. Sequence data, taxonomy data, and associated metadata were acquired from The Barcode of Life Data System BOLD between July 1, 2020 (happy Canada Day!) and August 8, 2020 (happy International Cat Day!). Information on the process used to gather the data is documented below.
  • BOLD sequences were saved as bold_rawSeqs.qza, available here
  • BOLD taxonomy data was saved as bold_rawTaxa.qza, available here
  • Associated metadata is also available, saved as bold_rawMetadata_forQiime.tsv.gz, available here

Important caveat #1: these raw data contain all of the publicly available COI sequences I could scrape, so it's highly advised that you filter these before using them to classify your data. I've provided a number of examples of the kinds of filtering considerations to apply in this tutorial, but you might find yourself wanting to quickly reduce this massive dataset by filtering for particular taxa, or perhaps using the metadata to filter by the country in which specimens were collected. However you choose to use these raw data, I would strongly encourage you to filter them prior to classifying anything - it'll improve the quality of classification, and speed it up to boot.
  2. Or maybe you couldn't be bothered to filter it all yourself either! I've got you covered - just remember to throw some good karma to your friendly COI database creator, okay? The raw BOLD sequences were initially filtered for ambiguous nucleotide content (5 or more N's), long homopolymer runs (12 or more bases), and very short (< 250 bp) or very long (> 1600 bp) sequences, and then dereplicated. This produced a pair of files with the prefix bold_derep1:

You might be inclined to use those sequences to classify your COI data; however, it's likely that trimming those references to contain only the sequence region spanned by your own data would provide a better classification estimate. In other words, while these reference sequences are often between 500-650 bp, the primers you've used to generate your COI reads probably don't span that full length. If, for example, you used a primer pair that amplifies a subset of that region, you should follow the section in the tutorial below that shows how to take these bold_derep1 files and create that primer-specific reference dataset.

I've included one such example of a primer-trimmed COI database. This one is specific to the ANML primers we use in our lab for bird and bat diet analyses. Important caveat #2: don't use these artifact files to classify your COI sequences if you used a different primer set! These primer-trimmed files include:

In addition, I created two naive Bayes classifier objects: (1) a classifier using all references from the bold_anml data, and (2) a classifier using only Arthropod references from the bold_full data. If you want to classify your sequences with sklearn, you can use the following files, but be aware of what these references do and do not contain with respect to sequence composition and reference taxonomy information :mag: :

As with the raw BOLD data, these bold_derep and bold_anml sequence and taxonomy artifacts consist of BOLD records for Animal, Protist, and Fungi taxa. You can certainly modify these by filtering for certain taxonomic groups or other metadata.
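If, for example, you only wanted the Arthropod records from the bold_derep1 artifacts, a filtering sketch along these lines should do the trick (the file names are the ones linked above; 'p__Arthropoda' is just one possible include term - adjust to your group of interest):

## keep only Arthropod sequences (illustrative example)
qiime taxa filter-seqs \
  --i-sequences bold_derep1_seqs.qza \
  --i-taxonomy bold_derep1_taxa.qza \
  --p-include 'p__Arthropoda' \
  --o-filtered-sequences bold_derep1_ArthOnly_seqs.qza

## keep the matching taxonomy records
qiime rescript filter-taxa \
  --i-taxonomy bold_derep1_taxa.qza \
  --p-include 'p__Arthropoda' \
  --o-filtered-taxonomy bold_derep1_ArthOnly_taxa.qza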

Background

Cytochrome Oxidase I is a commonly studied marker gene among animals (and other macro and microeukaryotes including plants, fungi, and protists!). The Barcode of Life Data System BOLD is one repository of COI sequence information. Unfortunately, the last versioned release of COI data came in 2015; however, BOLD offers a variety of means with which to gather individual COI records. The following tutorial will show how to:

  1. Obtain BOLD COI sequences and associated taxonomy information
  2. Reformat these data and import into QIIME
  3. Filter these data in QIIME using options available via the RESCRIPt plugin
  4. Evaluate the database using options via RESCRIPt
  5. Create a naive Bayes classifier using RESCRIPt, which can then be used in downstream classification of your particular COI experiment

Gathering data - overview

Gathering data from BOLD is really simple to do for a handful of species or sequences - just navigate to their taxonomy browser, search for some taxa, click on the "public data" icon once the search is complete, then download the combined sequence and taxonomy records via the red "combined" button (in either .xml or .tsv format). You can, in principle, use this same method to download the entire BOLD database, but it can take between 2-4 days with a decent internet connection (I've tried), and be warned that if you lose signal, you'll start all over again (I have).

Fortunately BOLD offers another option which is scriptable, and thanks to the wonderful Scott Chamberlain and the ROpenSci community, there's a particular R package, the eponymous bold R library, that can be used to access the Barcode of Life Database from the command line. I created an R script, bold_datapull_byGroup.R, that gathers all COI records from BOLD by pulling all public records matching Animal, Fungi, and Protist taxa. Unfortunately, you can't create a single search term for all BOLD COI records, as the Barcode of Life's API times out if you query too many records at one time. As a result, and after a bunch of trial and error, I ended up breaking these COI data into a number of smaller groups that I then combined outside of R. The output of this R script, for each group, is a pair of files that contain:

  • Sequence information and taxonomic information (used to create a fasta file). The sequence data contains gaps, and while these are dealt with once imported into QIIME, note that if you're trying to combine these raw fasta records with other datasets that don't contain gaps, you might run into errors when importing into QIIME when mixing gapped (ex. AT--GCG) and ungapped strings (ex. ATGCG).
  • Metadata including the sequenceID, processid, bin_uri, genbank_accession, country, institution_storing records. While the sequenceID represents a unique sequence identifier within BOLD, the processid applies to the larger specimen itself (as BOLD records may contain images, collection information, and other metadata), while the bin_uri record indicates the BOLD BIN number (which may or may not include multiple records with common or distinct sequence and taxonomic information!). I kept these data in my R script because it seemed like those metadata may be useful when filtering the database (more on one such application below), but I leave it up to the user to get creative in how they use this information for their own purposes.

Gathering data - the easy way

After running the R script, the raw data contained in the individual .csv files were combined and filtered to ensure no duplicate records were retained. The raw fasta and taxonomy text files were then converted into QIIME .qza formats. Links to these raw files are listed in the Available resources section above.

Gathering data - the gory details

Start by executing the R script, bold_datapull_byGroup.R. This creates 10 distinct pairs of .csv files ending with seqNtaxa.csv (for sequence and taxonomic information), and meta.csv (for metadata information). When I first got started building a COI database for analyzing bat diets, I figured I should keep only the arthropod COI data (because all my hosts were consuming spiders and insects). Nevertheless, I learned that incorporating both expected diet COI data as well as unexpected COI sequences can be useful in data filtering. The chordate records, for instance, included bat COI data that helped me remove host sequences. Likewise, I've noticed that when you sample in an open environment it's possible for other things like fungal DNA to get amplified by primers you thought were specific to a certain taxon. In short, I haven't found a compelling reason to discard COI data when compiling a database purely based on an unexpected taxonomy, but if someone knows of one I'd love to hear it. Notably, we can and will absolutely remove COI records based on other quality control factors (more on that below).

Once the R script is finished (expect somewhere between 4-24 hours), combine all seqNtaxa.csv files and reformat into a fasta-formatted file (expect ~2 minutes to finish!):

cat *.seqNtaxa.csv | \
grep -v 'sequenceID,taxon,nucleotides' | \
awk -F ',' '{OFS="\t"};{print $1";tax="$2, $3}' | \
sed 's/^/>/g' | \
tr '\t' '\n' > bold_allrawSeqs.fasta

There are a total of 3,518,537 sequences - that's a lot of COI data!

I wanted to make sure there weren't any duplicate BOLD IDs imported by the process (spoiler, there were...). After a bit of troubleshooting (see below), it looks as if these are very rare, and they occur because there are multiple metadata entries with very similar but not identical records. They contain the same exact sequences and the same exact taxonomic strings, but odd differences in other metadata. To check on these curious cases of duplicate sequenceIDs, I first created a list of all the possible duplicate records:

grep '^>' bold_allrawSeqs.fasta | cut -f 1 -d ';' | sed 's/^>//' | sort | uniq -c | awk '{if ($1>1) print $0}' > duplicated_bold_records.txt

There were 41 instances where a BOLD sequenceID occurs more than once. In all 41 instances, the taxonomic information is identical for each, and there are only single duplicate entries. In other words, all instances where a BOLD ID occurs more than once are instances where it occurs exactly two times. These aren't cases where we have different taxonomies assigned to the same BOLD ID - good. But I wanted to confirm whether these duplicate entries are also entirely duplicated records. In particular, do they contain the same sequence information?

for comp in $(awk '{print $2}' duplicated_bold_records.txt); do
  str1=$(grep -w $comp bold_allrawSeqs.fasta -A 1 | grep -v '^>' | grep -v '^--' | head -1);
  str2=$(grep -w $comp bold_allrawSeqs.fasta -A 1 | grep -v '^>' | grep -v '^--' | tail -1);
  if [ "$str1" == "$str2" ]; then
      echo $comp "Strings are equal"
  else
      echo $comp "Strings are not equal"
  fi
done

note the additional "sort | uniq" line added in making the raw fasta file - it removes those 41 entries...

It turns out every one of the 41 duplicated sequence records contains identical sequence strings (phew!). In this case, we can just drop the duplicate records. I opted to rerun the command generating the raw fasta file once more, this time removing the duplicate entries:

cat *.seqNtaxa.csv | \
grep -v 'sequenceID,taxon,nucleotides' | \
sort | uniq | \
awk -F ',' '{OFS="\t"};{print $1";tax="$2, $3}' | \
sed 's/^/>/g' | \
tr '\t' '\n' > bold_allrawSeqs.fasta

Now there are just 3,518,496 sequences, thus the 41 duplicated records have been removed.

One final quality control check: it turns out that importing these raw data into QIIME will work, but using these data will fail because of some non-IUPAC characters lingering in the sequences themselves. The following script will search through every sequence and determine whether or not it contains a character outside of the accepted IUPAC character set. We then drop any sequence containing those bad characters, and generate our final file. Specifically, the valid characters are the set: ['T', 'H', 'A', 'D', 'G', 'R', 'C', '.', 'S', 'M', 'B', 'N', '-', 'W', 'V', 'K', 'Y']

## getting the line numbers of the bad sequences
grep -v '^>' bold_allrawSeqs.fasta | grep -n '[^THADGRC.SMBNWVKY-]' > badseqs.txt

## reformatting these line numbers to delete
for val in $(cut -f 1 badseqs.txt -d ':'); do
myvar1=$(($val*2))
myvar2=$(($myvar1-1))
echo $myvar2"d"
echo $myvar1"d"
done | tr '\n' ';' | sed 's/.$//' > droplines.txt

## deleting these line numbers
myval=$(cat droplines.txt)
sed -e $myval bold_allrawSeqs.fasta > cleaned_bold_allrawSeqs.fasta

Now there are 3,518,463 sequences remaining; 33 weird sequences were dropped. Not bad!

Finally, I used that cleaned fasta file to generate the pair of fasta and taxonomy text files in the format that QIIME accepts for import:

## First make the fasta file
sed 's/;tax=k.*//' cleaned_bold_allrawSeqs.fasta > bold_rawSeqs_forQiime.fasta

## Next make the taxonomy file
grep '^>' cleaned_bold_allrawSeqs.fasta | sed 's/^>//' | \
awk '{for(i=1;i<2;i++)sub(";","\t")}1' > bold_rawTaxa_forQiime.tsv

We'll now import two of these files as qiime objects for building our COI database:

  • bold_rawSeqs_forQiime.fasta is used to create a sequence .qza artifact (type == FeatureData[AlignedSequence])
  • bold_rawTaxa_forQiime.tsv is used to create the taxonomy .qza artifact associated with the fasta sequences (type == FeatureData[Taxonomy]). Because the raw BOLD sequence data may contain gaps, we'll need to import it slightly differently than a standard fasta (that's why we use --type 'FeatureData[AlignedSequence]' instead of the more typical --type 'FeatureData[Sequence]'):
## import sequence data
qiime tools import \
  --input-path bold_rawSeqs_forQiime.fasta \
  --output-path bold_rawSeqs.qza \
  --type 'FeatureData[AlignedSequence]'
## import taxonomy data
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path bold_rawTaxa_forQiime.tsv \
  --output-path bold_rawTaxa.qza

One last thing before we jump into further processing these database .qza files in QIIME. There is metadata containing the country where a sequence was collected, as well as its (potential) GenBank accession ID, BOLD BIN number, etc., generated from the same bold_datapull_byGroup.R script used to generate the sequence and taxonomy information. These are the *meta.csv files. We combine and reformat these metadata files, and once the reformatting is complete as a standard text file, we're ready to use it directly in QIIME without any further importing as a .qza object.
Why bother with these metadata? You might want to remove any COI sequence that lacks a country of origin, or perhaps you want to restrict your database only to those sequences identified from Costa Rica. Maybe you want to discard any BOLD sequence mined from GenBank. You can do all of that by using the metadata information and following the metadata-based filtering documentation in QIIME.
So, how do we create this metadata file?

cat *.meta.csv | \
grep -v 'sequenceID,processid,bin_uri,genbank_accession,country,institution_storing' | \
awk '{for(i=1;i<6;i++)sub(",","\t")}1' | sort -uk1,1 >  bold_rawMetadata_forQiime.tsv

You can use the bold_rawMetadata_forQiime.tsv.gz file as you see fit. I'm curious to hear how users might apply it.
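As one hedged example - assuming you've added a QIIME-compatible header line to the metadata file (the commands above strip the original header) - you could keep only the sequences collected in a particular country with the standard feature-table filtering machinery; the column name and output file below are illustrative:

## hypothetical example: retain only sequences whose metadata lists Costa Rica as the country
qiime feature-table filter-seqs \
  --i-data bold_derep1_seqs.qza \
  --m-metadata-file bold_rawMetadata_forQiime.tsv \
  --p-where "country='Costa Rica'" \
  --o-filtered-data bold_costaRica_seqs.qza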

RESCRIPt work

At this point I have hopefully shown you how to complete the first two tasks:

  1. Obtain BOLD COI sequences and associated taxonomy information
  2. Reformat these data and import into QIIME

The next step is to apply some sensible filters to these raw sequences. While there isn't a single best practice applicable to every user's experiment, I think there is a common set of quality controls that most researchers might want to consider:

  • removing sequences with a high number of ambiguous bases (N)
  • removing sequences with an excessive number of homopolymers
  • removing extremely short sequences
  • removing redundant sequences (dereplicating)

For more on each of these considerations (and more!) see the RESCRIPt tutorial using a SILVA 16S dataset.

The first step in our workflow isn't even listed above - we need to remove the gaps in the BOLD sequences. This isn't necessarily a step other users will need if they're importing their own data from other sources, but if you have a "-" anywhere in a sequence, it needs to be removed prior to building a classifier. Thankfully, RESCRIPt can tackle this. In the command below, we'll remove the gaps (if present) in each sequence. In addition, the --p-min-length parameter gives us an option to remove extremely short sequences. How short is too short? I wasn't sure when I first started this, and that led me to look at two things:

  1. What does the distribution of sequence lengths of the degapped sequences look like?
  2. How would these distributions change if I trimmed the leading and trailing ambiguous characters (N) from the sequences (leaving any internal ambiguous sites alone for now)?

I set the minimum length to 1 so that I could first obtain these degapped raw sequences (this took about an hour):

qiime rescript degap-seqs \
--i-aligned-sequences bold_rawSeqs.qza \
--p-min-length 1 \
--o-degapped-sequences bold_degapSeqs_min1.qza

Once I had those sequences, I exported the bold_degapSeqs_min1.qza object back to a standard fasta file, then created a frequency table of the untrimmed sequence lengths:

qiime tools export --input-path bold_degapSeqs_min1.qza --output-path degap_tmpdir
grep -v '^>' ./degap_tmpdir/dna-sequences.fasta | awk '{print length}' | sort | uniq -c | sort -k2,2n > seqlength_untrimmed_hist.txt

Then, I repeated that process but first had to remove the leading/trailing ambiguous nucleotides:

grep -v '^>' ./degap_tmpdir/dna-sequences.fasta | \
sed 's/^N*//' | rev | sed 's/^N*//' | \
awk '{print length}' | sort | uniq -c | sort -k2,2n > seqlength_Ntrimmed_hist.txt

Plotting these frequencies (below) illustrates that the vast majority of our sequences are in the range of about 500-650 bp, but there are instances of very short (as low as 54 bp) and very long (as long as 7659 bp) COI fragments. It also turns out that the leading/trailing N-trimming didn't make a big difference in the overall distribution of fragment sizes.

[figure: distribution of COI fragment lengths]

The table below highlights that nearly all of our data is retained if we apply a filter under 200 bp, whether or not the N's were trimmed initially. As expected, the N-trimming results in a small decrease in the number of sequences retained relative to untrimmed data at each length threshold, but this effect is generally very minor. For example, if we were to filter at a 500 bp length threshold, the difference is about 1/10th of a percent. In absolute terms, that's 285 fewer sequences among the N-trimmed data (3,510,119 untrimmed vs. 3,509,834 trimmed):

Min_len (bp)    N-trimmed    Untrimmed
50              1.000        1.000
100             0.999        0.999
150             0.998        0.998
200             0.994        0.994
250             0.991        0.992
300             0.982        0.983
350             0.976        0.977
400             0.960        0.961
450             0.936        0.937
500             0.870        0.871
550             0.559        0.562
600             0.400        0.409
650             0.036        0.036
700             0.031        0.032
750             0.029        0.029

These data indicate that length-based filtering isn't going to be a particularly thorny issue, which is expected given that BOLD as a database is interested in acquiring most data in the 500-650 bp range for submission acceptance. Nevertheless, I wasn't sure at all about how homopolymer-based filtering would work, and what those expectations might be. While RESCRIPt has a function to perform this kind of filtering, the parameter is a single integer applied to all bases equally. Given that different regions of COI might have different homopolymer variation, and that the database itself might consist of data from different sequencing platforms that are more or less susceptible to creating homopolymer errors, I tallied the number of homopolymers in every sequence in the degapped dataset, for homopolymer lengths between 5-15 bp, for each character. This required creating a fasta file, homopolymer_motifs.fasta, that looked a bit like this (a quick way to generate it follows the example):

>seq1
AAAAA
>seq2
AAAAAA
>seq3
AAAAAAA
...
>seq42
TTTTTTTTTTTTT
>seq43
TTTTTTTTTTTTTT
>seq44
TTTTTTTTTTTTTTT
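
Here's a minimal sketch of one way to generate that motif file on the command line (the seq1, seq2, ... names are arbitrary):

## generate homopolymer motifs of length 5-15 for each DNA base
i=0
for base in A C G T; do
  for len in $(seq 5 15); do
    i=$((i+1))
    echo ">seq${i}"
    printf "${base}%.0s" $(seq 1 ${len}); echo
  done
done > homopolymer_motifs.fasta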

I used a package, seqkit, which makes searching the patterns in the fasta file easy:

## seqkit locate output columns: seqID, patternName, pattern, strand, start, end, matched
seqkit locate -P -f homopolymer_motifs.fasta ./degap_tmpdir/dna-sequences.fasta --threads 4 | \
awk -v OFS="\t" 'NR>1 {print $1, $3}' | awk '{print $1, length($2)}' | \
sort | uniq | awk '{print $2}' | sort | uniq -c > homopolymer_motifs.table

importantly, we're not double counting instances where a sequence may contain multiple homopolymers of different DNA bases with the same length - we're collapsing that information and counting the number of sequences with one or more homopolymers of any non-ambiguous DNA base character

The homopolymer_motifs.table file is then used to generate the plot below. We see that using a homopolymer threshold of 8 or less would result in losing the majority of our sequences - let's not do that! Instead, a value of 9 or greater would likely suffice to remove sequences with suspiciously long homopolymer runs. Note that the values above the bars indicate the number of sequences with at least 1 homopolymer of that length - for example, there are 89 sequences that have at least one A, T, C, or G base repeated 15 consecutive times!

[figure: number of sequences containing at least one homopolymer, by homopolymer length (5-15 bp)]

So where does this leave us in creating the database? There are many tasks we still need to address, some of which I think are broadly applicable to every user regardless of their end goal, some of which are user-specific:

General

  1. Removing sequences outside of some length threshold (lower and/or higher!)
  2. Removing trailing/leading ambiguous nucleotides.
  3. Removing sequences with large numbers of "internal" ambiguous nucleotides.
  4. Removing sequences with large stretches of homopolymers.

User specific

  5. Dereplicating the data.
  6. Retaining only those sequences that are applicable to the COI region we have amplified.

We might be better off starting at the end, and tackling point 6 about selecting only the portions of these raw data that align to your primers. If you do that first, it's quite possible that a lot of the length, ambiguous nucleotide, and homopolymer quality-control issues would turn out to affect only the parts of the COI fragment you aren't even keeping (as a lot of these issues arise at the ends of fragments). Nevertheless, restricting COI regions by primer can get tricky: a wise rotifer biologist once pointed out to me that it's possible to have a sequence in a database that was amplified by the same primers you're trimming with, but before that reference sequence was submitted, the researcher trimmed off the primers - see this post for more details. As a result, you can lose otherwise good data by requiring particular primer regions! There's also the issue of how specific a database should be, and whether or not a single primer set is useful for your own research. Perhaps it would be more sensible to expand on that one primer boundary and include a handful of other primers as well that other researchers might use in neighboring regions. In my work with arthropod COI data, for example, a series of alternative forward and reverse primers exist, all of which might be used by different researchers to amplify different COI regions.
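
For reference, the primer-based trimming approach discussed above is what q2-feature-classifier's extract-reads function does. A sketch using the ANML primers might look like the following (primers are written 5'->3'; the length cutoffs are illustrative and should be tuned to your amplicon):

## hypothetical primer-based trimming of the dereplicated references
qiime feature-classifier extract-reads \
  --i-sequences bold_derep1_seqs.qza \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer GGWACTAATCAATTTCCAAATCC \
  --p-min-length 100 \
  --p-max-length 300 \
  --o-reads bold_derep1_anml_extracted_seqs.qza

Just remember the caveat above: any reference that had its primer regions trimmed off before submission will be dropped by this approach.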

In fact, that same database wizard biologist offered a totally different way of thinking about parsing these data: rather than trimming by primer sequence, we figure out where the regions of interest are based on alignment. This takes longer, but it retains those instances where our reference sequences are missing the primer bits while keeping everything else in the target region. This work was described in full detail for processing a SILVA dataset outside of RESCRIPt - in particular, we're stealing all the good stuff from step 7.

Unfortunately for us, unlike that SILVA workflow, we don't have any alignment file to start with, so we're going to have to throw some serious computational power at this to get it to work. We'll take the following approach:

  1. Remove any degapped sequences with 5 or more ambiguous nucleotides, or with homopolymer runs of 12 or more bases. Notably, I'm starting with the degapped fasta file that was exported from the bold_degapSeqs_min1.qza file. The RESCRIPt developers will likely add a parameter to do this work directly on the .qza file in the near future.
  2. Of the remaining sequences, remove any shorter than 250 bp and longer than 1600 bp.
  3. Dereplicate these data.
  4. Perform a massive alignment of all of the remaining, dereplicated sequences.
  5. Use a particular primer pair to identify the boundaries that we want to retain among our COI sequences. Then use these boundary coordinates to trim our dataset to those regions.
  6. Remove gaps again from the previously aligned (and now primer-region-specific) sequences, and filter short sequences by length.
  7. Dereplicate a second time (because we care about just this region now)!

If that all works, we should have a nice COI database ready for classification (as well as ready for further evaluation within RESCRIPt!).

Steps 1-2: Quality control filtering

The first step is to create a fasta file with prefix/suffix ambiguous bases removed. At the moment, I had to do this manually outside of QIIME (but stay tuned for potential parameters to do this within qiime rescript degap-seqs!). I then import that file back as a .qza object; because we no longer have gaps in our data, we change the --type too.

Recall that the file dna-sequences.fasta was obtained by exporting the bold_degapSeqs_min1.qza object earlier.

## remove prefix N's (awk), then suffix N's (rev/sed/rev)
cat ./degap_tmpdir/dna-sequences.fasta | paste - - | \
awk -v OFS="\t" '{gsub(/^[N]+/,"",$2);print}' | \
rev | sed 's/^N*//' | \
rev | tr '\t' '\n' > bold_outerNtrimmed.fasta

qiime tools import \
  --input-path bold_outerNtrimmed.fasta \
  --output-path bold_outerNtrimmed.qza \
  --type 'FeatureData[Sequence]'

We can now accomplish the first two steps: (1) remove sequences with excessive numbers of ambiguous nucleotides and/or homopolymers, and (2) filter by length. Both of these are accomplished by a pair of RESCRIPt commands:

qiime rescript cull-seqs \
    --i-sequences bold_outerNtrimmed.qza \
    --p-num-degenerates 5 \
    --p-homopolymer-length 12 \
    --o-clean-sequences bold_ambi_hpoly_filtd_seqs.qza

qiime rescript filter-seqs-length \
    --i-sequences bold_ambi_hpoly_filtd_seqs.qza \
    --p-global-min 250 \
    --p-global-max 1600 \
    --o-filtered-seqs bold_ambi_hpoly_length_filtd_seqs.qza \
    --o-discarded-seqs bold_ambi_hpoly_length_discarded_seqs.qza

There are now 3,464,989 total sequences

The bold_ambi_hpoly_length_filtd_seqs.qza object has sequences filtered for ambiguous nucleotide content, homopolymers, and a minimum and maximum length. We're getting somewhere!

Step 3 - Dereplicating (part 1)

We're dereplicating these data with RESCRIPt's dereplicate function in "super" mode. In brief, we are applying a Least Common Ancestor (LCA) method while giving preference to the taxonomic labels that are represented most often among identical sequences. For example, say we had 4 identical sequences in our original dataset, with the following taxonomies:

>100;tax=k__Animalia;p__Annelida;c__Polychaeta;o__Phyllodocida;f__Nereididae;g__;s__
>101;tax=k__Animalia;p__Annelida;c__Polychaeta;o__Phyllodocida;f__Nereididae;g__Neanthes;s__Neanthes fucata
>102;tax=k__Animalia;p__Annelida;c__Polychaeta;o__Phyllodocida;f__Nereididae;g__Neanthes;s__Neanthes fucata
>103;tax=k__Animalia;p__Annelida;c__Polychaeta;o__Phyllodocida;f__Nereididae;g__Neanthes;s__Neanthes fucata

What would happen in this case is we would retain the full taxonomic description, including the species-rank information, because the majority of the sequences contain those identical labels. The assumption we're making is that the first sequence (>100) was an incomplete description. If we didn't prefer the more frequent label, then the final record would lose genus- and species-rank information. It's also true that if the majority of sequences lack information at a given rank, then that rank information is discarded even for the identical sequences that do contain it.

For instance, in the following scenario:

>201;tax=k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Scarabaeidae;g__Serica;s__
>202;tax=k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Scarabaeidae;g__Serica;s__
>203;tax=k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Scarabaeidae;g__Serica;s__
>204;tax=k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Scarabaeidae;g__Serica;s__Serica porcula

we would discard the species information and retain a record that contained only the genus name.

On to dereplicating these data. This should cut our dataset down by at least half, easing the burden on the subsequent multiple sequence alignment process. The output sequence object, bold_derep1_seqs.qza, is exported and that fasta file is then used to perform the multiple alignment step. Note that the --p-derep-prefix flag is another trick to reduce redundancy: in this case, if one sequence is an exact prefix of another sequence, we keep the longer of the two sequences. FWIW, when I didn't use this flag the first time, I retained about 500,000 more sequences than the second time I ran the script with it. When you have to do a pairwise alignment of every sequence, every little bit counts (squared).

qiime rescript dereplicate \
--i-sequences bold_ambi_hpoly_length_filtd_seqs.qza \
--i-taxa bold_rawTaxa.qza \
--p-mode 'super' \
--p-derep-prefix \
--p-rank-handles 'greengenes' \
--o-dereplicated-sequences bold_derep1_seqs.qza \
--o-dereplicated-taxa bold_derep1_taxa.qza

Dereplicating these data reduced the number of sequences from 3,464,989 to 1,718,762.

minor point: adding the --p-derep-prefix flag makes a substantial difference. I ran the same command without it and we retain over 2.2 million sequences instead of 1.7 million.

One teeny thing I noticed only after going through this workflow a few times and getting errors in downstream evaluations: you may want to perform an additional filter of those dereplicated sequences so that you don't retain sequences that contain very little taxonomic information. For example, if you look into the taxonomy strings associated with the bold_derep1_taxa.qza object, you'll find that there are 32 sequences with this taxonomic info:

tax=k__Animalia;p__;c__;o__;f__;g__;s__

notably, the only data missing Phylum-rank information are within the Animal domain/kingdom

Depending on your own database inputs, it might be the case that there is a fair bit of missing taxonomic information remaining. In our case, it's relatively minimal, but as an example, we can run the following code to remove those sequences that lack Phylum-rank information. Note that I didn't actually apply this filter - I'm just showing it so that you have a sense of what actions to take if needed.

## filter taxonomy file
qiime rescript filter-taxa \
--i-taxonomy bold_derep1_taxa.qza \
--p-exclude 'p__;' \
--o-filtered-taxonomy bold_derep1_PhylumFiltd_taxa.qza

## filter sequences file
qiime taxa filter-seqs \
--i-sequences bold_derep1_seqs.qza \
--i-taxonomy bold_derep1_taxa.qza \
--p-exclude 'p__;' \
--o-filtered-sequences bold_derep1_PhylumFiltd_seqs.qza

Step 4 - (Multiple!) multiple sequence alignments

My goal was to create a COI database that was bounded by a particular primer sequence. For instance, in my animal diet work, our lab uses ANML primers, which create amplicons of around 180 bp. There are lots of other COI primer sets out there, and most are derivations to amplify a sequence bound within one of the original primer sets for COI work (a.k.a. the Folmer primers). The goal here is to identify the coordinates where the primers align to the reference sequences, then remove any up and downstream sequence bits outside of the primer pair of interest. In this example, we'll use the ANML primer pair, but you can use whatever pair of primers you want!

The process consists of a series of multiple sequence alignment (MSA) steps, beginning with the dereplicated sequences created from the previous step - bold_derep1_seqs.qza. The first step requires aligning a subset of high quality sequences to themselves. Next, the primers of interest are added to this subsetted reference alignment. Finally, we create a giant alignment by adding in all the remaining sequences from our original reference set (except those seqs that were subset out to create the initial small alignment). When all that alignment work is done, we'll move on to removing all of the up/downstream sequences, remove the gaps in the aligned sequences, and generate a reference dataset with only sequence information specific to the primers we care about. Nice!

Step 4a: Aligning a subset of sequences from our giant reference dataset

This first part is critical, and I struggled to get this right the first half dozen tries. The goal is to first identify a subset of high quality sequences that will align to each other without producing a huge number of gaps. Given that the majority of COI sequences are from insects, and are usually around 658 bp, I started by exporting the bold_derep1_*.qza files from the previous step, then filtering those dereplicated fasta sequences. The options below can certainly be achieved in QIIME, but I leaned heavily on the seqkit package to perform these operations:

First, export the bold_derep1_seqs.qza and bold_derep1_taxa.qza objects:

qiime tools export --input-path bold_derep1_seqs.qza --output-path tmpdir_boldFullSeqs
qiime tools export --input-path bold_derep1_taxa.qza --output-path tmpdir_boldFullTaxa

Next, filter for min/max length:

seqkit seq --min-len 660 --max-len 1000 -w 0 ./tmpdir_boldFullSeqs/dna-sequences.fasta > boldFull_lengthFiltd_seqs.fasta
grep -c '^>' boldFull_lengthFiltd_seqs.fasta

669,615 seqs remain

Now we will search back through our taxonomy file to identify only those FeatureIDs labeled as insects that also contain complete species descriptions. We then select just those length-filtered sequences listed in the insect_species.list file:

## generate the .list
grep 'p__Arthropoda;c__Insecta' ./tmpdir_boldFullTaxa/taxonomy.tsv | \
grep -v 's__$' | grep -v 's__.*sp.$' | grep -v '-' | cut -f 1 > insect_species.list
## filter the sequences to include only those in the .list
seqkit grep --pattern-file insect_species.list boldFull_lengthFiltd_seqs.fasta > boldFull_lengthNtaxaFiltd_seqs.fasta
grep -c '^>' boldFull_lengthNtaxaFiltd_seqs.fasta

There are 510,105 FeatureIDs in the insect_species.list file; 246,650 of these also pass the length filter from above (246,650 seqs retained).

We'll perform one final filter where we retain only sequences with non-ambiguous bases (must be only A|C|G|T):

seqkit fx2tab -n -i -a boldFull_lengthNtaxaFiltd_seqs.fasta | awk '{if ($2 == "ACGT") print $1}' > nonAmbig_featureIDs.txt
seqkit grep --pattern-file nonAmbig_featureIDs.txt boldFull_lengthNtaxaFiltd_seqs.fasta -w 0 > boldFull_lengthNtaxaNambigFiltd_seqs.fasta
grep -c '^>' boldFull_lengthNtaxaNambigFiltd_seqs.fasta

228,351 seqs remain

We'll randomly subsample those filtered sequences down to just 2,000:

seqkit sample --rand-seed 101 --number 2000 --two-pass boldFull_lengthNtaxaNambigFiltd_seqs.fasta -w 0 > bold_mafftRefs_subseqs_seqs.fasta

CAUTION: Heads up on that last step!! This was the part that really threw me for a loop, and it likely takes some trial and error depending on your primer set and reference sequences. It may be that subsetting too many or too few random sequences incorporates too much variability in your alignment, and messes with the subsequent mapping of primers to the MSA file (which you're about to make in the next step). For example, when I subsetted just 1,000 sequences, I noticed that my primers wouldn't align to the MSA file properly. When I subsetted 10,000 sequences, it gave a different set of primer mapping coordinates that were also incorrect (more on this in the next section). The sweet spot I found for this subsample was 2,000 sequences. But it might have nothing to do with the particular number of random sequences you collect either - it might be that including sequences of certain lengths is useful or problematic. Or it could be because of the sequence diversity among the group (maybe don't try to align cnidarians with hemipterans in your few samples, for instance). The general point I'd raise here is that if you struggle to get your primers to align properly, consider any or all of the filtering steps above. Maybe refine the taxonomic representation (maybe just do it with a single Family or Genus, for example). Maybe change the length of sequences. The goal is always going to be to find somewhere over (at least) 100 high quality reference sequences that you trust will (1) align with few gaps, and (2) contain both of the primer sequences you're going to align. Unfortunately, there's no magic trick here. Just trial, error, coffee, more error, and a bunch of silent (or vocal) cursing.

We'll create an initial alignment of just these 2,000 reference sequences using MAFFT (I installed v7.471). Also note that it's worth your while to specify the temporary directory where mafft will put all the files. It won't matter for just these first 2,000 sequences, but it's possible that you can max out the default /tmp folder when you're running bigger alignments in later steps. I just specify it up front here so I don't forget it later :slight_smile:

The process should be really fast - expect to complete the alignment in under a minute if you have more than 12 CPUs at your disposal (though this obviously will vary with the length and number of sequences in your own alignment):

export MAFFT_TMPDIR=$(pwd)   ## change this path to suit your needs
mafft --auto --thread -1 bold_mafftRefs_subseqs_seqs.fasta > reference_MSA

note the output here mentions the method used by --auto:
Strategy:
FFT-NS-2 (Fast but rough)
Progressive method (guide trees were built 2 times.)

We can now use that reference_MSA to guide our next steps.

Step 4b: aligning the small reference set to the primers of interest

We're going to use the ANML primers described above. We'll use the standard 5'->3' orientation for the forward primer, but take the reverse complement of the listed 5'->3' sequence for the reverse primer. Thus, whatever primer you end up using, just make sure to take the reverse complement of the reverse primer's 5'-->3' orientation (if you're really unsure, just add a bunch of possible forward/reverse/reversecomplement orientations to your fasta file and see what aligns!). These primer sequences are put into a fasta file (let's call it anml_primers.fasta) formatted like this:

>PRIMER_ANML-f
ggtcaacaaatcataaagatattgg
>PRIMER_ANML-r
ggatttggaaattgattagtwcc
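
If you'd rather not reverse complement a primer by hand, seqkit can do it for you; here's a sketch where primers_5to3.fasta is a hypothetical file holding your primers in their published 5'->3' orientations:

## reverse complement every record in a (hypothetical) primer fasta
seqkit seq --reverse --complement --seq-type dna -w 0 primers_5to3.fasta > primers_revcomp.fasta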

Now we'll align these primer sequences to our small alignment created in the last step. This step is also very fast, again assuming you have just a pair of primers aligning to around the same number (2000) of sequences as I am aligning. Expect under a minute to complete:

mafft --multipair --addfragments anml_primers.fasta --keeplength --thread -1 --mapout --reorder reference_MSA > ref_primer

the --mapout flag creates an output file, anml_primers.fasta.map, which contains information regarding the coordinate positions of where our primers are located within the small alignment. These values should be sufficient for our subsequent filtering of up/downstream sequences once the giant alignment is created, but we'll run this same option later to double check that the primer coordinate positions haven't changed once the larger alignment is built.

This step is a little strange when you realize that we're going to be essentially duplicating the actions again in a few steps. Why do things twice? It's really a sanity check to confirm that the primers you've selected are actually going to map to the reference sequences you've retained. It's quick to get this info for the small dataset (and takes a lot more effort to do with the giant alignment). Thus, we're aligning our primers against this small set of references to confirm they are indeed mapping in the right location by viewing the contents of that anml_primers.fasta.map file. What do we see in that file? It contains information regarding the primer sequence character (column one), the position of that sequence character internal to the primer sequence itself (column two), and the position of that sequence character relative to the alignment file (column three). It looks like this (I've added the ... to shorten the view to include just the first and last pair of nucleotides in the primer alignment):

>PRIMER_ANML-f:
# letter, position in the original sequence, position in the reference alignment
g, 1, 17
g, 2, 18
...
g, 24, 40
g, 25, 41
>PRIMER_ANML-r:
# letter, position in the original sequence, position in the reference alignment
g, 1, 249
g, 2, 250
...
c, 22, 300
c, 23, 301

In other words, our forward primer, PRIMER_ANML-f, begins at coordinate position 17 of the ref_primer alignment, and ends at position 41. This primer is 25 bases long (the second column starts at 1 and ends at 25), so in this particular case there are no gaps in the alignment within the forward primer (41 - 17 + 1 = 25). Our reverse primer, PRIMER_ANML-r, begins at position 249 of the alignment file and ends at position 301. This primer is only 23 bases long, yet there are 53 positions spanning its length - why? Alignment gaps.
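
If you'd rather not eyeball those coordinates, a quick awk sketch like the one below can pull each primer's first and last alignment positions (and the implied number of gap columns) out of the .map file. It assumes the three-column letter/position/alignment-position layout shown above:

## summarize primer start/end alignment coordinates from the mafft .map output
awk -F', ' '
  /^>/  { name=$1; sub(/^>/,"",name); sub(/:$/,"",name); next }   # primer name lines
  /^#/  { next }                                                  # column header lines
  NF==3 && $3!="-" {
      if (!(name in start)) start[name]=$3                        # first aligned column
      end[name]=$3; plen[name]=$2                                  # last aligned column; primer length so far
  }
  END {
      for (n in start) {
          span = end[n]-start[n]+1
          printf "%s: start=%d end=%d span=%d gap_columns=%d\n", n, start[n], end[n], span, span-plen[n]
      }
  }' anml_primers.fasta.map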

I relied on a Python script, extract_alignment_region.py, to check what would happen if we trimmed our alignment at those primer flanks - would the resulting fragments, once gaps are removed, be the expected length of our amplicon when these specific primers are used? In the command below, we enter coordinate positions for valueStart and valueEnd based on the information in the anml_primers.fasta.map file. Specifically, we want valueStart to be one position greater than the last coordinate of the forward primer, and valueEnd to be one position less than the first coordinate of the reverse primer. In our case, that would be 42 (one past the end of PRIMER_ANML-f) and 248 (one before the start of PRIMER_ANML-r).

The following script is courtesy of a generous rotiferphile (if that's a thing?), Mike Robeson.

python extract_alignment_region.py \
  -i ref_primer \
  -o ref_primer_coordinateTrimmed.fasta \
  -s 42 \
  -e 248

We then take that fasta file, ref_primer_coordinateTrimmed.fasta, remove any remaining gaps, and count the lengths of the remaining sequences. If we trimmed at the right spot, the majority of these sequences should be around 180 bp - the length of the amplicons we generate with the ANML primers, as shown in this COI primer map - when you look at the labels on that map, we're using LepF1-LCO1490 as ANMLf, and CO1-CFMRa as ANMLr.

seqkit seq -g -w0 ref_primer_coordinateTrimmed.fasta | paste - - | awk '{print length ($2)}' | sort | uniq -c

What we find is that indeed, the vast majority of our sequences are exactly 181 bp (1,714 of 2,000 sequences), and about 97% of these subsampled seqs are between 177-181 bp (1,939 of 2,000), so we're good to go! Unfortunately, if you don't get this right, your entire database will be trimmed to a primer coordinate that you aren't expecting. Once you get things aligned to the expected positions and of the expected lengths, you can move on to the giant alignment step. Let's tackle that next!

Step 4c: aligning all the rest of the sequences

Now for the much heavier task: aligning the million+ references to this comparatively tiny alignment of just 2,002 sequences (our initial 2,000 reference sequences plus our two primers). Let's start by filtering the exported copy of the complete reference set (the dna-sequences.fasta exported from bold_derep1_seqs.qza earlier) to remove the 2,000 reference sequences already aligned:

## create a list of the sequences to drop:
grep '^>' bold_mafftRefs_subseqs_seqs.fasta | sed 's/^>//' > droplist.txt

## filter OUT the sequences to include only those NOT in the droplist.txt file
seqkit grep -v --pattern-file droplist.txt ./tmpdir_boldFullSeqs/dna-sequences.fasta > seqsForGiantAlignment.fasta

This subsetted seqsForGiantAlignment.fasta file contains 1,716,762 sequences - exactly 2,000 fewer than we started with. That's what we expect, given that we used 2,000 of these to generate our initial reference_MSA file. We'll now add these 1.7 million sequences to generate a giant alignment!

At long last - to the giant alignment! We align the remaining reference sequences to our small alignment file, ref_primer as follows:

export MAFFT_TMPDIR=$(pwd)   ## change this path to suit your needs (in case you forgot the first time!)
mafft --auto --addfull seqsForGiantAlignment.fasta --keeplength --thread -1 ref_primer > giant_alignment

This step takes a large amount of memory to run. I ended up requiring around 150 GB RAM to get it to complete. The job finished in about 60 minutes using 24 CPUs (I've run the same task on another machine with 8 CPUs and it took about 3 hours to finish, so it seems to scale pretty linearly with the number of threads available). The mafft alignment strategy used was: FFT-NS-full

The output, giant_alignment, contains the information we need to: (1) identify the alignment positions where our primers are bound, and then (2) trim all sequences up and downstream from those positions.

Step 5 - Using the alignment information to create a primer-bound reference dataset

We'll use the MAFFT output giant_alignment file in the next step in two ways: (A) We'll perform a sanity check by aligning those same primers to a subset of sequences from the giant file and gathering the coordinates where they are bound. Recall back in step #4a that we created an output file that contained the coordinate positions of the primers relative to the small alignment - we want to double check that this new alignment has the primers aligning to the same positions. It doesn't matter which subset of sequences we use, because the alignment length is fixed for all sequences. Then (B), we apply a Python script to trim the aligned giant_alignment file based on those coordinate boundaries (rather than the primers themselves!).

Step 5a - Confirmation of primer coordinate positions in giant alignment file

Rename the original output map file, to a new name to avoid overwriting in the next step:

mv anml_primers.fasta.map anml_primers.fasta.map_toSmall

Subset the giant alignment file to retain just the first 2,000 sequences. Fortunately, the seqkit package is savvy enough to deal with line-wrapped fastas (i.e. it doesn't matter if the fasta sequence is on one line or many - you'll get 2,000 unique sequences this way):

seqkit head -n 2000 giant_alignment > subset_giant_alignment

Align those same primer sequences used earlier to the subsetted giant alignment file, subset_giant_alignment. Note we don't actually need to retain the output fasta file subset_giant_alignment_secondaryPrimerAlign - we just want to confirm the coordinate positions of the primer sequences!

mafft --multipair --addfragments anml_primers.fasta --keeplength --thread -1 --mapout subset_giant_alignment > subset_giant_alignment_secondaryPrimerAlign

We can ignore the subset_giant_alignment_secondaryPrimerAlign output - the information we want is in the newly regenerated anml_primers.fasta.map file. Let's rename this output to keep consistent with the naming convention used earlier when we renamed the original .map file:

mv anml_primers.fasta.map anml_primers.fasta.map_toGiant

What we want to investigate are the coordinate positions of those primers in the anml_primers.fasta.map_toGiant file:

>PRIMER_ANML-f:
# letter, position in the original sequence, position in the reference alignment
g, 1, 17
g, 2, 18
...
g, 24, 40
g, 25, 41
>PRIMER_ANML-r:
# letter, position in the original sequence, position in the reference alignment
g, 1, 249
g, 2, 250
...
c, 22, 300
c, 23, 301

Sure enough, nothing has changed - perfect! That's what we expected given that we passed the --keeplength flag in our mafft alignments, but it's always good to be sure.

With our primers confirmed to be aligned to the right coordinates in our large alignment file, we can now move to the final steps of this part of the workflow: trimming any up/downstream bases outside of our primer coordinates.

Step 5b - trimming the giant alignment to include only sequences within our primer coordinates

With the ANML-primer boundaries defined, the giant alignment of all our reference sequences, giant_alignment, can be sliced such that all of our database contains sequences in just the COI region our amplicon data is expected to align to. Once again, the heavy lifting is done via the Python script used earlier, extract_alignment_region.py. We'll use the same coordinate positions for the valueStart and valueEnd positions as before:

python extract_alignment_region.py \
  -i giant_alignment \
  -o bold_derep1_anml_tmp.fasta \
  -s 42 \
  -e 248

and voila! We have a trimmed dataset with a defined boundary, without having to require that any primers be detected within the reference sequences themselves. One last thing before we wrap up this section - remember how we added the primer sequences into this alignment file? We obviously don't want to keep them in our final dataset (they aren't reference sequences, after all!). The neat thing here is that if we trimmed at the correct positions, the primers themselves should be cut out entirely. Let's double check:

grep PRIMER --color -A 1 bold_derep1_anml_tmp.fasta

You'll notice the output looks something like this:

>PRIMER_ANML-r
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--
>PRIMER_ANML-f
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

... just a bunch of gaps - no primers left. Looks like our coordinate trimming worked as expected!

The last thing to do is remove those two primer sequences to have our final alignment file. We'll use seqkit once more to make this fast and simple:

## make a list of sequence headers to drop:
grep PRIMER bold_derep1_anml_tmp.fasta | sed 's/>//' | sed 's/[ \t]*$//' > dropPrimers.txt

## exclude these sequences from the alignment file:
seqkit grep -v -f dropPrimers.txt bold_derep1_anml_tmp.fasta | seqkit seq --upper-case -w 0 > bold_derep1_anml.fasta

There are 1,718,762 sequences in bold_derep1_anml.fasta - good. That's the number of sequences we started with when we were aligning the sequences from the bold_derep1_seqs.qza file!

Step 6 - Removing gaps, filtering by length

We're almost finished with this database design (I promise)! Next up is to remove the gaps from this aligned file, then filter the sequences by length. I was curious how the alignment strategy above would impact the sequence lengths of the remaining fragments. After all, not every sequence in our reference dataset necessarily contains sequence spanning the entire ANML primer region - some might in fact contain none of that region, or just a part. I thus calculated the (degapped) length of each sequence, and came up with a histogram representing the frequency of sequence lengths in the bold_derep1_anml.fasta file.

The calculation was...

grep -v '^>' bold_derep1_anml.fasta | sed 's/-//g' | awk '{print length}' | sort | uniq -c > anml_seq_lengths.freq.table

... and the resulting plot of that anml_seq_lengths.freq.table file, created with this R script, looks like this:
[figure: length distribution of the ANML-trimmed COI sequences]

What becomes apparent when visualizing these results is that the majority (~53%) of our sequences are represented by just two lengths (180 and 181 bp), and about 80% of sequences are represented by fragments between 150 and 180 bp. It's also worth noting that there were 60,664 sequences with a length of zero - clearly we want to filter those out! I found that visualizing these data helped inform what minimum sequence length would be appropriate for the following command. I chose 170 bp to balance retaining as many distinct sequences as possible against keeping only near-full-length sequences. At 170 bp, I still retain about 69% of all sequences (over 1.25 million!).

Start by importing this file back into QIIME (as all remaining tasks will occur in that environment). Then remove gaps and filter sequences by length:

qiime tools import \
  --input-path bold_derep1_anml.fasta \
  --output-path bold_derep1_anml_seqs_aligned.qza \
  --type 'FeatureData[AlignedSequence]'

qiime rescript degap-seqs \
   --i-aligned-sequences bold_derep1_anml_seqs_aligned.qza \
   --p-min-length 170 \
   --o-degapped-sequences bold_derep1_anml_seqs_nogaps.qza

Following the length-filtering of the gap-removed sequences, we have 1,182,895 reference COI sequences remaining

We now have one step to go - using the bold_derep1_anml_seqs_nogaps.qza file as input for a second dereplication step.

Step 7 - Dereplicating (part 2)

Wait - didn't we do this already? The idea behind a second dereplication is that two records might have identical sequences within our now ANML-primer-trimmed boundary, but initially had different sequence composition outside of that boundary. For example, you can imagine an initial pair of sequences that looked like:

       | <---- ANML region -----> | <--- trimmed region ---> |
seq1:  AATTGGCCAATTCCGGAATTCCGGCCGGAAAATTTTCCCCGGGGATATATATCGC
seq2:  AATTGGCCAATTCCGGAATTCCGGCCGGAAGGGGGGGAAAAAAAACCCCCCTTTT

The two sequences are clearly different across their full lengths, but once we slice out only the region within the primer boundaries, there's a chance that we now have duplicate sequences. It's possible you want to keep these duplicate sequences if they represent different taxonomies - see the subsection on "Dereplication considerations" in the dereplication of sequences and taxonomy portion of the SILVA RESCRIPt tutorial for commentary on why you might do this. In my case, I'm opting for a single consensus taxonomy for each unique sequence, so I am going to dereplicate using the same strategy as before:

qiime rescript dereplicate \
  --i-sequences bold_derep1_anml_seqs_nogaps.qza \
  --i-taxa bold_rawTaxa.qza \
  --p-mode 'super' \
  --p-derep-prefix \
  --p-rank-handles 'greengenes' \
  --o-dereplicated-sequences bold_anml_seqs.qza \
  --o-dereplicated-taxa bold_anml_taxa.qza

Once this dereplication step is complete, we have 739,345 unique COI sequences. That's it, we're at the end of the database construction section - we made it! Considering that we started with over 3 million COI sequences from BOLD, clearly all these filtering steps have made an impact.

The road(s) ahead

What work remains to be done?

  1. While this database workflow accessed only BOLD references, it's quite likely that incorporating references from alternative resources would improve our classifiers. Shameless plug: I explored just this question in a recent paper. Fortunately, the RESCRIPt crew has developed a tool, get-ncbi-data, to pull NCBI GenBank data right into a QIIME-formatted database (see the sketch at the end of this section). It's my intention to expand this tutorial to show how we might combine both BOLD and NCBI references into an improved COI database.

  2. Obviously, we haven't even touched on any of the evaluation steps available within RESCRIPt! I'm still working on these tests at the moment and will update this document to reflect those findings. Fortunately, all the tests rely specifically on RESCRIPt commands - provided you have a database ready to go, just follow the evaluation functions in the RESCRIPt tutorial (sketches of two of them appear below). Critically, one of these functions evaluates a naive Bayes classifier and produces a classifier file in the process! Thus far, I've created just a single classifier for the bold_anml dataset, but plan on creating one for the bold_derep1 dataset also. Please check back for updates on this progress.
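To give a flavor of both items above, here are two hedged sketches. First, a possible get-ncbi-data call to pull COI records for a taxon of interest (the query string and output names are illustrative only - check the RESCRIPt documentation for the exact query syntax you need):

# pull arthropod COI records from GenBank (txid6656 = Arthropoda; query is illustrative)
qiime rescript get-ncbi-data \
  --p-query 'txid6656[ORGN] AND (COI[Gene] OR CO1[Gene] OR COX1[Gene])' \
  --o-sequences ncbi_coi_seqs.qza \
  --o-taxonomy ncbi_coi_taxa.qza

And second, two of the RESCRIPt evaluation functions applied to the finished database (again, the output names are placeholders):

qiime rescript evaluate-seqs \
  --i-sequences bold_anml_seqs.qza \
  --o-visualization bold_anml_seqs_eval.qzv

qiime rescript evaluate-taxonomy \
  --i-taxonomies bold_anml_taxa.qza \
  --o-taxonomy-stats bold_anml_taxa_eval.qzv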

The classifier was created as follows:

qiime rescript evaluate-fit-classifier \
  --i-sequences bold_anml_seqs.qza \
  --i-taxonomy bold_anml_taxa.qza \
  --p-reads-per-batch 6000 \
  --p-n-jobs 6 \
  --output-dir fitClassifier_boldANML

This was the most memory-intensive process of the entire workflow! The job completed in about three days using approximately 140 GB of RAM.

The bold_anml_classifier.qza file is available here


:partying_face: !! additional file Update!! :partying_face:

I recently created a classifier file using the untrimmed COI sequences for arthropod taxa only - this is saved as bold_full_ArthOnly_classifier.qza, and I've hosted it here. Just a heads-up - it's a big file (~500 MB)!

13 Likes

This is amazing! Thank you so much for all the effort, explanations and sharing your ideas with us. I am really excited.

At the moment, I am trying to produce a classifier for the bold_derep1 dataset using the qiime rescript evaluate-fit-classifier command, as indicated at the end of the tutorial. However, I keep running into memory errors over and over again:

“Unable to allocate 11.8 GiB for an array with shape (8192, 192871) and data type float64”.

I am using VirtualBox, which already has a base memory of 53091 MB… Does anyone know how I can improve this process so the classifier can be created?

Thanks in advance!

Margreet

1 Like

Hi @Margreet,
I am very glad to hear that @devonorourke's tutorial and rescript are useful to you!

Unfortunately, training a classifier (whether with RESCRIPt or q2-feature-classifier, which RESCRIPt wraps as part of the classifier training step) is very memory intensive since BOLD COI is a massive database! As @devonorourke indicated in his tutorial, the maximum memory use was much higher than your VirtualBox memory allocation:

So my advice would be to use @devonorourke's pre-trained classifiers if at all possible. You can look at the data provenance to see what scikit-learn version was used to train the classifier; you will need to use the classifier in a QIIME 2 environment with an equivalent version.
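For reference, using one of those pre-trained classifiers would look something like this (a sketch; rep-seqs.qza stands in for your own representative sequences, and the batch size is only illustrative):

qiime feature-classifier classify-sklearn \
  --i-classifier bold_anml_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-reads-per-batch 1000 \
  --o-classification bold_anml_taxonomy.qza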

If your primers only amplify a particular clade, you could also filter your database to contain only that clade, reducing the size and complexity of the database by removing sequences that would not amplify (a sketch of this kind of filtering follows below). Don't do this to remove sequences that are not "of interest" but could still be amplified; otherwise you could get misclassifications if any of these appear in your query sequences.
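As a sketch of that kind of clade-level filtering, using Arthropoda as an example (and assuming the dereplicated files are named bold_derep1_seqs.qza and bold_derep1_taxa.qza; output names are placeholders):

# keep only arthropod reference sequences, then subset the taxonomy to the retained IDs
qiime taxa filter-seqs \
  --i-sequences bold_derep1_seqs.qza \
  --i-taxonomy bold_derep1_taxa.qza \
  --p-include p__Arthropoda \
  --o-filtered-sequences bold_derep1_arthro_seqs.qza

qiime rescript filter-taxa \
  --i-taxonomy bold_derep1_taxa.qza \
  --m-ids-to-keep-file bold_derep1_arthro_seqs.qza \
  --o-filtered-taxonomy bold_derep1_arthro_taxa.qza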

You could also use one of the other methods in q2-feature-classifier for taxonomic assignment, eliminating the need to train a classifier (though I can't guarantee that you won't still have issues with memory use — BOLD COI is huge!)

Or you could do what I believe @devonorourke had to do to train the classifier: access a higher-memory machine, either on a compute cluster (if you have access to one) or in the cloud (e.g., AWS).

Let us know if any of those options will work for you!

2 Likes

Hi @Margreet,
I’m just about finished with creating a classifier for the bold_derep1 dataset. At the moment I have already completed classifier files for:

  • only arthropods
  • only chordates

However, I'm still waiting on a job to finish for the full BOLD dataset. It's been running for about 5 days now and has required over 100 GB of RAM, so I'm skeptical that you'll be able to complete the task on a virtual machine with just ~50 GB.

Just keep sending me reminders about sending you the file, and I’ll keep checking to see when it’s done.

By the way, @Nicholas_Bokulich and the other RESCRIPt developers have a manuscript coming up that is going to discuss whether it's even useful to use the untrimmed COI sequences (like those in the bold_derep1 files) versus trimming to your specific amplicon region, as I've done with the ANML classifier. The short answer for COI appears to be that you shouldn't use the untrimmed sequences… Can you tell me what primers you're using to generate your COI amplicons? And, if possible, what your expected COI taxonomy will be?

3 Likes

Hi @Nicholas_Bokulich,

Thank you very much for your quick and clear response!

I was already a little afraid of that. I tried lowering --p-reads-per-batch (started with 6000, ended with 500), but the process is still too memory-intensive for the VirtualBox machine I am using.

At this point, I am thinking about filtering my database to reduce its size and complexity by removing sequences that would not amplify with the primers I used. I realize that it is a bit of a tricky process, so at the same time I think I am going to look for a higher-memory machine so these kinds of database struggles will be less of an issue in the future.

Thanks again for all the effort. It is really appreciated!

Margreet

Hi @devonorourke,
Thank you for your response and effort! It is great to hear that you are already working on creating a classifier for the bold_derep1 dataset. Since you asked me to send you quick reminders: can you check whether it is finished yet? :wink:

Also, @Nicholas_Bokulich told me that it seems impossible to create the classifier on the computer or the VirtualBox machine I am using at the moment. So I am thinking about filtering my database to reduce its size and complexity by removing sequences that would not amplify with the primers I used. I realize that it is a bit of a tricky process, so at the same time I am thinking about looking for a higher-memory machine so these kinds of database struggles will hopefully be less of an issue in the future.

Very interesting to hear about the discussion of whether it's better to use the untrimmed COI sequences or only the specific amplicon region… I think my main concern is that trimming the database down too much might exclude relevant sequences. What would be the number one reason not to use the untrimmed COI sequences, if you don't mind me asking? The primers I used to generate my COI amplicons are degenerate versions of the Folmer primers (jgLCOI1490_CS1-mlCOIintR_CS2 and mlCOIintF_CS1-HCO2198_CS2).

Please let me know if you have any questions, and thanks again!

Margreet

A couple of things. First, primer trimming is useful because it avoids off-target alignments. These really should be quite rare, but by restricting your database to only the portion of the references spanned by your amplicons, you avoid the possibility of classifying some OTU based on a hit to a region of a full-length reference that your amplicon wasn't actually derived from.

However, I think the possibility of generating false positives is the more realistic reason to do this. @SoilRotifer likely has a much more refined take on this, and can perhaps give better insight into why this is a useful idea (or not!?). To see why these false positives might happen, let's imagine you use a classifier built on untrimmed sequences, and that the region of your amplicon is invariant across multiple distinct taxa.

Say you have an ASV or OTU that you want to classify. There is a pair of reference sequences that it will align perfectly to (or, in the case of a k-mer-based approach, will look identical to). Those reference sequences look like this:

       | <--------------- full length of reference ------------> |
       | <---- primer region -----> | <--- untrimmed region ---> |
ref1:  AATTGGCCAATTCCGGAATTCCGGCCGGAAAATTTTCCCCGGGGATATATATCGC
ref2:  AATTGGCCAATTCCGGAATTCCGGCCGGAAGGGGGGGAAAAAAAACCCCCCTTTT

When you build any classifier, we recommend that you first dereplicate sequences. You can do that with full-length or with primer-trimmed sequences. If we build a classifier with the full-length sequences, ref1 and ref2 are not identical across their full length, so they wouldn't be collapsed during dereplication.

But you aren't generating amplicons spanning those full-length sequences. You are classifying sequences based only on the region within the primer boundaries, and within that region, those two reference sequences are identical.

If we run our samples through a classifier that uses an alignment approach, and our OTU/ASV matches these sequences, then both would be considered top hits. At this point, if you simply took the first item in that list, you might retain the first species-level taxon and think that your OTU was really that species, when in fact it's just as likely to be another organism with a different taxonomic identity, despite having the same sequence composition across the amplicon.

But that's not necessarily the case - it depends on your classification parameters! With QIIME, you could also apply a least common ancestor (LCA) process to these data: by retaining both top hits and then applying an LCA, you avoid that issue - the classifier would keep only the shared genus (say, Myotis) and leave the species label of your OTU/ASV as "unassigned" (as it should).
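For a concrete sense of what that looks like in practice, a consensus/LCA-style classification with q2-feature-classifier might look something like this (a sketch; rep-seqs.qza is a placeholder for your own query sequences, and the parameter values are only illustrative):

qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads bold_anml_seqs.qza \
  --i-reference-taxonomy bold_anml_taxa.qza \
  --p-maxaccepts 10 \
  --p-perc-identity 0.97 \
  --p-min-consensus 0.51 \
  --o-classification vsearch_lca_taxonomy.qza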

I'm entirely unclear on what the consequences would be when training the classifier through sklearn - perhaps @Nicholas_Bokulich or others could indicate how the classifier would treat trimmed or untrimmed references, and how this might impact classification.

Interested to hear others thoughts - thanks for the questions @Margreet!

1 Like

Hi @devonorourke!
Would you be able and willing to check whether the file is finished and send it to me when it is? I am also really curious about the completed arthropod-only classifier. Please let me know, and thanks again for all the effort!

If you have any questions, don't hesitate to ask of course!

Margreet

Hey @Margreet

Unfortunately, I don't think the full classifier of all BOLD sequences is going to be available in the next few days - the server I was running the job on is about to shut down for maintenance, and the job still hadn't finished after 16 days :snail:

However, I did have an already-finished COI classifier built from the untrimmed arthropod sequences, and I've posted it to this forum. See the bottom of the original post for the file link. Let me know how it goes!

2 Likes

Hi Devon

I've followed the first few steps up to importing the data into QIIME and am presumably missing something: you import the sequences as 'FeatureData[AlignedSequence]', but this fails for me because the sequences aren't yet aligned (they are not all the same length). Is there a step I'm missing where you align the sequences prior to importing them, trim the sequences, or remove sequences that aren't the most common length?

Thanks, this will be a fantastic tool!
Dave

Hey @hemprichbennett,

Welcome to the forum :qiime2: :partying_face: !

I don't think you're missing anything - if you're referring to the step where you import the raw BOLD sequences, you have to refer to them as FeatureData[AlignedSequence] because a great many of them contain gap characters (-). If you try to import them as FeatureData[Sequence] instead, you'll get an error indicating that the data contain characters that aren't allowed. So the reason we're importing as that data type is simply to avoid the import error.

However, you're 100% correct - these data are not aligned. We have to do that later on in the workflow. They aren't the same length either (though you can see in this tutorial that a great many of them fall within a very similar length range). All the steps (and more!) you're curious about regarding trimming and filtering sequences happen after these raw BOLD sequences are imported.

Let me know if something in the workflow isn't working for you. Thanks, and good luck.

2 Likes

Hi Devon,

Great, thanks for the welcome and thanks for clarifying! I've now isolated the problem: QIIME 2 2020.8.0 doesn't allow the fasta file to be imported as FeatureData[AlignedSequence], although QIIME 2 2019.10.0 does. As it appears that the 2019 version is currently required for using RESCRIPt, I guess it's unlikely that many others will encounter this error, but it's perhaps worth noting that it could be an issue as time goes on. When trying to use QIIME 2 2020.8.0 I get the error message:

There was a problem importing bold_rawSeqs_forQiime.fasta:

bold_rawSeqs_forQiime.fasta is not a(n) AlignedDNAFASTAFormat file:

The sequence starting on line 38 was length 618. All previous sequences were length 659. All sequences must be the same length for AlignedDNAFASTAFormat.

I have two other minor issues.

  1. I'm on macOS and guess the tutorial was written for Linux? For the early bash loop to work I had to change the cut command from cut -f 1 badseqs.txt -d ':' to cut -f 1 -d ':' badseqs.txt. Since the Linux cut syntax is a bit more relaxed, it could be worth changing that?
  2. The --input-format parameter on the taxonomy import step is currently commented out :wink:

Cheers, I’ll continue through the tutorial now :slight_smile:

Thanks for the :mag: investigative work @hemprichbennett,
Perhaps @Nicholas_Bokulich or @thermokarst might know what’s going on with the discrepancy between what I was doing earlier and this error message with the most recent QIIME 2 release. In any event, glad you’re getting it to work with an earlier version.

Could you clarify your second point/minor issue? Were you getting an error message if you didn’t uncomment that line of code? I don’t recall that line being required. In other words, this should be sufficient, right?

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-path bold_rawTaxa_forQiime.tsv \
  --output-path bold_rawTaxa.qza

Just to be clear, I’ve removed the #--input-format HeaderlessTSVTaxonomyFormat line…

For me it actually requires the input format to be specified; if I remove that line or keep it commented out, then I get the error:

There was a problem importing bold_rawTaxa_forQiime.tsv:

bold_rawTaxa_forQiime.tsv is not a(n) TSVTaxonomyFormat file:

[‘Feature ID’, ‘Taxon’] must be the first two header values. The first two header values provided are: [‘10001757’, ‘tax=k__Animalia;p__Annelida;c__Polychaeta;o__Phyllodocida;f__Nereididae;g__;s__’] (on line 1).

So it looks like by default on my machine it expects a header in the file. I guess the default input format may differ between versions of QIIME 2?

Thanks for the sanity check. You were right all along - it's just how I ended up creating this taxonomy file (it has no header, so QIIME expects me to pass the parameter I had commented out to indicate that the incoming file has no header). You can import a taxonomy file without that parameter, but only if it starts with the two-column header line QIIME expects:

Feature ID    Taxon
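For a headerless taxonomy file like mine, the full import (with the previously commented-out parameter restored) looks like this:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path bold_rawTaxa_forQiime.tsv \
  --output-path bold_rawTaxa.qza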

I updated the documentation to reflect that the parameter is required. Thanks for all your troubleshooting! Keep it up (though hopefully, no more troubles :slight_smile: )!

Sure thing!

Prior to the 2020.8 release, the import validation for FeatureData[AlignedSequence] only looked at the first 5ish lines of the file. As of 2020.8 we now have implemented robust validation for these types of data.

1 Like

Thanks @thermokarst,
What I'd like to be able to do is import sequences that are not aligned but do contain gap characters. When retrieving reference sequences from BOLD, it's possible for a sequence to contain one or more gap characters, like this:

> seq1
AAT-CGAAACAGA-TC

What's awful is that they aren't actually aligned! So, in order to use the fancy new tools available in RESCRIPt, I need to be able to import these raw sequences. Trying to import them as regular FeatureData[Sequence] will fail, correct? Is there an option that allows gap characters to be tolerated as sequence data?

Certainly, it's straightforward to remove these with other tools prior to importing into QIIME, but I wanted to check whether I could get these into QIIME straight from the BOLD downloads.

Thanks!

Since you are already doing some pre-processing of the seqs, maybe just add degapping to the list (with tr, sed, awk, whatever) instead of importing and degapping with RESCRIPt. After all, as you say, these are not really alignments, just seqs with gaps.

Of course, doing it all in RESCRIPt would be great, so that the processing steps are stored in provenance... but:
a) this is really an issue specific to BOLD
b) you are doing other pre-processing steps that are also specific to formatting BOLD data

so a bit of degapping prior to importing would not hurt.
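For what it's worth, that degapping could be as simple as something like this (a sketch; the file names are placeholders, and it assumes gap characters never appear in header lines):

# strip gap characters from sequence lines, leave fasta headers untouched
awk '/^>/ {print; next} {gsub(/-/,""); print}' bold_rawSeqs.fasta > bold_rawSeqs_nogaps.fasta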

The best solution would be to eventually get a BOLD-specific formatting step (e.g., create a BOLD fasta format with the appropriate transformers such that the seqs and taxonomy are automatically formatted when any BOLD seqs are imported). :wink:

1 Like

I suppose my preference is to get data into QIIME in its earliest form so that provenance tracking can be retained, like you suggested. With that in mind, I guess the trade-off in question is provenance versus uniformity. It would be great if these raw sequences could be imported into QIIME, perhaps with a warning message noting that some characters may cause downstream errors, rather than an outright error preventing their import. Specifically, FeatureData[Sequence] could allow these gap characters, even if the files aren't actually alignments. That would resolve my BOLD-specific issue, but perhaps it could also help others should they import data with gaps from other sources?

2 Likes

I'm making my way through, no major issues to report yet! Having some difficulties with mafft introducing lots of gaps, but that's a problem for next week now :wink:

One minor thing: there's a typo in the step where you remove leading/trailing ambiguous nucleotides. You use grep on ./degap_tmpdir/dna-sequences.fasta/dna-sequences.fasta when it should be ./degap_tmpdir/dna-sequences.fasta.

I have a general question about the overall logic of the pipeline: I understand the utility of trimming to the primer region of interest, but I wonder how well this will work for assigning non-target taxa. In our case, when amplifying bat diet with arthropod-specific primers we'll sometimes amplify some bat or fungal DNA even though there isn't a perfect binding site. To assign these sequences as bat we would need to have some mammal sequences in our reference library, but I assume that trimming mammal DNA in silico using arthropod-specific primers won't work, as the primers won't match the template. However, it also seems problematic to include full-length mammal Folmer regions in a database of heavily trimmed arthropod sequences. Just wondering your thoughts on this?

Thanks very much
Dave