Using RESCRIPt to compile sequence databases and taxonomy classifiers from NCBI Genbank

This tutorial will describe how to create custom reference databases from NCBI Genbank using RESCRIPt.

Read more about RESCRIPt here: Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt - #7

If using NCBI Genbank data, please be aware of the NCBI disclaimer and copyright notice.

Citation:

If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following pre-print:

  • Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581

Getting data with the get-ncbi-data method

The get-ncbi-data method allows users to automatically download sequences and associated taxonomies from NCBI Genbank. Sequences can be selected using two different options:

  1. The --p-query option will allow using a specific entrez query, e.g., to select specific organisms, genes, BioProjects, or other search fields.
  2. The --m-accession-ids-file option will enable selection of specific accession IDs. This option accepts a metadata file, usually a tab-delimited text file where each accession ID is listed on separate lines, or an artifact that can be viewed as metadata (e.g., an existing sequence or taxonomy file in which the accession IDs are the feature IDs/index).

Getting lots of data with the get-ncbi-data method

If you are downloading lots of data, you should set the --p-n-jobs parameter to a number greater than one. Five works well in most cases. If you use just one job, the download can take a long time and will probably fail at some point. Conversely, if you use too many jobs, that will also cause your download to fail, usually with an error that contains the words "ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Read timed out.".

A large download should also only be started after 9 pm (US) Eastern Time and finish before 5 am
(as per NCBI policy). If you start a large download and do not set --p-n-jobs to something sensible (say, five), you risk finishing outside of that timing window.

Creating a classifier from RefSeqs data

In this example, we will use RESCRIPt to download the NCBI 16S rRNA gene RefSeqs data.

qiime rescript get-ncbi-data \
    --p-query '33175[BioProject] OR 33317[BioProject]' \
    --o-sequences ncbi-refseqs-unfiltered.qza \
    --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

This will by default generate a 7-level taxonomy with domain, phylum, class, order, family, genus, species. Other ranks can be selected with the --p-ranks parameter.

Note that some of the sequences look unusually short (as little as 300 nt!) and probably represent truncated 16S rRNA gene sequences. We can filter these out to focus on full-length sequences:

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences ncbi-refseqs-unfiltered.qza \
    --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \
    --p-labels Archaea Bacteria \
    --p-min-lens 900 1200 \
    --o-filtered-seqs ncbi-refseqs.qza \
    --o-discarded-seqs ncbi-refseqs-tooshort.qza

As some functions in RESCRIPt evaluate taxonomic information, we want to make sure the taxonomies correspond to the nearly full-length sequences we’ve retained. We can filter out the corresponding taxonomies with filter-taxa, using the --m-ids-to-keep-file parameter (which takes a metadata file, including some artifacts, as input) to only retain features found in our filtered sequences file:

qiime rescript filter-taxa \
    --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \
    --m-ids-to-keep-file ncbi-refseqs.qza \
    --o-filtered-taxonomy ncbi-refseqs-taxonomy.qza

Now we can train our classifier and evaluate its accuracy all in one go with evaluate-fit-classifier. This command will fit a classifier (which usually takes a few minutes) and then test that classifier on all of the sequences found in the input, to give a best-case estimate of accuracy (i.e., when all query sequences have one or more known matches in the reference database). This is not, in general, a reliable indicator of accuracy but is a good sanity check to make sure that your classifier is not total garbage (e.g., poor accuracy when the correct answer is known would indicate that the species in the database cannot be easily resolved):

qiime rescript evaluate-fit-classifier \
    --i-sequences ncbi-refseqs.qza \
    --i-taxonomy ncbi-refseqs-taxonomy.qza \
    --o-classifier ncbi-refseqs-classifier.qza \
    --o-evaluation ncbi-refseqs-classifier-evaluation.qzv \
    --o-observed-taxonomy ncbi-refseqs-predicted-taxonomy.qza

The outputs consist of:

  1. ncbi-refseqs-classifier.qza is the trained classifier, ready to be used to classify some real query sequences! (this classifier is identical to the naive Bayes taxonomic classifier output by qiime feature-classifier fit-classifier-naive-bayes).
  2. ncbi-refseqs-classifier-evaluation.qzv contains an interactive visualization of classifier accuracy at each taxonomic level. Take a look at this to confirm classifier performance (given the caveats noted above). Looks like this classifier performs quite well! (this is based on the assumption that your classifier has good coverage of the natural diversity you can expect, and that the species labels are accurate).
  3. Just in case you want to inspect the predicted taxonomy of your input sequences, you can take a look at ncbi-refseqs-predicted-taxonomy.qza. Compare that to the input taxonomy to see how well your classifier did!

Including specific accession ids

The --p-query and --m-accession-ids-file options can be used simultaneously to fine-tune your queries. For instance, you could supplement the RefSeq sequences with those used for the STIRRUPS vaginal microbiome-specific database you could download the list of GenBank accession ids. The first step would be to copy the accession ids from that spreadsheet into a text file that would look something like:

id
M59083.1
NR_037018.1
U10874.1
AF487679.1
HQ992948.1
NR_028978.1
AJ234039.1
AJ243892.1
AJ421779.1
AJ249326.1
AJ697609.1
AF355191.1
X80413.1
AF433168.1
AF489585.1
AB545933.1
EF558367.1
<snip>

The real file would contain 973 accessions. Note that the first line must contain a valid identifier column name, such as “id”. If you use an inappropriate label on the first line qiime will complain.

Assuming that the above file were named stirrups-accessions.txt, you could then include those sequences by replacing the call to get-ncbi-data in the RefSeq case (above) with the command

qiime rescript get-ncbi-data \
    --p-query '33175[BioProject] OR 33317[BioProject]' \
    --m-accession-ids-file stirrups-accessions.txt \
    --o-sequences ncbi-refseqs-unfiltered.qza \
    --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

That data set contains some full genomes, so if you wanted to exclude those (for want of something better), you could again filter by length, but this time include a maximum length for each sequence by running

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences ncbi-refseqs-unfiltered.qza \
    --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \
    --p-labels Archaea Bacteria \
    --p-min-lens 900 1200 \
    --p-max-lens 1800 2400 \
    --o-filtered-seqs ncbi-refseqs.qza \
    --o-discarded-seqs ncbi-refseqs-tooshortorlong.qza

This would result in some loss of data, because you would lose those whole genomes. However, you could then continue with the example from the RefSeq section to build and validate a taxonomic classifier.

Enjoy!

10 Likes
Building a COI database from NCBI references
Taxonomy Levels from NCBI custom Database
Rescript: to Generate Functional Genes' Taxonomy and Sequence References
RESCRIPt question: Can you/should you dereplicate NCBI data?
DADA2 with COI primers - hard to diagnose sequence loss?
Has anyone used chloroplast "contaminants" to do host phylogeography?
dereplication error with data collected from `get-ncbi-data` command
Building a COI database from BOLD references
Matching ASVs with 16s Sanger amplicons.
classifier for fungi data analysis
Sanger sequences as a database to look for contaminations
Help Needed Generating Database with RESCRIPt
where to get reference taxonomy for training custom classifier?
Reference Alignments: Best Way to Download Many Sequences from NCBI GenBank
Import RDP classifier output taxonomy file into qiime2
Import reference sequences database and train classifier for mcrA sequences
How can I process MD5sum format sequence data in the QIIME2?
Using classify-consensus-blast with NCBI data (via RESCRIPt)
incomplete taxonomic assignment
Classifier trained on NCBI RefSeq 16S rRNA data gives weird results
Pipline for already demultiplexed, non-barcoded samples
rescript get-ncbi-data ‘dict’ object has no attribute ‘add’
Building a COI database from BOLD references
classifier does not support confidence values in rescript
How to build a classifier from a Fungene database
Taxonomy file creation query
DADA2 with SILVA 99% OTU classifier
[Newbie] HELP in making a classifier from a FASTA and taxonomy file from an online database
Large portion of Cyanobacteria
Classifier trained on NCBI RefSeq 16S rRNA data gives weird results
DADA2 and Deblur outputs are extremely different
Using RESCRIPt's 'extract-seq-segments' to expand a reference sequence set.
Large portion of Cyanobacteria

3 posts were split to a new topic: rescript: how to handle gapped sequence data

A post was split to a new topic: rescript get-ncbi-data ‘dict’ object has no attribute ‘add’

A post was merged into an existing topic: Taxonomy file creation query

2 posts were split to a new topic: Merge separate sequence and taxonomy artifacts output from RESCRIPt

An off-topic reply has been split into a new topic: RESCRIPT error: Download did not finish. Reason unknown

Please keep replies on-topic in the future.