This tutorial will describe how to create custom reference databases from NCBI Genbank using RESCRIPt.
Read more about RESCRIPt here: Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt - #7
If using NCBI Genbank data, please be aware of the NCBI disclaimer and copyright notice.
If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following pre-print:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581
Getting data with the
get-ncbi-data method allows users to automatically download sequences and associated taxonomies from NCBI Genbank. Sequences can be selected using two different options:
--p-queryoption will allow using a specific entrez query, e.g., to select specific organisms, genes, BioProjects, or other search fields.
--m-accession-ids-fileoption will enable selection of specific accession IDs. This option accepts a metadata file, usually a tab-delimited text file where each accession ID is listed on separate lines, or an artifact that can be viewed as metadata (e.g., an existing sequence or taxonomy file in which the accession IDs are the feature IDs/index).
Getting lots of data with the
If you are downloading lots of data, you should set the
--p-n-jobs parameter to a number greater than one. Five works well in most cases. If you use just one job, the download can take a long time and will probably fail at some point. Conversely, if you use too many jobs, that will also cause your download to fail, usually with an error that contains the words "
ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Read timed out.".
A large download should also only be started after 9 pm (US) Eastern Time and finish before 5 am
(as per NCBI policy). If you start a large download and do not set
--p-n-jobs to something sensible (say, five), you risk finishing outside of that timing window.
Creating a classifier from RefSeqs data
In this example, we will use RESCRIPt to download the NCBI 16S rRNA gene RefSeqs data.
qiime rescript get-ncbi-data \ --p-query '33175[BioProject] OR 33317[BioProject]' \ --o-sequences ncbi-refseqs-unfiltered.qza \ --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza
This will by default generate a 7-level taxonomy with domain, phylum, class, order, family, genus, species. Other ranks can be selected with the
Note that some of the sequences look unusually short (as little as 300 nt!) and probably represent truncated 16S rRNA gene sequences. We can filter these out to focus on full-length sequences:
qiime rescript filter-seqs-length-by-taxon \ --i-sequences ncbi-refseqs-unfiltered.qza \ --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \ --p-labels Archaea Bacteria \ --p-min-lens 900 1200 \ --o-filtered-seqs ncbi-refseqs.qza \ --o-discarded-seqs ncbi-refseqs-tooshort.qza
As some functions in RESCRIPt evaluate taxonomic information, we want to make sure the taxonomies correspond to the nearly full-length sequences we’ve retained. We can filter out the corresponding taxonomies with
filter-taxa, using the
--m-ids-to-keep-file parameter (which takes a metadata file, including some artifacts, as input) to only retain features found in our filtered sequences file:
qiime rescript filter-taxa \ --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \ --m-ids-to-keep-file ncbi-refseqs.qza \ --o-filtered-taxonomy ncbi-refseqs-taxonomy.qza
Now we can train our classifier and evaluate its accuracy all in one go with
evaluate-fit-classifier. This command will fit a classifier (which usually takes a few minutes) and then test that classifier on all of the sequences found in the input, to give a best-case estimate of accuracy (i.e., when all query sequences have one or more known matches in the reference database). This is not, in general, a reliable indicator of accuracy but is a good sanity check to make sure that your classifier is not total garbage (e.g., poor accuracy when the correct answer is known would indicate that the species in the database cannot be easily resolved):
qiime rescript evaluate-fit-classifier \ --i-sequences ncbi-refseqs.qza \ --i-taxonomy ncbi-refseqs-taxonomy.qza \ --o-classifier ncbi-refseqs-classifier.qza \ --o-evaluation ncbi-refseqs-classifier-evaluation.qzv \ --o-observed-taxonomy ncbi-refseqs-predicted-taxonomy.qza
The outputs consist of:
ncbi-refseqs-classifier.qzais the trained classifier, ready to be used to classify some real query sequences! (this classifier is identical to the naive Bayes taxonomic classifier output by
qiime feature-classifier fit-classifier-naive-bayes).
ncbi-refseqs-classifier-evaluation.qzvcontains an interactive visualization of classifier accuracy at each taxonomic level. Take a look at this to confirm classifier performance (given the caveats noted above). Looks like this classifier performs quite well! (this is based on the assumption that your classifier has good coverage of the natural diversity you can expect, and that the species labels are accurate).
- Just in case you want to inspect the predicted taxonomy of your input sequences, you can take a look at
ncbi-refseqs-predicted-taxonomy.qza. Compare that to the input taxonomy to see how well your classifier did!
Including specific accession ids
--m-accession-ids-file options can be used simultaneously to fine-tune your queries. For instance, you could supplement the RefSeq sequences with those used for the STIRRUPS vaginal microbiome-specific database you could download the list of GenBank accession ids. The first step would be to copy the accession ids from that spreadsheet into a text file that would look something like:
id M59083.1 NR_037018.1 U10874.1 AF487679.1 HQ992948.1 NR_028978.1 AJ234039.1 AJ243892.1 AJ421779.1 AJ249326.1 AJ697609.1 AF355191.1 X80413.1 AF433168.1 AF489585.1 AB545933.1 EF558367.1 <snip>
The real file would contain 973 accessions. Note that the first line must contain a valid identifier column name, such as “id”. If you use an inappropriate label on the first line
qiime will complain.
Assuming that the above file were named
stirrups-accessions.txt, you could then include those sequences by replacing the call to
get-ncbi-data in the RefSeq case (above) with the command
qiime rescript get-ncbi-data \ --p-query '33175[BioProject] OR 33317[BioProject]' \ --m-accession-ids-file stirrups-accessions.txt \ --o-sequences ncbi-refseqs-unfiltered.qza \ --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza
That data set contains some full genomes, so if you wanted to exclude those (for want of something better), you could again filter by length, but this time include a maximum length for each sequence by running
qiime rescript filter-seqs-length-by-taxon \ --i-sequences ncbi-refseqs-unfiltered.qza \ --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \ --p-labels Archaea Bacteria \ --p-min-lens 900 1200 \ --p-max-lens 1800 2400 \ --o-filtered-seqs ncbi-refseqs.qza \ --o-discarded-seqs ncbi-refseqs-tooshortorlong.qza
This would result in some loss of data, because you would lose those whole genomes. However, you could then continue with the example from the RefSeq section to build and validate a taxonomic classifier.