Using RESCRIPt to compile sequence databases and taxonomy classifiers from NCBI Genbank

This tutorial will describe how to create custom reference databases from NCBI Genbank using RESCRIPt.

Read more about RESCRIPt here: Processing, filtering, and evaluating (reference) sequence data with RESCRIPt

If using NCBI Genbank data, please be aware of the NCBI disclaimer and copyright notice.

Citation:

A proper software announcement is in progress. For now, if you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following:

Getting data with the get-ncbi-data method

The get-ncbi-data method allows users to automatically download sequences and associated taxonomies from NCBI Genbank. Sequences can be selected using two different options:

  1. The --p-query option will allow using a specific entrez query, e.g., to select specific organisms, genes, BioProjects, or other search fields.
  2. The --m-accession-ids-file option will enable selection of specific accession IDs. This option accepts a metadata file, usually a tab-delimited text file where each accession ID is listed on separate lines, or an artifact that can be viewed as metadata (e.g., an existing sequence or taxonomy file in which the accession IDs are the feature IDs/index).

Getting lots of data with the get-ncbi-data method

If you are downloading lots of data, you should set the --p-n-jobs parameter to a number greater than one. Five works well in most cases. If you use just one job, the download can take a long time and will probably fail at some point. Conversely, if you use too many jobs, that will also cause your download to fail, usually with an error that contains the words “ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Read timed out.”.

A large download should also only be started after 9 pm (US) Eastern Time and finish before 5 am
(as per NCBI policy). If you start a large download and do not set --p-n-jobs to something sensible (say, five), you risk finishing outside of that timing window.

Creating a classifier from RefSeqs data

In this example, we will use RESCRIPt to download the NCBI 16S rRNA gene RefSeqs data.

qiime rescript get-ncbi-data \
    --p-query '33175[BioProject] OR 33317[BioProject]' \
    --o-sequences ncbi-refseqs-unfiltered.qza \
    --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

This will by default generate a 7-level taxonomy with domain, phylum, class, order, family, genus, species. Other ranks can be selected with the --p-ranks parameter.

Note that some of the sequences look unusually short (as little as 300 nt!) and probably represent truncated 16S rRNA gene sequences. We can filter these out to focus on full-length sequences:

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences ncbi-refseqs-unfiltered.qza \
    --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \
    --p-labels Archaea Bacteria \
    --p-min-lens 900 1200 \
    --o-filtered-seqs ncbi-refseqs.qza \
    --o-discarded-seqs ncbi-refseqs-tooshort.qza

As some functions in RESCRIPt evaluate taxonomic information, we want to make sure the taxonomies correspond to the nearly full-length sequences we’ve retained. We can filter out the corresponding taxonomies with filter-taxa, using the --m-ids-to-keep-file parameter (which takes a metadata file, including some artifacts, as input) to only retain features found in our filtered sequences file:

qiime rescript filter-taxa \
    --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \
    --m-ids-to-keep-file ncbi-refseqs.qza \
    --o-filtered-taxonomy ncbi-refseqs-taxonomy.qza

Now we can train our classifier and evaluate its accuracy all in one go with evaluate-fit-classifier. This command will fit a classifier (which usually takes a few minutes) and then test that classifier on all of the sequences found in the input, to give a best-case estimate of accuracy (i.e., when all query sequences have one or more known matches in the reference database). This is not, in general, a reliable indicator of accuracy but is a good sanity check to make sure that your classifier is not total garbage (e.g., poor accuracy when the correct answer is known would indicate that the species in the database cannot be easily resolved):

qiime rescript evaluate-fit-classifier \
    --i-sequences \
    --i-taxonomy ncbi-refseqs-taxonomy.qza \
    --o-classifier ncbi-refseqs-classifier.qza \
    --o-evaluation ncbi-refseqs-classifier-evaluation.qzv \
    --o-observed-taxonomy ncbi-refseqs-predicted-taxonomy.qza

The outputs consist of:

  1. ncbi-refseqs-classifier.qza is the trained classifier, ready to be used to classify some real query sequences! (this classifier is identical to the naive Bayes taxonomic classifier output by qiime feature-classifier fit-classifier-naive-bayes).
  2. ncbi-refseqs-classifier-evaluation.qzv contains an interactive visualization of classifier accuracy at each taxonomic level. Take a look at this to confirm classifier performance (given the caveats noted above). Looks like this classifier performs quite well! (this is based on the assumption that your classifier has good coverage of the natural diversity you can expect, and that the species labels are accurate).
  3. Just in case you want to inspect the predicted taxonomy of your input sequences, you can take a look at ncbi-refseqs-predicted-taxonomy.qza. Compare that to the input taxonomy to see how well your classifier did!

Including specific accession ids

The --p-query and --m-accession-ids-file options can be used simultaneously to fine-tune your queries. For instance, you could supplement the RefSeq sequences with those used for the STIRRUPS vaginal microbiome-specific database you could download the list of GenBank accession ids. The first step would be to copy the accession ids from that spreadsheet into a text file that would look something like:

id
M59083.1
NR_037018.1
U10874.1
AF487679.1
HQ992948.1
NR_028978.1
AJ234039.1
AJ243892.1
AJ421779.1
AJ249326.1
AJ697609.1
AF355191.1
X80413.1
AF433168.1
AF489585.1
AB545933.1
EF558367.1
<snip>

The real file would contain 973 accessions. Note that the first line must contain a valid identifier column name, such as “id”. If you use an inappropriate label on the first line qiime will complain.

Assuming that the above file were named stirrups-accessions.txt, you could then include those sequences by replacing the call to get-ncbi-data in the RefSeq case (above) with the command

qiime rescript get-ncbi-data \
    --p-query '33175[BioProject] OR 33317[BioProject]' \
    --m-accession-ids-file stirrups-accessions.txt \
    --o-sequences ncbi-refseqs-unfiltered.qza \
    --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

That data set contains some full genomes, so if you wanted to exclude those (for want of something better), you could again filter by length, but this time include a maximum length for each sequence by running

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences ncbi-refseqs-unfiltered.qza \
    --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza \
    --p-labels Archaea Bacteria \
    --p-min-lens 900 1200 \
    --p-max-lens 1800 2400 \
    --o-filtered-seqs ncbi-refseqs.qza \
    --o-discarded-seqs ncbi-refseqs-tooshortorlong.qza

This would result in some loss of data, because you would lose those whole genomes. However, you could then continue with the example from the RefSeq section to build and validate a taxonomic classifier.

Enjoy!

2 Likes

3 posts were split to a new topic: rescript: how to handle gapped sequence data

A post was split to a new topic: rescript get-ncbi-data ‘dict’ object has no attribute ‘add’