Creating a custom reference database

Hello Everyone,

I am trying to analyze sequences generated from less-common barcodes (CO1, EF1), where the Greengenes and UNITE databases might not be useful. Is there any detailed protocol on how to build a database from reference sequences downloaded from NCBI? I appreciate your help.


Hi @asr17,

What format are your sequences in at the moment? If they are GenBank records, we’ll need to convert them to fasta first; otherwise you should be in good shape.

Our goal is to create a FeatureData[Sequence] artifact and a FeatureData[Taxonomy] artifact. You are already halfway there on the FeatureData[Sequence], but you will probably want each sequence to have only the accession ID as the FASTA ID. Depending on whether your seqs are from GenBank or RefSeq, your data might look like this:

>KY676659.1 Eurycea nerea isolate MO42 cytochrome oxidase subunit 1 (CO1) gene, partial cds; mitochondrial

Which is perfect, as the GenBank ID is the FASTA ID (everything after a space is the description), meaning you can import this without issue.

Or it might look like this example, which has some colons indicating the range of the genome we’re looking at:

>NC_006922.1:5177-6724 Pogona vitticeps mitochondrial DNA, complete genome

We don’t want that range in our ID as it will make cross-referencing harder.
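If your headers do include a range like that, a small script can strip it out before importing. Here's a minimal sketch (the file names in the usage comment are placeholders):

```python
import re

def clean_header(line):
    """Remove a ':start-end' range from the ID field of a FASTA header line,
    leaving only the accession as the ID. Non-header lines pass through."""
    if not line.startswith(">"):
        return line  # sequence line, leave untouched
    # ">NC_006922.1:5177-6724 Pogona ..." -> ">NC_006922.1 Pogona ..."
    return re.sub(r"^(>[^\s:]+):\S*", r"\1", line, count=1)

# Usage on a whole file:
# with open("refs.fasta") as src, open("refs_clean.fasta", "w") as dst:
#     for line in src:
#         dst.write(clean_header(line))
```

Headers without a range (like the GenBank example above) are left unchanged, so it's safe to run over a mixed file.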

Your sequence IDs might also have pipe characters (|) in them (although I might be mis-remembering, as I can’t find an example at the moment).

Once you confirm that your FASTA file looks good and has the ID you want, you’ll be able to import your data like in this example.

Otherwise, the real trick is going to be creating the FeatureData[Taxonomy]. You’ll need to cross-reference your GenBank/accession IDs to find the associated taxon IDs. From there, you’ll need to make a spreadsheet/TSV mapping each GenBank/accession ID to the taxonomic string you want to see.
I think Entrez is going to be your friend in this endeavour, as you’ll need to perform a lot of queries. Here is a command-line tool for querying and fetching data from Entrez.

Once you have a mapping of IDs to taxonomy strings, you’ll want to import that. The q2-feature-classifier tutorial has an example of importing taxonomy. Pay attention to the source-format; right now your options are:

  • TSVTaxonomyFormat which has this header: Feature ID<tab>Taxon on the first line
  • HeaderlessTSVTaxonomyFormat which has no header.

Otherwise, both are just TSV files with an ID, a tab, then the taxonomy string (usually delimited by ;).
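As a concrete illustration, here's a minimal sketch that writes a mapping in either layout. The accessions and taxonomy strings are invented placeholders standing in for whatever your Entrez queries return:

```python
import csv
import io

# Hypothetical accession -> taxonomy mapping built from your Entrez queries.
taxonomy = {
    "KY676659.1": "k__Animalia; p__Chordata; c__Amphibia; g__Eurycea; s__nerea",
    "NC_006922.1": "k__Animalia; p__Chordata; c__Reptilia; g__Pogona; s__vitticeps",
}

def write_tsv_taxonomy(mapping, handle, header=True):
    """Write an ID -> taxonomy mapping as a TSV. header=True matches
    TSVTaxonomyFormat; header=False matches HeaderlessTSVTaxonomyFormat."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(["Feature ID", "Taxon"])
    for feature_id, taxon in mapping.items():
        writer.writerow([feature_id, taxon])

buf = io.StringIO()
write_tsv_taxonomy(taxonomy, buf)
print(buf.getvalue())
```

In practice you'd write to a real file and point `qiime tools import` at it.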

From there you could continue that tutorial and train a classifier, or perform other analysis as you see fit.

Let me know if that all makes sense and if anyone else has suggestions for streamlining the process, please jump in!


I have another question related to curating your own database: It seems as though the classifier weighs each sequence based on the number of times it appears in the fasta file? So do you need to curate your database so that the frequency of each species / sequence is similar to what you predict to be the relative abundance of each ASV (or taxonomic group) in your sample?

Hi @ebolyen,

Thanks @ebolyen for your thorough guide. My sequences are in fasta format, and I’m now working on the taxonomy file. The E-utilities you pointed to seem much simpler and more straightforward than many of the scripts and stylesheets I found elsewhere. :+1:
I’d like to follow with a couple of questions:
1- Do I need to trim primers from the fasta sequences before training? The reason I ask is that for bacterial 16S training we trimmed the primers used in sequencing, but that was not the case with fungal ITS (UNITE). Please correct me if I’m wrong.
2- Many sequences share the same taxon; some of them might be of equal length and others might differ. Should we keep all of them or remove the replicates? I guess my question is somewhat related to what @ckg89 posted above.
I appreciate your input, and anyone else’s,

Hi @ckg89 and @asr17!

No, each unique sequence appears exactly once in the fasta file.

Remember, the goal is to predict the likelihood that a given sequence belongs to some label (taxonomy in this case). Since biology is messy, we aren’t going to find exact matches in our reference database (and if we did, our job would be done!), and alignment is very expensive to perform, so a heuristic approach like naive Bayes is useful.

Since naive Bayes trains on a set of features, we need to define those. In QIIME 2 we use k-mers as the features to train on (not sequence counts). In case this mental analogy is useful, you could kind of think of each reference sequence being split up into its own mini-fasta file containing all possible k-mers. That is the information the classifier uses to predict some other sequence.
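To make that analogy concrete, here's a toy sketch of extracting the overlapping k-mers from a sequence. This is only illustrative of the feature-extraction idea, not how q2-feature-classifier is implemented internally, and k=4 is an arbitrary choice:

```python
def kmers(seq, k=4):
    """Return the list of overlapping k-mers in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("ACGTACGT", k=4))
# → ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT']
```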

I think it depends on how general you want the classifier to be. Classifiers which are trained only on your primer region can perform better, but they are only able to classify sequences that belong to that particular primer-pair.

You want to keep that information! It will help the classifier identify those particular variants.

Hope that helps!


Thanks @ebolyen for your explanation!

Hi @asr17, @ckg89,

I just wanted to follow up on @ebolyen’s excellent explanations.

If you want to trim primers, you can, using extract-reads. There is an example here. In our tests, we found that trimming sometimes helped and sometimes didn’t, so my advice is that it is not super-important. For it to work, the primer sequences must consistently be included in your reference sequences prior to trimming, or you will lose sequences when you run extract-reads. We ran into that problem with some fungal data sets. Sorry this advice isn’t simpler; if you have plenty of time on your hands, you could try it both ways.
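Conceptually, per-sequence trimming amounts to finding the primer sites and keeping only the region between them. A very simplified sketch (ignoring degenerate bases and mismatches, which extract-reads does handle) also shows why sequences get dropped when a primer isn't present:

```python
def extract_amplicon(seq, fwd_primer, rev_primer_rc):
    """Return the region between the forward primer and the reverse primer
    (the latter given as its reverse complement on this strand), or None if
    either primer is missing -- which is how sequences get lost in trimming."""
    start = seq.find(fwd_primer)
    if start == -1:
        return None
    end = seq.find(rev_primer_rc, start + len(fwd_primer))
    if end == -1:
        return None
    return seq[start + len(fwd_primer):end]
```

Sequences returning None here correspond to the references you'd lose from the trimmed database.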

Regarding curating your database to adjust the frequency of each taxon: the answer is no. By default, the naive Bayes classifier assumes that each taxon is equally likely to be observed in your sample, which is standard practice. If you are able to predict the relative abundance of each taxon, you can incorporate this information using the --i-class-weight optional input. At the moment this is an experimental feature, which we expect to have more advice (and a tutorial) on in the coming months. You could achieve the same effect by somehow altering the observed frequencies in the training set and then setting --p-classify--fit-prior, but that seems fraught with danger to me.
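To see why class weights matter at all, here's a toy numeric sketch of how priors shift a Bayes decision. The numbers are invented, and this is not how q2-feature-classifier computes anything; it only illustrates that the same likelihoods can yield different calls under different priors:

```python
def posterior(likelihoods, priors):
    """Normalized Bayes posteriors: P(class|data) ∝ P(data|class) * P(class)."""
    scores = {c: likelihoods[c] * priors[c] for c in likelihoods}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Made-up per-class likelihoods for one query sequence.
likelihoods = {"taxon_A": 0.6, "taxon_B": 0.4}

flat = posterior(likelihoods, {"taxon_A": 0.5, "taxon_B": 0.5})
skewed = posterior(likelihoods, {"taxon_A": 0.1, "taxon_B": 0.9})

# With flat priors taxon_A wins; with a prior favoring taxon_B, the call flips.
print(max(flat, key=flat.get), max(skewed, key=skewed.get))
```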



This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.