Creating a custom reference database

BenKaehler · March 28, 2018, 12:21am

I just wanted to follow up on @ebolyen's excellent explanations.

If you want to trim primers, you can use extract-reads. There is an example here. In our tests, we found that trimming sometimes helped and sometimes didn't, so my advice is that it is not super-important. For it to work, the primer sequences must consistently be included in your reference sequences prior to trimming, or you will lose sequences when you run extract-reads. We ran into that problem with some fungal data sets. Sorry this advice isn't simpler. If you have plenty of time on your hands, you could try it both ways.
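To make the "you will lose sequences" point concrete, here is a toy Python sketch of primer-based extraction. It is not how extract-reads is implemented: the function name, the primers, and the reference sequences are all made up for illustration, and it only does exact matching on one strand, whereas the real tool is more forgiving.

```python
def extract_region(seq, f_primer, r_primer_rc):
    """Return the region between the two primers (primers trimmed off),
    or None if either primer cannot be found in the sequence.

    Hypothetical helper for illustration only: exact matches, one strand.
    """
    start = seq.find(f_primer)
    if start == -1:
        return None  # forward primer missing: sequence is dropped
    start += len(f_primer)
    end = seq.find(r_primer_rc, start)
    if end == -1:
        return None  # reverse primer site missing: sequence is dropped
    return seq[start:end]


# Made-up primers and references (assumptions, not real data).
F_PRIMER = "CCGG"
R_PRIMER_RC = "GGCC"  # reverse primer, already reverse-complemented

references = {
    # Full-length reference that spans both primer sites.
    "ref_with_primers": "AAACCGGTTTTACGTACGTGGCCAATT",
    # Reference that starts inside the amplicon, so the forward primer is absent.
    "ref_truncated": "TTTTACGTACGTGGCCAATT",
}

kept = {}
dropped = []
for name, seq in references.items():
    region = extract_region(seq, F_PRIMER, R_PRIMER_RC)
    if region is None:
        dropped.append(name)
    else:
        kept[name] = region

print("kept:", kept)
print("dropped:", dropped)
```

The second reference covers the same region as the first, but because it was deposited without the forward primer site, it never makes it into the trimmed database. That is the kind of loss to check for before and after trimming, for example by comparing sequence counts.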

Regarding curating your database to adjust the frequency of each taxon, the answer is no. By default, the naive Bayes classifier will assume that each taxon is equally likely to be observed in your sample. This is standard practice. If you are able to predict the relative abundance of each taxon, then you can incorporate this information using the --i-class-weight optional input. At the moment this is an experimental feature, and we expect to have more advice (and a tutorial) on it in the coming months. You could achieve the same effect by somehow altering the observed frequencies in the training set and then setting --p-classify--fit-prior, but that seems fraught with danger to me.
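For intuition about what the equal-likelihood assumption does, here is a minimal Python sketch of Bayes' rule with and without taxon weights. It is not the classifier's actual code; the taxon names, likelihoods, and weights are invented, and the function is a hypothetical helper. It assumes the weights act as prior probabilities for each taxon.

```python
def posterior(likelihoods, weights=None):
    """Combine per-taxon likelihoods with prior weights via Bayes' rule.

    likelihoods: dict mapping taxon -> P(read | taxon)
    weights:     dict mapping taxon -> prior weight; if None, every taxon
                 gets the same weight (the uniform assumption).
    Returns a dict mapping taxon -> P(taxon | read).
    """
    if weights is None:
        weights = {taxon: 1.0 for taxon in likelihoods}
    unnormalised = {t: weights[t] * likelihoods[t] for t in likelihoods}
    total = sum(unnormalised.values())
    return {t: value / total for t, value in unnormalised.items()}


# Invented numbers: the read matches taxon_B slightly better than taxon_A.
likelihoods = {"taxon_A": 0.02, "taxon_B": 0.03}

# Default behaviour: every taxon is assumed equally likely beforehand.
uniform = posterior(likelihoods)

# Assumed prior knowledge: taxon_A is far more common in this sample type.
weighted = posterior(likelihoods, weights={"taxon_A": 0.9, "taxon_B": 0.1})

print("uniform prior: ", uniform)
print("weighted prior:", weighted)
```

With uniform weights the read goes to taxon_B, and with the weights favouring taxon_A the call flips. It also shows why skewing the number of reference sequences per taxon is risky: if the prior is fitted from the training set, those counts quietly become the weights.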

Cheers,
Ben