Training Silva database

Hello there,

I have a question about training a database.
When extracting reference reads, is it OK to skip truncating the sequences?

The QIIME 2 docs (https://docs.qiime2.org/2020.2/tutorials/feature-classifier/) suggest truncating the reference sequences to a length based on the QC results of the sequencing run (120 bases).

Does this mean I should retrain the 16S database for every analysis, based on the sequencing quality?

Thanks as always, I have learned a lot here.

Best,
Jun

Hi @junkim83,

When you train a classifier, you often train it for a specific region, and you want something that makes sense for that region (or you just use the full-length classifier). The truncation is to check for things that don’t make sense for that region. For example, if my amplicon should be about 450 base pairs, I’m probably going to get rid of anything that’s less than 200.

My advice is this:

If you’re doing 16S sequencing rarely, or this is your first time, a pre-trained, full-length classifier is fine. The specificity isn’t necessarily worth the expense, frustration, and effort, and what you get out will be good enough, especially given all the other noise. I sometimes do this if I’m working with a weird primer pair, or just want something quick and easy. Think of it like being a novice or occasional baker who uses a box cake mix from the store. It makes a pretty tasty cake! :cake:

On the other hand, if you or your group does this regularly with the same region and primers, it may make sense for someone to train a region-specific classifier and for everyone to reuse it. (This is what my bioinformatics group does.) If you’re going to use it all the time, a trained classifier will be worth it. (Similarly, if you eat cake all the time, I’ve heard rumors that you should make it yourself.)

I hope I understood your question correctly; please correct me if I didn’t!

Best,
Justine

Hello Justine,

Thanks as always for your help.

I know that a trained classifier is necessary.
Sorry that my wording confused you.

Below are the commands for training:


qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-trunc-len 120 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza


I understand using the primer sequences to shorten the reference sequences for the classifier.
But I don’t understand the options --p-trunc-len, --p-min-length, and --p-max-length.

I think the --p-f-primer and --p-r-primer options trim the full-length 16S sequences down to the region between the primers.

If that is right, why should I use the other options for training?

I understand what these options do, but I don’t understand why they are needed.

If I just want to train the classifier using only the two primers, I do not have to use the other options (trunc-len, min-length, max-length), right?

Thanks again!

Best,
Jun

Hi @junkim83,

--p-min-length is the minimum length for your reads. So if, after you use the primer pair, your read is shorter than 100 bp, you’re going to discard it. --p-max-length is the reverse: if a read is longer than 400 bp, you’ll discard it. These get used to help filter out reads that shouldn’t be in the region. Maybe they’re too short, maybe they’re too long, but this is kind of a sanity filter. I think for your ~120 bp region, I would lower --p-min-length if you’re setting it. I think you can skip both of these; the default parameters will probably work.
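
To make the sanity filter concrete, here is a minimal sketch in plain Python. It is not the QIIME 2 implementation: `passes_length_filter` and the toy reads are made up for illustration, and treating 0 as "no limit" is an assumption based on how the extract-reads help text reads to me.

```python
def passes_length_filter(read, min_length=100, max_length=400):
    """Return True if len(read) lies within [min_length, max_length].

    A bound of 0 is treated as "disabled" (an assumption for this sketch).
    """
    n = len(read)
    if min_length > 0 and n < min_length:
        return False  # too short to plausibly be the target region
    if max_length > 0 and n > max_length:
        return False  # too long to plausibly be the target region
    return True


# Toy amplicons with lengths 80, 253, and 450 bases.
reads = {"too_short": "A" * 80, "v4_like": "A" * 253, "too_long": "A" * 450}
kept = [name for name, seq in reads.items() if passes_length_filter(seq)]
print(kept)  # ['v4_like']
```

Only the read whose length falls inside the window survives; the other two are exactly the kind of off-target hits the filter is meant to drop.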

The --p-trunc-len option cuts off the amplicon at a specific length and requires everything to be that long. I think for you, you may want to leave that alone.
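
Here is a small sketch of how truncation and the minimum length can interact, again in plain Python with a hypothetical helper (`truncate_then_filter`) rather than QIIME 2’s own code. It assumes the minimum-length check runs after truncation, which is my reading of the documentation; check the help text for your version before relying on it.

```python
def truncate_then_filter(amplicon, trunc_len=120, min_length=100):
    """Cut the amplicon to at most trunc_len bases, then apply min_length.

    Returns the (possibly truncated) sequence, or None if it is too short.
    trunc_len=0 means "do not truncate" (assumed convention for this sketch).
    """
    if trunc_len > 0:
        amplicon = amplicon[:trunc_len]
    if min_length > 0 and len(amplicon) < min_length:
        return None
    return amplicon


full = "ACGT" * 63 + "A"   # 253 bases, longer than trunc_len
short = "ACGT" * 20        # 80 bases, shorter than min_length

print(len(truncate_then_filter(full)))                  # 120
print(truncate_then_filter(short))                      # None
print(truncate_then_filter(full, 120, min_length=150))  # None
```

Under that assumption, a minimum length larger than the truncation length would discard every read, so the two values need to be chosen together.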

But, again, I’m going to offer the unsolicited advice that you can get a pre-trained full-length classifier for Silva and Greengenes, which is probably way easier if this is your first time.

Best,
Justine