When you train a classifier, you often train for the region you sequenced, and you want something that makes sense for that region (or you just use the full-length classifier). The truncation is there to catch things that don't make sense for that region. For example, if my amplicon should be about 450 base pairs, I'm probably going to get rid of anything shorter than 200.
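As a rough sketch of what that looks like in practice (the SILVA file name, the V3-V4 primer pair, and the output name are just examples; swap in your own), the length filter goes on the read-extraction step:

```bash
# Pull a ~450 bp region out of the full-length reference and discard
# anything under 200 bp that clearly doesn't belong to the amplicon.
# File names and primers here are examples only.
qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-min-length 200 \
  --o-reads ref-seqs-v3v4.qza
```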
My advice is this:
If you're doing 16S sequencing rarely, or this is your first time, a pre-trained, full-length classifier is fine. The specificity isn't necessarily worth the expense, frustration, and effort, and what you get out will be good enough, especially given all the other noise. I sometimes do this if I'm working with a weird primer pair, or just want something quick and easy. Think of it like being a novice or occasional baker who uses a box cake mix from the store. It makes a pretty tasty cake!
On the other hand, if you or your group does this regularly with the same region and primers, it may make sense for someone to train a region-specific classifier and for everyone to re-use it. (This is what my bioinformatics group does.) If you're going to use it all the time, a trained classifier will be worth it. (Similarly, if you eat cake all the time, I've heard rumors that you should make it yourself.)
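If you do go that route, the training itself is one step once you've extracted the region-specific reads (again, the file names here are placeholders continuing the example above):

```bash
# Train a Naive Bayes classifier on the extracted, region-specific reads,
# then re-use the resulting .qza for every run with these primers.
# File names are examples only.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v3v4.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classifier v3v4-classifier.qza
```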
I hope I understood your question correctly; please let me know if I didn't!
I understand using the primer sequences to shorten the reference for the classifier.
But I don't understand the options --p-trunc-len, --p-min-length, and --p-max-length.
I think the --p-f-primer and --p-r-primer options trim the full-length 16S sequences down to the region between the primer sequences.
If that's right, why should I use the other options for training?
I understand what these options do, but I don't understand why they're needed.
If I just want to train the classifier using the two primers, I don't have to use the other options (trunc, min-, max-length), right?
--p-min-length is the minimum length for your reads. So if, after you apply the primer pair, an extracted read is shorter than 100 bp, you're going to discard it. --p-max-length is the reverse: if a read is longer than 400 bp, you'll discard it. This is used to help limit reads that shouldn't be in the region. Maybe they're too short, maybe they're too long, but it's a kind of sanity filter. For your ~120 bp region, I would lower --p-min-length if you're setting it at all. I think you can skip both of these; the default parameters will probably work.
--p-trunc-len cuts off the amplicon at a specific length and requires everything to be that long. For you, I think you may want to leave that alone.
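In other words, for your ~120 bp region something like this is probably enough, with the length options left at their defaults (the primer sequences and file names below are placeholders for yours):

```bash
# Extract the ~120 bp region using only the primer pair, leaving
# --p-trunc-len, --p-min-length, and --p-max-length at their defaults.
# Primer sequences and file names are placeholders.
qiime feature-classifier extract-reads \
  --i-sequences your-reference-seqs.qza \
  --p-f-primer YOUR_FORWARD_PRIMER \
  --p-r-primer YOUR_REVERSE_PRIMER \
  --o-reads ref-seqs-120bp.qza
```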
But, again, I'm going to offer the unsolicited advice that you can get a pre-trained, full-length classifier for SILVA or Greengenes, which is probably way easier if this is your first time.
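If you go that way, using it is a single command; the classifier file name below is just an example of the kind distributed on the QIIME 2 data resources page:

```bash
# Classify your representative sequences with a downloaded, pre-trained,
# full-length classifier. The classifier file name is an example.
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```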