V3-V4 region, training a classifier, parameters

Hi, I have a ref. sequence set that consists of the full length of the 16S rRNA gene. I need to train a classifier by using a primer set:
S-D-Bact-0341-b-S-17, 5′-CCTACGGGNGGCWGCAG-3′, and S-D-Bact-0785-a-A-21, 5′-GACTACHVGGGTATCTAATCC-3′ (Klindworth et al., 2013). The Illumina V3-V4 protocol says that this primer pair covers about 460 bp. The QIIME 2 tutorial "Training feature classifiers with q2-feature-classifier" (QIIME 2 2023.9.2 documentation) gives this example:
qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-trunc-len 120 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

I just couldn't figure out which values I should choose for the parameters --p-trunc-len, --p-min-length, and --p-max-length in order to extract the V3-V4 region from my full-length reference sequences. Should I add any other parameters?

P.S. In case anyone recommends using an existing classifier: there are pre-trained classifiers, of course, but our taxonomic names (text) data will be a little different, which is why I need to train my own.

Hi,

It can always feel a bit trial and error / slightly arbitrary at this point, but you can make informed decisions by following the walkthrough, so don’t worry! Sometimes you will feel you are going around in circles :sweat_smile:

Choosing parameters depends on a range of things (for example, your sequencing approach and target region length). In the tutorial you link to, the notes under the example explain how to make an informed decision about these parameters.

For example, for --p-trunc-len it says this should only be used if “query sequences are trimmed to this same length or shorter”, and it goes on to explain that for the “classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed.”

The next notes give similar insight into the --p-min-length and --p-max-length choices: these can be used to remove amplicons far outside the length you were aiming for. The notes also mention how the additional trim parameters work.

So, if your target amplicon is ~460 bp, ask yourself: what counts as far outside that, and what was your sequencing read length?
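To make that concrete, here is a rough sketch of what the command could look like with the primers from the question. The file names are placeholders, and the 300/600 bounds are only an assumed example of "far outside ~460 bp", not values taken from the tutorial. It also assumes paired-end (or untrimmed single-end) reads, which is why --p-trunc-len is left out, following the tutorial note quoted above.

```shell
# Primers from the question (Klindworth et al., 2013).
F_PRIMER="CCTACGGGNGGCWGCAG"
R_PRIMER="GACTACHVGGGTATCTAATCC"

# Assumed example bounds around a ~460 bp target; adjust for your data.
MIN_LEN=300
MAX_LEN=600

# Placeholder artifact names; replace with your own files.
REF_SEQS="full-length-ref-seqs.qza"
OUT_READS="ref-seqs-v3v4.qza"

# Dry run by default: the command is only printed.
# Run the script with DRY_RUN="" to actually call qiime.
DRY_RUN="${DRY_RUN-echo}"

# No --p-trunc-len here: the extracted reads are left untrimmed.
$DRY_RUN qiime feature-classifier extract-reads \
  --i-sequences "$REF_SEQS" \
  --p-f-primer "$F_PRIMER" \
  --p-r-primer "$R_PRIMER" \
  --p-min-length "$MIN_LEN" \
  --p-max-length "$MAX_LEN" \
  --o-reads "$OUT_READS"
```

If a lot of your reference sequences get dropped, loosening the min/max bounds and checking the lengths of what comes out is a reasonable next step.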

If you look across the forum, many people have asked about training classifiers for the same region, so it may be worth seeing what they settled on as well. For example, there is some useful conversation here or here

Hope that helps, :slightly_smiling_face: :dna:

Vic
