Picking values for --p-min-length and --p-max-length in qiime feature-classifier extract-reads

Hi there,

We’ve done some 16S rRNA amplicon sequencing with primers that Illumina says are used to sequence the V3 and V4 variable regions of the 16S rRNA gene.

Forward primer:
5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-3’

Reverse primer:
5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-3’

Here's my qiime dada2 denoise-paired command:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 17 \
--p-trim-left-r 21 \
--p-trunc-len-f 294 \
--p-trunc-len-r 216 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats stats.qza \
--p-n-threads 8 \
--verbose

And here's my qiime feature-classifier extract-reads command:

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-min-length 400 \
  --p-max-length 500 \
  --o-reads ref-seqs.qza

My understanding is that the primers we're using produce amplicons of ~464 bp in length (see this forum post). So, by using --p-trim-left-f 17 and --p-trim-left-r 21 in the qiime dada2 denoise-paired step, I'd end up with amplicons of length (464 − 17 − 21) = 426 bp. Is that correct? If so, what would be the best values to use for --p-min-length and --p-max-length in qiime feature-classifier extract-reads?
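As a quick sanity check of that arithmetic, here is a small shell sketch. It assumes, as the post does, that the ~464 bp figure includes both locus-specific primers; the variable names are purely illustrative.

```shell
# Locus-specific primer sequences from the extract-reads command in this post.
f_primer="CCTACGGGNGGCWGCAG"
r_primer="GACTACHVGGGTATCTAATCC"

# Approximate amplicon length quoted in the post (assumed to include both primers).
amplicon_len=464

# ${#var} expands to the length of the string held in var.
trim_left_f=${#f_primer}
trim_left_r=${#r_primer}

expected_len=$(( amplicon_len - trim_left_f - trim_left_r ))
echo "forward primer: ${trim_left_f} nt, reverse primer: ${trim_left_r} nt"
echo "expected length after primer removal: ${expected_len} bp"
```

The primer lengths come out to the same 17 and 21 used for --p-trim-left-f and --p-trim-left-r, so the trimming values and the subtraction are consistent with each other.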

I notice that someone used --p-min-length 400 and --p-max-length 450 in a similar situation (with the same primers and similar trimming), and got a thumbs up from @Mehrbod_Estaki. When I ran my analysis initially, I used --p-min-length 400 and --p-max-length 500 in the qiime feature-classifier extract-reads command, but I guess I'm wondering if there's any significant difference between using, say ...

  • a tight interval like: --p-min-length 420 and --p-max-length 430
  • or, a wider interval like: --p-min-length 400 and --p-max-length 450
  • or, an even wider interval like: --p-min-length 400 and --p-max-length 500

Is there any 'rule of thumb' people use for this? Or does it even matter very much?

Some relevant info (from the qiime feature-classifier extract-reads usage page):

--p-min-length INTEGER  Minimum amplicon length. Shorter amplicons are
    Range(0, None)        discarded. Applied after trimming and truncation, so
                          be aware that trimming may impact sequence
                          retention. Set to zero to disable min length
                          filtering.                             [default: 50]
--p-max-length INTEGER  Maximum amplicon length. Longer amplicons are
    Range(0, None)        discarded. Applied before trimming and truncation,
                          so plan accordingly. Set to zero (default) to
                          disable max length filtering.           [default: 0]

Thanks as always for the help! :blush:

Kevin

Hi @KQUB,

The min/max parameters are used to trim the "extracted" primer-specific region from your reference database (SILVA, in your case). So, as long as your reference sequences encompass the whole region extracted by your primers, it should be fine. While it's been shown that a classifier trained on a specific region can improve classification a bit, I'm not sure anyone has benchmarked the effect of that additional trimming. My gut feeling is that as long as the reference is equal to or longer than your query sequence, those small length differences won't really affect classification. My recommendation for a region such as V3-V4, which varies in length, is to just not do any additional trimming.

Hi, @Mehrbod_Estaki.

Thanks for your response!

So, you think that, when running the qiime feature-classifier extract-reads command, the difference between something like --p-min-length 400 and --p-max-length 450 or --p-min-length 400 and --p-max-length 500 isn't very significant?

Do you mean that your current recommendation, when using primers targeting the V3–V4 region, is to set both --p-min-length and --p-max-length to zero?

Thanks again for your time!

Hi Kevin,

I think your assumption regarding the amplicon length is correct. But for training the classifier, I wouldn't go with interval parameters that are too tight. You risk removing biologically relevant sequences, because the length of this region of the 16S rRNA gene differs among prokaryotic groups (as you also mention in your previous post).

I am pretty comfortable with --p-min-length 400 and --p-max-length 500, as this gives me enough margin for error and confidence that I am capturing the relevant sequences while excluding some larger artefacts :slight_smile:
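To make that risk concrete, here is a toy shell sketch that counts how many amplicons survive each of the three intervals from the question. The lengths are invented for illustration and are not measured from SILVA; `count_retained` is a hypothetical helper, and it treats both bounds as inclusive, which is how the help text reads ("shorter" and "longer" amplicons are discarded).

```shell
# Hypothetical extracted-amplicon lengths in bp (made up for illustration only).
lengths="402 418 424 426 427 431 445 460 489"

count_retained() {
  # $1 = minimum length, $2 = maximum length.
  # Counts the lengths that satisfy min <= length <= max.
  kept=0
  for len in $lengths; do
    if [ "$len" -ge "$1" ] && [ "$len" -le "$2" ]; then
      kept=$(( kept + 1 ))
    fi
  done
  echo "$kept"
}

tight=$(count_retained 420 430)
medium=$(count_retained 400 450)
wide=$(count_retained 400 500)
echo "420-430 keeps ${tight}, 400-450 keeps ${medium}, 400-500 keeps ${wide}"
```

With these made-up lengths, the tight interval drops every sequence that sits more than a few bases away from the expected 426 bp, which is the kind of loss a clade with a slightly longer or shorter V3-V4 region would suffer.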

Thanks for your input, Deni! :blush:

Hi @KQUB,

I agree with @Deni_Ribicic here. I think in your scenario you shouldn't use such strict parameters. Either don't trim at all (what I would do) or use something relaxed, such as @Deni_Ribicic's recommended 400/500. This is especially important when targeting a variable region like V3-V4: the last thing you want is to introduce a bias against a clade whose amplicons happen to be longer or shorter than your arbitrary cutoffs.

That's what I would do, to play it safe :man_shrugging:
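For reference, a minimal sketch of the "don't trim at all" option, assuming it means disabling both length filters by setting them to zero as the help text quoted above describes (the default for --p-min-length is 50, so it has to be set explicitly). The wrapper name `extract_reads_unfiltered` is hypothetical; the flags and primers are the ones already used in this thread.

```shell
# Hypothetical wrapper: defining it runs nothing until it is called.
extract_reads_unfiltered() {
  # $1 = reference sequences artifact, $2 = output artifact
  qiime feature-classifier extract-reads \
    --i-sequences "$1" \
    --p-f-primer CCTACGGGNGGCWGCAG \
    --p-r-primer GACTACHVGGGTATCTAATCC \
    --p-min-length 0 \
    --p-max-length 0 \
    --o-reads "$2"
}

# Example call (requires an activated QIIME 2 environment):
# extract_reads_unfiltered silva-138-99-seqs.qza ref-seqs.qza
```

Swapping the two zeros for 400 and 500 gives the relaxed alternative mentioned above.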
