V3V4 trim length and length parameters for extract-reads

x0li0013 · June 6, 2019, 2:39pm

Dear Bod and QIIME 2 developers,
I would like to pre-train a classifier for the V3-V4 region (341F, 805R) with gg_99. I have paired-end 16S sequences with 300 bp reads generated by MiSeq; after trimming, the forward reads are 270-290 bases and the reverse reads are 210-230 bases. Here is the command line I'm going to use. I'm not sure what numbers I should put where there is a question mark below:

qiime feature-classifier extract-reads
--i-sequences 99_otus.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--p-trunc-len 300 (?)
--p-min-length 200 (?)
--p-max-length 550 (?)
--o-reads ref-seqs.qza
Your help will be greatly appreciated. I went to the QIIME 2 workshop and didn't get help for this specific question.
Thank you very much in advance!
xiaohong

lakshmi.anarayanan · June 6, 2019, 7:20pm

Thank you for this question, Xiaohong. I'm in a similar situation and I would like to add a question to this thread.

If I do not want to use the --p-trunc-len parameter because I have paired-end reads, could I still use the --p-min-length and --p-max-length parameters?

Also, is it advisable to make --p-min-length and --p-max-length more stringent based on the minimum and maximum length of our data?

Thank you!
Lakshmi

Nicholas_Bokulich · June 7, 2019, 12:16pm

See this section of the classifier training tutorial and read the NOTES at the bottom of the section; they answer all of these questions.

Yes! You read the tutorial and answered part of @x0li0013's question

Absolutely. Those parameters restrict which sequences are accepted after extraction but before trimming. They serve a very different purpose from --p-trunc-len in this command: truncation trims to match the exact read length (usually of single-end reads), whereas the length limits filter out abnormally short or long simulated amplicons, which are probably derived from mismatches.

No, these parameters are all about filtering out low-quality reads, not about conforming to the length of your data. Looking at your data may be a good way to assess the distribution of read lengths for your amplicon target, but you should probably give a little more leeway. You are attempting to simulate PCR with extract-reads, so you want to capture more of the natural length variation present in the reference database.
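One way to picture the "look at your data, then add leeway" advice is the minimal Python sketch below. It is only an illustration: the function name `suggest_length_bounds`, the 20-base leeway, and the example lengths are all made up here and are not part of QIIME 2 or of anyone's data in this thread.

```python
def suggest_length_bounds(observed_lengths, leeway=20):
    """Widen the observed length range by `leeway` bases on each side.

    Hypothetical helper for illustration; the leeway value is an assumption.
    """
    if not observed_lengths:
        raise ValueError("need at least one observed length")
    lower = max(0, min(observed_lengths) - leeway)
    upper = max(observed_lengths) + leeway
    return lower, upper


# Made-up amplicon lengths, for illustration only.
example_lengths = [402, 417, 424, 427, 431, 445]
min_length, max_length = suggest_length_bounds(example_lengths, leeway=20)
print(f"--p-min-length {min_length} --p-max-length {max_length}")
```

How much leeway is appropriate depends on the amplicon target and the reference database, so treat the number as a starting point rather than a recommendation.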

Good luck both of you!

Deni_Ribicic · June 13, 2019, 1:10pm

Hi Nicholas,

I have to say that I am puzzled as well about training the classifier. The training tutorial does not really go into detail explaining the different parameters, or at least it is hard for me to understand.

Briefly, here is what I want to do: I'd like to train a classifier based on the pro341F and pro805R primers and the Silva-132 db. My sequences are paired-end, with each read about 300 bp.

The total DNA stretch covered by this primer pair is 464 bp; taking off 20 bp from each paired read (5'---3', quality/primer trimming) would give me about 424 bp after the reads are paired.
Does this mean that I can use --p-min-length 400 and --p-max-length 450 to extract training sequences targeting this previously calculated region (424 bp)? This is how I understand what these two parameters do.
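The arithmetic above can be written out as a small Python check. This is just a sketch using the numbers quoted in the post (464 bp amplicon, 20 bp taken off each of the two reads, and a proposed 400-450 window); the variable names are invented for illustration.

```python
AMPLICON_BP = 464         # primer-to-primer stretch quoted in the post
TRIMMED_PER_READ_BP = 20  # bases taken off each read, as in the post
N_READS = 2               # forward and reverse read

# Expected length once both reads have been trimmed.
expected_bp = AMPLICON_BP - N_READS * TRIMMED_PER_READ_BP

# Proposed --p-min-length / --p-max-length window from the post.
proposed_min, proposed_max = 400, 450
in_window = proposed_min <= expected_bp <= proposed_max

print(expected_bp, in_window)
```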

Now, where I am getting puzzled is when looking at the provenance of the Silva classifier generated by you guys (silva-132-99-515-806-nb-classifier.qza).
You used a min_length of 50 and a max_length of 0. Does that mean no sequences would be extracted, since max_length is 0? As I understand it, I would expect you to use something like min_length 200 and max_length 300, since the region covered by the primer pair is about 290 bp.

I would really appreciate it if this could be explained in more detail, since imo this is actually the most important step of the pipeline.

Best,
Deni

Mehrbod_Estaki · June 13, 2019, 7:04pm

Hi @Deni_Ribicic,

There is a delicate balance between tutorials with so much info that they become daunting to read and tutorials that under-describe important, commonly used parameters. If you have any suggestions on how to make these better, feel free to share or, better yet, submit a PR.
You can also find a bit more detail about each parameter in every plugin's help files, for example in the extract-reads documentation or on the command line with qiime feature-classifier extract-reads --help

Yes, exactly!

Not exactly. As per the extract-reads help file:

 --p-max-length INTEGER  Maximum amplicon length. Longer amplicons are
    Range(0, None)        discarded. Applied before trimming and truncation,
                          so plan accordingly. Set to zero (default) to
                          disable max length filtering.           [default: 0]

The default value 0 just disables max-length, meaning there is no limit on max length.
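To make the "0 means disabled" rule concrete, here is a toy Python model of the behaviour described in the help text above. It is a sketch, not QIIME 2's actual code: the function name `passes_max_length` is hypothetical, and the lengths are the approximate figures mentioned in this thread.

```python
def passes_max_length(amplicon_length, max_length=0):
    """Toy model of the documented rule; not QIIME 2's implementation.

    A max_length of 0 disables the upper limit. Otherwise, amplicons
    longer than max_length are discarded.
    """
    if max_length == 0:
        return True
    return amplicon_length <= max_length


print(passes_max_length(290, max_length=0))    # limit disabled
print(passes_max_length(290, max_length=300))  # within the limit
print(passes_max_length(464, max_length=300))  # longer than the limit
```

So with max_length 0, as in the classifier's provenance, a roughly 290 bp amplicon is kept rather than discarded.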

Debatable but clarity is always important!

Deni_Ribicic · June 14, 2019, 7:38am

Hi @Mehrbod_Estaki,

Thanks for your prompt answer and clarification!

I have to agree with this

I should probably have done that beforehand.
Reading it now definitely makes it clearer.

Thanks again,
Deni