Training a feature classifier: values for --p-trunc-len, --p-min-length, and --p-max-length

In the code below:

qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-trunc-len 120 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

If we truncated the forward and reverse reads to different lengths at the DADA2 step, what --p-trunc-len should we use here? In my case, the forward reads were not cut and stayed at 250 nt, while the reverse reads were lower quality and so were cut at 244 nt.

Also, how do we choose values for the --p-min-length and --p-max-length?

Thanks

Hi @Negin,
Please see the tutorial, which has answers to all these questions and more! See the “notes” in this section:
https://docs.qiime2.org/2019.10/tutorials/feature-classifier/#extract-reference-reads

Hi Nicholas,

Thank you for your response. Does the sentence below from the tutorial notes mean that I need to use a --p-trunc-len of 250, since my original sequences are 250 nt? In other words, should I use the length before trimming?

“For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed.”

No — since you are using paired-end sequences you should not truncate the extracted reference sequences.

Oh okay, thank you! So this means that I should probably go with the default options for --p-trunc-len, --p-min-length, and --p-max-length, and just leave these arguments out.

For paired-end reads, do not truncate the reference sequences with extract-reads (that is, leave out --p-trunc-len).

Min and max length are another story, though. You should check the literature to see what the expected size range is for your primer set. Alternatively, switch these filters off, check the length distribution of the extracted sequences, and decide for yourself whether there are abnormally short or long sequences that need to be winnowed out.
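
Putting both points together, a paired-end extraction could be sketched roughly as below. This is only a sketch under assumptions: extract_ref_reads is a hypothetical wrapper name, the file names in the example call are placeholders, and the optional length bounds are whatever you settle on from the literature or from the observed length distribution. It uses only the flags already shown in this topic, and --p-trunc-len is deliberately left out.

```shell
# Hypothetical helper: extract reference reads for paired-end data.
# No --p-trunc-len is passed, so the extracted reads are not truncated.
# Usage: extract_ref_reads REF.qza FWD_PRIMER REV_PRIMER OUT.qza [MIN_LEN] [MAX_LEN]
extract_ref_reads() {
    ref="$1"; fwd="$2"; rev="$3"; out="$4"
    min_len="${5:-}"; max_len="${6:-}"

    # Build the argument list in the positional parameters (POSIX-safe).
    set -- feature-classifier extract-reads \
        --i-sequences "$ref" \
        --p-f-primer "$fwd" \
        --p-r-primer "$rev" \
        --o-reads "$out"

    # Only add length filters when the caller supplies them;
    # otherwise the defaults of your QIIME 2 release apply.
    if [ -n "$min_len" ]; then set -- "$@" --p-min-length "$min_len"; fi
    if [ -n "$max_len" ]; then set -- "$@" --p-max-length "$max_len"; fi

    qiime "$@"
}

# Example call (placeholder file names, primers from the first post):
# extract_ref_reads 85_otus.qza GTGCCAGCMGCCGCGGTAA GGACTACHVGGGTWTCTAAT ref-seqs.qza
```

Passing a fifth and sixth argument (for example 200 and 600) adds the length filters. Check qiime feature-classifier extract-reads --help in your release for the default values.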

Good luck!

Thank you for your help!

I extracted the reads using my primers with the default options and used the command below to create a visualization, but when I try to view it I get a blank page. I thought these were sequences, so I could use qiime feature-table tabulate-seqs, but perhaps not. How else can I look at the file to see how long the extracted sequences are?

qiime feature-table tabulate-seqs \
  --i-data qza/silva_132_99_v3v4_eub-euf_extracted.qza \
  --o-visualization qzv/silva_132_99_v3v4_eub-euf_extracted.qzv &

The file seems to be too big for me to upload here.

You can use tabulate-seqs — I am not sure why the page will not load, maybe a browser issue? Or the file is too large to display?

Yes, the file is large indeed!

Okay, that makes sense. It also makes sense that it would be really large if you are extracting sequences from a reference database (as opposed to a collection of ASVs or OTUs from a real dataset). So I think this is effectively a browser issue: the file may be too large to load, which we occasionally see, e.g., with really large Emperor plots.

Try this: just extract the QZV file and grab the length distribution summary like this:

$ qiime tools extract --input-path rep-seqs.qzv --output-path .
Extracted rep-seqs.qzv to directory 789ea3c6-8ac4-442a-adbd-d80738359b71
$ head 789ea3c6-8ac4-442a-adbd-d80738359b71/data/seven_number_summary.tsv 
Quantile	Value
0.02	120
0.09	120
0.25	120
0.5	120
0.75	120
0.91	120
0.98	120

Note: you will need to modify the file path to match the ID that is printed to the screen. In my case the message was "Extracted rep-seqs.qzv to directory 789ea3c6-8ac4-442a-adbd-d80738359b71", so I used that ID as the directory name in the next command.
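
If seven_number_summary.tsv is not present in the extracted directory, the lengths can also be computed directly from the sequences themselves. Here is a minimal sketch, assuming the extracted reads have first been exported to FASTA with qiime tools export (which writes a dna-sequences.fasta file for sequence artifacts). The helper name fasta_length_summary and the paths in the comments are hypothetical.

```shell
# Hypothetical helper: print count, min, median, and max sequence length
# for a FASTA file. Sequences wrapped over several lines are handled.
# Usage: fasta_length_summary FASTA
fasta_length_summary() {
    awk '
        /^>/ { if (seen) print len; seen = 1; len = 0; next }  # new record
        { len += length($0) }                                  # sequence line
        END { if (seen) print len }                            # last record
    ' "$1" | sort -n | awk '
        { a[NR] = $1 }
        END {
            if (NR == 0) { print "no sequences"; exit 1 }
            mid = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
            printf "n=%d min=%d median=%s max=%d\n", NR, a[1], mid, a[NR]
        }
    '
}

# Example (placeholder paths):
# qiime tools export --input-path ref-seqs.qza --output-path exported-ref-seqs
# fasta_length_summary exported-ref-seqs/dna-sequences.fasta
```

This gives a coarser picture than the quantile table above, but it is usually enough to spot abnormally short or long sequences.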

Hi Nicholas,

I ran these commands and there isn’t a seven_number_summary file in the data folder. I checked the folder manually in addition to running the command. Strangely, there is one in my downloads folder from 7 days ago, and I don’t remember generating it. Either way, there isn’t one for the task I just ran. Maybe there is an issue with the file I created.

I was also trying to find the normal amplicon length range for my primers (EUBF-EUBR) in the literature, without much success. I found this link, which seems to show a range of 100-500 for V3V4. That should work for me, although my primer is a bit longer than the one shown there:
https://help.ezbiocloud.net/comparison-between-v3v4-and-full-length-sequencing-of-16s-rrna-genes/

Sounds like you are running an older release of QIIME 2. This length summary was added a couple of releases ago, I believe.

As long as the primers are hitting the same site, you can go off that info.

Another good place to get info like this is the forum! Here is a recent topic describing expected length for V3V4, though the length range is not stated, only (presumably) the mean:

Based on these findings, 300-600 nt is probably a fine, permissive range for you to use (though in practice the variance is probably much less, since most 16S regions don't have that much length variation).

I am using qiime2/2019.10 which should be the latest version?

Regarding the amplicon size, I have reads that are below 300 nt in my data. My average read length was 320 nt. What would happen to the shorter ones? Would they get removed later, at the taxonomy assignment step?

Okay, so maybe set 200 nt as a lower bound. Sounds like you may be using different primers compared to that other topic.
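
Before settling on bounds, it can help to see how many extracted reference sequences a given range would actually drop. A rough sketch, assuming a FASTA file such as the dna-sequences.fasta written by qiime tools export; count_outside_range is a hypothetical helper, and the 200 and 600 nt bounds in the example are just the values floated in this topic.

```shell
# Hypothetical helper: count sequences shorter than MIN or longer than MAX.
# Usage: count_outside_range FASTA MIN MAX
count_outside_range() {
    awk -v min="$2" -v max="$3" '
        function tally() {
            if (len < min) too_short++
            else if (len > max) too_long++
        }
        /^>/ { if (seen) tally(); seen = 1; len = 0; next }  # new record
        { len += length($0) }                                # sequence line
        END {
            if (seen) tally()                                # last record
            printf "short=%d long=%d\n", too_short, too_long
        }
    ' "$1"
}

# Example (placeholder path):
# count_outside_range exported-ref-seqs/dna-sequences.fasta 200 600
```

If both counts are zero or tiny, the bounds make little practical difference for that reference set.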

I am using these primers:

Forward: EUB_F 5’-TCCTACGGGAGGCAGCAGT (19 nt)
Reverse: EUB_R 5’-GGACTACCAGGGTATCTAATCCTGTT (26 nt)

Actually, I had used the wrong file. Sorry about that. Here is the length range, which seems reasonable. Does that mean that I don't need to redo extract-reads with the min and max parameters, because I left them at the defaults?

[Image: screenshot of the length summary for the extracted sequences]

That is correct!

Good luck!

Great, thanks! 🙂

Yes, this really is a browser problem. Google Chrome can open the file.