Thank you for your response. So does the sentence below from the tutorial notes mean that I would need to use a --p-trunc-len of 250, since my original sequences are 250 nt? In other words, should I use the size before trimming?
“For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed.”
For paired-end reads, do not truncate the reference sequences with extract-reads.
Min and max length are another story, though. You should check the literature to see what the expected size range is for your primer set — or just switch these off, then check the length of the extracted sequences to see what the length distribution is, and decide for yourself if there are abnormally short or long sequences that need to be winnowed out.
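For example, switching the length filters off would look something like this (the reference artifact name and primer sequences below are placeholders; substitute your own):

```shell
# Extract the region amplified by your primers from the full-length
# reference sequences, with length filtering disabled (min 0 / max 0)
# so you can inspect the raw length distribution afterwards.
# File names and primer sequences are placeholders.
qiime feature-classifier extract-reads \
  --i-sequences reference-seqs.qza \
  --p-f-primer YOURFWDPRIMER \
  --p-r-primer YOURREVPRIMER \
  --p-min-length 0 \
  --p-max-length 0 \
  --o-reads extracted-ref-seqs.qza
```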
I extracted the reads using my primers with default options and used the code below to visualize the result, but it gives me a blank page. I thought these were sequences, so I could use qiime feature-table tabulate-seqs, but apparently not. How else can I look at the file to see what length the extracted sequences are?
Okay, that makes sense… it also makes sense that it would be really large if you are extracting sequences from a reference database (as opposed to a collection of ASVs or OTUs from a real dataset). So I think this is effectively a browser issue; the file may be too large to load, which we occasionally see, e.g., with really large Emperor plots.
Try this: just extract the QZV file and grab the length distribution summary like this:
Note: you will need to modify the filepath to reflect the ID that is printed to the screen. See how I got the message `Extracted rep-seqs.qzv to directory 789ea3c6-8ac4-442a-adbd-d80738359b71` and then used that ID as the directory name in the following line.
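A sketch of what I mean (the artifact name and UUID will differ on your end, and the exact summary filename may vary by release):

```shell
# A .qzv is just a zip archive; this unpacks it into a directory
# named after the artifact's UUID and prints that UUID to the screen.
qiime tools extract --input-path rep-seqs.qzv --output-path .

# Substitute the UUID printed by the previous command here.
cat 789ea3c6-8ac4-442a-adbd-d80738359b71/data/seven_number_summary.csv
```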
I ran these commands and there isn’t a seven_number_summary in the data folder. I checked the folder manually in addition to running the command. Strangely, there is one in my Downloads folder from 7 days ago, and I don’t remember generating that. But anyway, there isn’t one related to the task I just ran. Maybe there is an issue with the file I created.
Sounds like you are running an older release of QIIME 2. This length summary was added a couple of releases ago, I believe.
As long as the primers are hitting the same site, you can go off of that info.
Another good place to get info like this is the forum! Here is a recent topic describing the expected length for V3-V4, though the length range is not stated, only (presumably) the mean:
Based on these findings, 300-600 is probably a fine, permissive range for you to use (though in practice the variance is probably much less, since most 16S regions don’t have that much length variation).
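Applying that range at extraction time would look roughly like this (file names and primer sequences are placeholders again; only the min/max values come from the discussion above):

```shell
# Re-run the extraction, this time discarding amplicons that fall
# outside the 300-600 nt window discussed above.
qiime feature-classifier extract-reads \
  --i-sequences reference-seqs.qza \
  --p-f-primer YOURFWDPRIMER \
  --p-r-primer YOURREVPRIMER \
  --p-min-length 300 \
  --p-max-length 600 \
  --o-reads extracted-ref-seqs.qza
```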
I am using qiime2/2019.10, which should be the latest version?
Regarding the amplicon size, I have reads that are below 300 nt in my data; my average read length was 320 nt. What would happen to the reads that are shorter? Would they get removed during the subsequent taxonomy assignment?
Actually, I had used the wrong file. Sorry about that. Here is the range that seems reasonable. Does that mean I don’t need to redo extract-reads with the max and min parameters, since I had them at their defaults?
Yes, I used Safari instead of Google Chrome and was able to see it for a few seconds. Unfortunately, the seven-number file only shows the values above 2% and below 98%. The actual values have a minimum of 52 and a maximum of 1871, so there are erroneous reads in there. I can either remove those, though then I really don’t know what the expected range is, or try to find the expected range first. The original paper that introduced eubf-eubr mentions an amplicon size of 466 bp but does not specify a range.
Also, does this length include the primers as well, given that we are removing the primers in dada2?