Hiya Q2 Team! :qiime2:
I've been working with some legacy 454 data we have, following along with the suggestions made in this helpful post, and it has run beautifully!
I removed adapters and primers from both Roche runs with cutadapt, denoised both with DADA2 (truncating at position 326), then merged the two runs.
I'm now trying to train a classifier. We targeted the V4/V5 region, with an ideal product size of 363 bp (or 328 bp after primer removal). I know that when choosing the min / max lengths for qiime feature-classifier extract-reads you should generally focus on the expected size range for your primer set, but based on advice here, somewhat more relaxed min / max values are advisable. I went with --p-min-length 250 and --p-max-length 600, based on a suggestion made to others training V4/V5 classifiers (admittedly with different primers than ours).
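To sanity-check those numbers, here is a minimal Python sketch of the length arithmetic. The helper name is made up for illustration (it is not a QIIME 2 function), and I'm assuming the min / max bounds are inclusive and are compared against the primer-free amplicon length.

```python
# Lengths taken from the post above; everything else is illustrative.
AMPLICON_WITH_PRIMERS = 363  # ideal product size, in bp
AMPLICON_NO_PRIMERS = 328    # same product after primer removal, in bp
MIN_LENGTH = 250             # value chosen for the minimum length
MAX_LENGTH = 600             # value chosen for the maximum length


def within_window(length, min_length=MIN_LENGTH, max_length=MAX_LENGTH):
    """Return True if `length` falls inside the window.

    Assumption: both bounds are treated as inclusive here.
    """
    return min_length <= length <= max_length


# Bases accounted for by the two primers combined.
primer_bases = AMPLICON_WITH_PRIMERS - AMPLICON_NO_PRIMERS

# How much slack the window leaves on either side of the expected length.
margin_below = AMPLICON_NO_PRIMERS - MIN_LENGTH
margin_above = MAX_LENGTH - AMPLICON_NO_PRIMERS

print(within_window(AMPLICON_NO_PRIMERS), primer_bases, margin_below, margin_above)
```

So the expected 328 bp product sits inside the 250 to 600 window, with more room above it than below.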
But I'm wondering about choosing the trunc-len parameter. Based on this note (from here): "The --p-trunc-len parameter should only be used to trim reference sequences if query sequences are trimmed to this same length or shorter. Paired-end sequences that successfully join will typically be variable in length. Single-end reads that are not truncated at a specific length may also be variable in length. For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed."
The other related posts here and here cover paired-end reads.
Where I'm confused is the weirdness of my 454 data specifically. Technically they're single-end reads, which I have truncated to 326 via DADA2. So I feel like I should apply a --p-trunc-len of 326. But functionally, they're more like the equivalent of merged paired-end reads, so I'm not really sure my assumption is correct. Am I overthinking this?
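In case it helps, this is how I'm reading the rule in that note, as a minimal Python sketch. The function name and the example length lists are hypothetical, and this is only my interpretation of the quoted sentence, not anything QIIME 2 itself runs.

```python
def trunc_len_matches_queries(query_lengths, trunc_len):
    """Check the quoted rule: trim the reference to `trunc_len` only if
    every query sequence is trimmed to that same length or shorter.

    `query_lengths` is an iterable of query sequence lengths in bp.
    """
    lengths = list(query_lengths)
    if not lengths:
        raise ValueError("need at least one query length")
    return max(lengths) <= trunc_len


# Hypothetical case 1: every query was truncated to exactly 326 bp.
uniform_queries = [326, 326, 326]

# Hypothetical case 2: variable-length queries, as with joined paired-end reads.
variable_queries = [300, 326, 340]

print(trunc_len_matches_queries(uniform_queries, 326))
print(trunc_len_matches_queries(variable_queries, 326))
```

If my sequences really are all 326 bp or shorter after DADA2, they would fall into the first case, which is why I lean toward a --p-trunc-len of 326.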
Thanks for any insight you may be able to provide!!!