Training v4v5 classifier for use with legacy 454 Data

Hiya Q2 Team! :wave: :qiime2: :sunglasses:

I've been working with some legacy 454 data we have, following along with the suggestions made in this helpful post, and it has run beautifully!

I removed adapters and primers from both Roche runs with cutadapt, denoised both with DADA2 (truncating at position 326), then merged the two runs.

I'm trying to train a classifier. We focused on the v4/v5 region, with an ideal product size of 363 bp (or 328 bp after primer removal). I know that when choosing min / max lengths for qiime feature-classifier extract-reads, you should generally focus on the expected size range for your primer set, but based on advice here I also know that somewhat more relaxed min / max values are advisable. I chose a min length of 250 and a max length of 600, based on a suggestion made to others training v4/v5 classifiers (admittedly using different primers than ours).
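For concreteness, the extraction step with those bounds could be assembled roughly as in the sketch below. It only builds and prints the command line; it does not run QIIME 2. The artifact file names and the primer strings are placeholders I made up for illustration (they are not the primers from this study), and the optional `trunc_len` argument is there only to show where `--p-trunc-len` would go if it were used.

```python
import shlex


def extract_reads_cmd(f_primer, r_primer, min_length=250, max_length=600, trunc_len=0):
    """Build the argument list for `qiime feature-classifier extract-reads`.

    File names below are hypothetical placeholders, not real artifacts.
    """
    cmd = [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", "ref-seqs.qza",      # hypothetical reference sequences
        "--p-f-primer", f_primer,
        "--p-r-primer", r_primer,
        "--p-min-length", str(min_length),    # relaxed lower bound from the post
        "--p-max-length", str(max_length),    # relaxed upper bound from the post
        "--o-reads", "ref-seqs-v4v5.qza",     # hypothetical output artifact
    ]
    # Only add truncation when explicitly requested (0 means "do not truncate").
    if trunc_len > 0:
        cmd += ["--p-trunc-len", str(trunc_len)]
    return cmd


# Placeholder primer strings: substitute the real forward/reverse primers.
FWD = "FORWARD_PRIMER_SEQ"
REV = "REVERSE_PRIMER_SEQ"

print(shlex.join(extract_reads_cmd(FWD, REV)))                 # untrimmed reference
print(shlex.join(extract_reads_cmd(FWD, REV, trunc_len=326)))  # truncated to match the reads
```

The two printed commands differ only in the trailing `--p-trunc-len 326`, which is exactly the choice being asked about.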

But I'm wondering about choosing the trunc-len parameter. Based on this note (from here): "The --p-trunc-len parameter should only be used to trim reference sequences if query sequences are trimmed to this same length or shorter. Paired-end sequences that successfully join will typically be variable in length. Single-end reads that are not truncated at a specific length may also be variable in length. For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed."

The other related posts here and here cover paired-end reads.

Where I'm confused is the weirdness of my 454 data specifically. Technically they're single-end reads, which I have truncated to 326 with DADA2, so I feel like I should apply a --p-trunc-len of 326. But functionally, they're more like the equivalent of merged paired-end reads, so I'm not really sure my assumption is correct. Am I overthinking this?

Thanks for any insight you may be able to provide!!! :t_rex:


Hi @454Data,

I think there is a subtle aspect being missed regarding whether or not to truncate. Truncating single-end / single-direction reads (e.g. 454) is often recommended to make sure that the variation you are observing is not technical variation, i.e. length variation due to the sequencing method / technology. This is not an issue with merged reads (with the primers removed), as you have the entire sequenced amplicon and the length variation you are observing is real biological signal.

This should help put the posts you referenced, and the comments below, into perspective.

You could truncate your reference database to the length of your 454 reads if you are unsure whether those reads contain the entire sequence between the PCR primers. But if you have performed the appropriate trimming and quality checks on your 454 data, and are sure that each read contains all of the sequence between the primer pair, then you do not need to truncate your reference reads.
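To make the difference concrete, here is a toy sketch of what truncation does to a set of extracted reference amplicons. It is not the plugin's code: it assumes, as I understand the extract-reads documentation, that truncation is applied before the min / max length filter, and the three reference names and lengths are made up for illustration.

```python
def prepare_reference(amplicons, trunc_len=0, min_length=250, max_length=600):
    """Toy model of reference preparation: truncate first, then length-filter.

    A simplified illustration only, not the q2-feature-classifier implementation.
    """
    kept = {}
    for name, seq in amplicons.items():
        if trunc_len > 0:
            seq = seq[:trunc_len]  # cut to trunc_len; shorter sequences are unchanged
        if min_length <= len(seq) <= max_length:
            kept[name] = seq
    return kept


# Made-up extracted amplicons with biologically variable lengths.
amplicons = {
    "ref_a": "A" * 320,
    "ref_b": "A" * 328,
    "ref_c": "A" * 345,
}

untrimmed = prepare_reference(amplicons)
truncated = prepare_reference(amplicons, trunc_len=326)

print({name: len(seq) for name, seq in untrimmed.items()})
print({name: len(seq) for name, seq in truncated.items()})
```

Without truncation the reference keeps its natural length variation (320, 328, 345); with a truncation length of 326 the longer amplicons are cut to 326 and the sequence beyond that position is lost, which only matters if your reads actually extend that far.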

As a personal opinion, though, I often do not truncate my extracted amplicon reference reads to the length of my sequenced amplicon data, even if I am analyzing shorter single-end reads. I often like to compare different data sets (a mix of paired-end and single-end) using the same classifier. This way I do not introduce potential taxonomic assignment biases that may result from differences in database construction.

As recommended in the other posts, you can use RESCRIPt to compare the constructed reference databases.

I'm sure others will have thoughts on this. :slight_smile:
