Training classifier after trimming primers

xchromosome · May 30, 2022, 3:07pm

Hi there.

I've been searching the forum every day for about 2 weeks now and still haven't found an answer to this! I have been following the moving pictures tutorial, but also using the forum a lot for help and adding extra steps that seem to be important but missing from the tutorial. In searching the forums, I have come across several suggestions that it's good practice to trim the primer regions from the sequences, either at the Deblur/DADA2 step, or using cutadapt. By "primer regions", I mean the area targeted by the primers, not the barcodes etc.

I have MiSeq 300 PE reads covering the V4 region using the newer primer set:
515f Modified - GTGYCAGCMGCCGCGGTAA (Parada et al.)
806r Modified - GGACTACNVGGGTWTCTAAT (Apprill et al.)

So, I used Deblur and the --p-left-trim-len parameter to trim 19 bases from the left, which should be my primer region. I have 300 bp paired-end reads (MiSeq). After a bit of trial and error with different trim lengths and looking at how many sequences per sample I had left after Deblur, I set the --p-trim-length to 269, which gives me 250 bp sequences after the primer region is removed.
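As a quick sanity check on the numbers above, here is a minimal Python sketch of the arithmetic. It assumes, as the figures in this post imply, that the value given to --p-trim-length is counted before the left trim is removed. The variable names are only illustrative.

```python
# Primer sequences quoted in this post (IUPAC degenerate bases included).
FWD_PRIMER = "GTGYCAGCMGCCGCGGTAA"   # 515f modified (Parada et al.)
REV_PRIMER = "GGACTACNVGGGTWTCTAAT"  # 806r modified (Apprill et al.)

# Bases to drop from the left: the full length of the forward primer.
left_trim_len = len(FWD_PRIMER)

# Value passed to --p-trim-length in this post.
trim_length = 269

# Assumption: the trim length is counted before the left trim is applied,
# so the retained sequence is the difference between the two.
retained = trim_length - left_trim_len

print(left_trim_len, len(REV_PRIMER), retained)  # 19 20 250
```

The forward primer is 19 bases and the reverse primer is 20, so a left trim of 19 only accounts for the forward primer.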

I'm now at the taxonomy stage and need to train, or pick, a classifier. I see that weighted classifiers are considered superior so I want to use that. I see there is an animal gut one available on Zenodo. Cool. My question is two-fold.

If I train my own classifier, will it matter that I've already removed the sequence-specific primer regions from my reads?
If I use a pre-trained one, does it matter whether the pre-trained classifier has been trained using the older Caporaso primer set while I'm using a very slightly different one?

Thanks so much for your help!

colinbrislawn · May 31, 2022, 3:39pm

Hello @xchromosome,

I think the central question is about trimming primers, so let's start there.

It's essential to remove barcodes, otherwise the same feature from different samples would be split into different ASVs due to the barcode.
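To make the barcode point concrete, here is a small Python sketch. The barcodes and the amplicon are made up for illustration, and the fixed 8-base barcode length is an assumption of the example, not something from your data.

```python
# Hypothetical amplicon shared by two samples (illustration only).
amplicon = "TACGGAGGGTGCAAGCGTTAATCGG"

# Hypothetical 8-base sample barcodes prepended to the same amplicon.
barcode_len = 8
reads = {
    "sample_A": "ACGTACGT" + amplicon,
    "sample_B": "TGCATGCA" + amplicon,
}

# With barcodes attached, the same organism yields two distinct sequences.
with_barcodes = set(reads.values())

# After stripping the barcodes, both samples collapse to one sequence.
without_barcodes = {read[barcode_len:] for read in reads.values()}

print(len(with_barcodes), len(without_barcodes))  # 2 1
```

A denoiser sees only the sequence strings, so anything sample-specific left on the reads ends up splitting one feature into several.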

It's not essential to remove the region targeted by the primer, as it's just a conserved region flanking the variable region, so it should not matter much.

It can be unnecessary to remove the primer if you use a sequencing chemistry that starts the sequencing-by-synthesis using the same primer used to amplify the amplicon, instead of the generic Illumina adapter. In this case, R1 and R2 start after the primer, so there's no primer region there to remove.

You can align your ASVs against a full-length 16S gene to see whether the region being primed is showing up in your reads, or not!
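If you want a quicker first look than a full alignment, a simpler substitute is to test whether your sequences begin with something compatible with the degenerate forward primer. The Python sketch below does an exact prefix match using the standard IUPAC ambiguity codes. It is not an alignment and tolerates no mismatches or indels. The example reads are made up, and the function names are only illustrative.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer: str) -> re.Pattern:
    """Compile a degenerate primer into a regex matching any concrete variant."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def starts_with_primer(seq: str, primer: str) -> bool:
    """Return True if seq begins with a sequence compatible with the primer."""
    return primer_regex(primer).match(seq.upper()) is not None


FWD_PRIMER = "GTGYCAGCMGCCGCGGTAA"  # 515f modified, from the original post

# Hypothetical reads, made up for illustration only.
with_primer = "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGTGCAAGCG"
without_primer = "TACGGAGGGTGCAAGCG"

print(starts_with_primer(with_primer, FWD_PRIMER))     # True
print(starts_with_primer(without_primer, FWD_PRIMER))  # False
```

If most of your representative sequences come back False, the primer region is probably not in your reads and there is nothing left to trim.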

Cool

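On whether it matters that you've already removed the primer regions when training your own classifier: it shouldn't. I don't think those primer regions are included in the sklearn pre-trained databases, and it won't matter to q2-vsearch due to how vsearch calculates alignment scores.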

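On whether it matters that a pre-trained classifier was built with the older Caporaso primer set: well... it shouldn't, as the old and new primers target the same region. However, I wonder if primer biases get embedded in the sklearn classifier and subtly skew results.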

If you do want to build your own classifier, try using RESCRIPt, the same plugin used to build the official pre-trained databases!

system · July 1, 2022, 9:39pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.