Primer sequence for feature classifier

Hello,

I’m new to QIIME and have a quick question about using the feature classifier. I am training an 18S feature classifier using MaarjAM. I used a mix of 7 forward and 7 reverse primers to amplify the 18S region.

Here are the forward primers:

ACACTGACGACATGGTTCTACACAGCCGCGGTAATTCCAGCT
ACACTGACGACATGGTTCTACAGCAGCCGCGGTAATTCCAGCT
ACACTGACGACATGGTTCTACAAGCAGCCGCGGTAATTCCAGCT
ACACTGACGACATGGTTCTACATAGCAGCCGCGGTAATTCCAGCT
ACACTGACGACATGGTTCTACAGTAGCAGCCGCGGTAATTCCAGCT
ACACTGACGACATGGTTCTACACGTAGCAGCCGCGGTAATTCCAGCT
ACACTGACGACATGGTTCTACAACGTAGCAGCCGCGGTAATTCCAGCT

Here are the reverse primers:

TACGGTAGCAGAGACTTGGTCTGAACCCAAACACTTTGGTTTCC
TACGGTAGCAGAGACTTGGTCTCGAACCCAAACACTTTGGTTTCC
TACGGTAGCAGAGACTTGGTCTTCGAACCCAAACACTTTGGTTTCC
TACGGTAGCAGAGACTTGGTCTATCGAACCCAAACACTTTGGTTTCC
TACGGTAGCAGAGACTTGGTCTCATCGAACCCAAACACTTTGGTTTCC
TACGGTAGCAGAGACTTGGTCTTCATCGAACCCAAACACTTTGGTTTCC
TACGGTAGCAGAGACTTGGTCTATCATCGAACCCAAACACTTTGGTTTCC

Others I have talked to have said they have used the longest F/R primer sequences to train the classifier. Is this correct, or would it make more sense to use just the sequence that is common to all of the primers? Please let me know if I can provide any more information. Thanks!

Mariah

Hi @mmcintosh,

Welcome! Thanks for posting!

Use just the common sequences at the 3' ends. It looks like the primer constructs you posted may contain more than the actual primer sequence. As far as I can tell, the actual primers are the 3' ends of each construct, CAGCCGCGGTAATTCCAGCT and GAACCCAAACACTTTGGTTTCC. Those are the primers that should be used for extracting 18S amplicons from the reference sequences when training a feature classifier. Using the longest sequences that you posted might result in some sequences being excluded from the classifier, since presumably some of the internal sequences are not shared by all species (otherwise why do you use multiple primers?). Don't use the shortest constructs, either, for the same reason — use the 3' ends that are shared by all constructs.
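For anyone who wants to check the shared 3' ends rather than read them off by eye, here is a minimal sketch in plain Python. The `common_suffix` helper is something I made up for illustration (it is not part of QIIME 2), and the sequences are copied from the post above.

```python
def common_suffix(seqs):
    """Return the longest suffix shared by every sequence in seqs."""
    shortest = min(len(s) for s in seqs)
    n = 0
    # Walk backwards from the 3' end while all sequences agree.
    while n < shortest and len({s[-1 - n] for s in seqs}) == 1:
        n += 1
    return seqs[0][len(seqs[0]) - n:]


# Primer constructs exactly as posted above.
forward_constructs = [
    "ACACTGACGACATGGTTCTACACAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACATAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAGTAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACACGTAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAACGTAGCAGCCGCGGTAATTCCAGCT",
]
reverse_constructs = [
    "TACGGTAGCAGAGACTTGGTCTGAACCCAAACACTTTGGTTTCC",
    "TACGGTAGCAGAGACTTGGTCTCGAACCCAAACACTTTGGTTTCC",
    "TACGGTAGCAGAGACTTGGTCTTCGAACCCAAACACTTTGGTTTCC",
    "TACGGTAGCAGAGACTTGGTCTATCGAACCCAAACACTTTGGTTTCC",
    "TACGGTAGCAGAGACTTGGTCTCATCGAACCCAAACACTTTGGTTTCC",
    "TACGGTAGCAGAGACTTGGTCTTCATCGAACCCAAACACTTTGGTTTCC",
    "TACGGTAGCAGAGACTTGGTCTATCATCGAACCCAAACACTTTGGTTTCC",
]

print(common_suffix(forward_constructs))  # CAGCCGCGGTAATTCCAGCT
print(common_suffix(reverse_constructs))  # GAACCCAAACACTTTGGTTTCC
```

Keep in mind that a shared suffix only tells you what the constructs have in common. It cannot tell you whether every shared base is actually part of the binding primer, so it is worth confirming against the published primer sequences.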

I am unfamiliar with this sort of setup so am very intrigued. Why do you use this mix of primers instead of a single primer? Is the 5' end actually biological sequence, or some type of adapter? What are the internal sequences that are not shared among primers? Either way, you should be using the common sequence at the 3' ends of each primer; otherwise you will only extract reads that hit the longest primers.

I hope that helps!

Hey @Nicholas_Bokulich,

If you were to right-align the forward and reverse sequences above, you would have, from right to left, the ~18 bp primer sequence, a 2 bp linker, a heterogeneity spacer of variable length (0-6 bp, to increase sequence variability and reduce problems related to low sequence diversity), and then the CS1/CS2 tag sequence.
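To make that layout visible, here is a rough Python sketch that splits the forward constructs into three parts. It assumes the tag is simply the prefix shared by all constructs and that everything shared at the 3' end is linker plus primer; string comparison alone cannot separate the linker from the primer, so the last column may contain both.

```python
import os

# Forward constructs as posted at the top of the thread.
forward_constructs = [
    "ACACTGACGACATGGTTCTACACAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACATAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAGTAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACACGTAGCAGCCGCGGTAATTCCAGCT",
    "ACACTGACGACATGGTTCTACAACGTAGCAGCCGCGGTAATTCCAGCT",
]

# os.path.commonprefix compares strings character by character,
# so it works on plain sequences as well as on paths.
tag = os.path.commonprefix(forward_constructs)
# Shared 3' block: common prefix of the reversed strings, reversed back.
shared_3p = os.path.commonprefix([s[::-1] for s in forward_constructs])[::-1]

# Whatever sits between the shared 5' and 3' blocks is the spacer.
spacers = [s[len(tag):len(s) - len(shared_3p)] for s in forward_constructs]

for spacer in spacers:
    print(f"{tag} | {spacer or '-':<6} | {shared_3p}")
```

The middle column comes out 0 to 6 bases long, which is the variable-length spacer described above. The same approach works for the reverse constructs.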

Our approach is similar to that discussed in the two papers below:

Thanks for clarifying @Lorinda! I was thinking that it was probably a phasing approach but wasn’t sure.

So that settles it then, @mmcintosh — you definitely don’t want to include the linker, heterogeneity spacer, or tag sequences when extracting sequences for training a feature classifier! Even the linker may cause problems (by definition it should have low homology to reference sequences) so make sure that you are using the actual binding primer sequence itself.
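Here is a toy Python sketch of why the extra bases matter. It is not how QIIME 2 does the extraction: it uses exact string matching only, the reference sequence is made up for illustration, and `extract_amplicon` is a hypothetical helper. The primers are the shared 3' sequences from earlier in the thread, which are what you would pass to `qiime feature-classifier extract-reads` as `--p-f-primer` and `--p-r-primer`.

```python
def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_amplicon(reference, fwd_primer, rev_primer):
    """Return the region between the two primer sites, or None if a site is missing."""
    start = reference.find(fwd_primer)
    if start == -1:
        return None
    start += len(fwd_primer)
    # The reverse primer binds the opposite strand, so search for its reverse complement.
    end = reference.find(reverse_complement(rev_primer), start)
    if end == -1:
        return None
    return reference[start:end]


FWD = "CAGCCGCGGTAATTCCAGCT"
REV = "GAACCCAAACACTTTGGTTTCC"

# Made-up reference: flank + forward site + toy insert + reverse site + flank.
toy_insert = "ATGCATGCATGC"
reference = "TTTT" + FWD + toy_insert + reverse_complement(REV) + "AAAA"

print(extract_amplicon(reference, FWD, REV))         # ATGCATGCATGC
# Prepending spacer bases ("AG" here) means the primer no longer matches the reference.
print(extract_amplicon(reference, "AG" + FWD, REV))  # None
```

Real tools usually tolerate some mismatches, so a short spacer might not always cause a complete miss, but the longer the non-biological sequence, the more reference sequences are likely to be dropped.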

Thank you @Nicholas_Bokulich and @Lorinda!
