How to train the classifier with multiple mixed forward primers?

Hi,

I am analysing V1-V2 16S rRNA sequence data. I want to use qiime feature-classifier extract-reads to extract reads and train a classifier.

However, this data has mixed primers:

V1-V2 MiSeq primers (parts in bold are adapter sequences)
Forward: These primers are mixed at a 4:1:1:1 ratio (28F-YM is the 4)

28F-YM: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GAGTTTGATYMTGGCTCAG 
28F-Borrellia: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GAGTTTGATCCTGGCTTAG 
28FChloroflex: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GAATTTGATCTTGGTTCAG 
28F-Bifdo: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GGGTTCGATTCTGGCTCAG

I came across this thread: How to train the classifier with multiple reverse primers?. However, my case seems more complicated: I have four forward primers, and they look quite different from one another.

What should I do?
Any advice would be highly appreciated! Thanks!

Kind regards,
Birong

Hello,

Let's start with the same method that @colinbrislawn explained in detail in the linked thread: we need a metric to decide whether the primers really are different.
So please show how different these primers are from each other.

Cheers
V

Hi V,

Thanks for your kind help.

How about this:

28F-YM:        GAGTTTGATYMTGGCTCAG 
28F-Borrellia: GAGTTTGATCCTGGCTTAG 
28FChloroflex: GAATTTGATCTTGGTTCAG 
28F-Bifdo:     GGGTTCGATTCTGGCTCAG


28F-YM vs 28F-Borrellia           ==   3
28F-YM vs 28FChloroflex           ==   4
28F-YM vs 28F-Bifdo               ==   4
28F-Borrellia vs 28FChloroflex    ==   4
28F-Borrellia vs 28F-Bifdo        ==   4
28FChloroflex vs 28F-Bifdo        ==   6


(19 - 4 mismatches) / 19 bp length == 78.95% identity
So --p-identity 0.7 or 0.8?
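The counts above can be reproduced with a short script. This is only a sketch: the function names are made up for illustration, and it assumes identity is simply matching positions divided by primer length, as in the calculation above. It reports two counts, a literal character comparison (which is what the table uses) and an IUPAC-aware comparison in which a degenerate code such as Y or M counts as a match when it covers the other base. Which of the two is closer to how extract-reads scores a primer is not stated in this thread.

```python
from itertools import combinations

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Forward primers from this thread, adapter sequences removed.
PRIMERS = {
    "28F-YM":        "GAGTTTGATYMTGGCTCAG",
    "28F-Borrellia": "GAGTTTGATCCTGGCTTAG",
    "28FChloroflex": "GAATTTGATCTTGGTTCAG",
    "28F-Bifdo":     "GGGTTCGATTCTGGCTCAG",
}


def literal_mismatches(a, b):
    """Count positions where the two characters differ."""
    if len(a) != len(b):
        raise ValueError("primers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def iupac_mismatches(a, b):
    """Count positions where the two IUPAC codes share no base."""
    if len(a) != len(b):
        raise ValueError("primers must have the same length")
    return sum(not (set(IUPAC[x]) & set(IUPAC[y])) for x, y in zip(a, b))


def pairwise(primers, metric):
    """Apply metric to every unordered pair of primers."""
    return {
        (n1, n2): metric(primers[n1], primers[n2])
        for n1, n2 in combinations(primers, 2)
    }


def identity(mismatches, length):
    """Fraction of matching positions (assumed definition of identity)."""
    return (length - mismatches) / length


if __name__ == "__main__":
    literal = pairwise(PRIMERS, literal_mismatches)
    aware = pairwise(PRIMERS, iupac_mismatches)
    for pair in literal:
        print(f"{pair[0]} vs {pair[1]}: literal={literal[pair]}, "
              f"IUPAC-aware={aware[pair]}")
```

The literal counts come out as 3, 4, 4, 4, 4 and 6, matching the table. With the IUPAC-aware count, 28F-YM differs from 28F-Borrellia, 28FChloroflex and 28F-Bifdo by 1, 3 and 2 positions respectively, because Y and M already cover some of the variation. If 28F-YM is the only primer passed to extract-reads, those comparisons against 28F-YM are the relevant ones.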

However, another problem is that these primers are mixed at a 4:1:1:1 ratio (28F-YM is the 4). How should I take this into account? Should I use 28F-YM with --p-identity 0.7 or 0.8?

Thanks.

Hi @Birong,

It appears that all of these primers bind to the same location and differ by only a few bases. You could combine these 4 sequences into a single pseudo-sequence using the IUPAC ambiguity codes.

An extreme case would result in something like this:
GRRTTYGATYMTGGYTYAG
^^Warning: This might be too ambiguous and lead to spurious hits.

Since we can allow for a certain number of mismatches, let's try something like you suggested and slightly lower the identity, or make a new sequence string (see below). I retained the initial ambiguous IUPAC bases and added additional ones where the common base had a stronger bond (i.e. a G or a C).
GARTTTGATYMTGGCTYAG
^^This still might be too ambiguous, but you get the idea
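To put a number on "too ambiguous", here is a rough sketch that builds a strict per-position IUPAC consensus of the four primers and counts how many concrete sequences a degenerate primer expands to. The function names are illustrative and not part of QIIME 2. One thing to note: a strict union gives H (A, C or T) at position 11 rather than the M in the extreme example above, because 28FChloroflex has a T there.

```python
from math import prod

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

# Forward primers from this thread, adapter sequences removed.
PRIMERS = [
    "GAGTTTGATYMTGGCTCAG",  # 28F-YM
    "GAGTTTGATCCTGGCTTAG",  # 28F-Borrellia
    "GAATTTGATCTTGGTTCAG",  # 28FChloroflex
    "GGGTTCGATTCTGGCTCAG",  # 28F-Bifdo
]


def consensus(seqs):
    """Per-position union of all bases, written as IUPAC codes."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    codes = []
    for column in zip(*seqs):
        bases = frozenset().union(*(IUPAC[c] for c in column))
        codes.append(CODE_FOR[bases])
    return "".join(codes)


def degeneracy(seq):
    """Number of concrete sequences a degenerate primer stands for."""
    return prod(len(IUPAC[c]) for c in seq)


if __name__ == "__main__":
    strict = consensus(PRIMERS)
    print(strict, degeneracy(strict))
    for candidate in ("GRRTTYGATYMTGGYTYAG",   # extreme example above
                      "GARTTTGATYMTGGCTYAG",   # milder example above
                      "GAGTTTGATYMTGGCTCAG"):  # 28F-YM alone
        print(candidate, degeneracy(candidate))
```

The strict consensus is GRRTTYGATYHTGGYTYAG, which stands for 192 different sequences. The extreme example above stands for 128, the milder one for 16, and 28F-YM alone for 4, which gives a sense of how quickly the ambiguity grows.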

:point_right: Another option, which I'd recommend, is to use only one of the primer sets, specifically the one with the 28F-YM primer, and then use the resulting extracted sequences as a reference pool to guide extraction of this region without needing the additional primer pairs. That is, follow the approach outlined here.

-Cheers!
-Mike

Hi Mike,

Thanks for your reply! I learned a lot and will try it!

I guess the last one also applies to qiime rescript get-silva-data, like this:

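## get-silva-data
qiime rescript get-silva-data \
    --p-version '138.1' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences silva-138-99-rna-seqs.qza \
    --o-silva-taxonomy silva-138-99-tax.qza

## get-silva-data returns RNA sequences; convert them to DNA before dereplicating
qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138-99-rna-seqs.qza \
    --o-dna-sequences silva-138-99-seqs.qza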


## Dereplicate 
qiime rescript dereplicate \
    --i-sequences silva-138-99-seqs.qza \
    --i-taxa silva-138-99-tax.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences silva-138-99-seqs-derep.qza \
    --o-dereplicated-taxa silva-138-99-tax-derep.qza

## extract-reads
## forward primer: 28F-YM; reverse primer: 388R
qiime feature-classifier extract-reads \
   --i-sequences silva-138-99-seqs-derep.qza \
   --p-f-primer GAGTTTGATYMTGGCTCAG \
   --p-r-primer GCTGCCTCCCGTAGGAGT \
   --p-n-jobs 8 \
   --o-reads silva-138-99-seqs-segments.qza

Many thanks!
Birong
