I am analysing V1-V2 16S rRNA sequence data. I want to use qiime feature-classifier extract-reads to extract reads and train a classifier.
However, this data has mixed primers:
V1-V2 MiSeq primers (parts in bold are adapter sequences)
Forward: These primers are mixed at a 4:1:1:1 ratio (28F-YM is the 4)
28F-YM: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GAGTTTGATYMTGGCTCAG
28F-Borrellia: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GAGTTTGATCCTGGCTTAG
28FChloroflex: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GAATTTGATCTTGGTTCAG
28F-Bifdo: **TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG** GGGTTCGATTCTGGCTCAG
Let's start with the method that @colinbrislawn explained in detail in the linked thread: we need a metric to discuss whether primers are different or not.
So please tell us how many positions differ between each pair of these primers.
28F-YM: GAGTTTGATYMTGGCTCAG
28F-Borrellia: GAGTTTGATCCTGGCTTAG
28FChloroflex: GAATTTGATCTTGGTTCAG
28F-Bifdo: GGGTTCGATTCTGGCTCAG
28F-YM vs 28F-Borrellia == 3
28F-YM vs 28FChloroflex == 4
28F-YM vs 28F-Bifdo == 4
28F-Borrellia vs 28FChloroflex == 4
28F-Borrellia vs 28F-Bifdo == 4
28FChloroflex vs 28F-Bifdo == 6
(19 bp - 4 differences) / 19 bp length == 78.95% similar
So should --p-identity be 0.7 or 0.8?
However, another problem is that these primers are mixed at a 4:1:1:1 ratio (28F-YM is the 4). How do I take this into account? Should I use 28F-YM with --p-identity 0.7 or 0.8?
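The pairwise counts above can be reproduced with a short script. This is only a minimal sketch: it compares the primers character by character, so the ambiguity codes Y and M in 28F-YM count as mismatches against any plain base, which is how the numbers listed above come out. The helper names `mismatches` and `identity` are made up for illustration and are not part of QIIME 2.

```python
from itertools import combinations

# Primer sequences as posted (adapter portion removed).
primers = {
    "28F-YM": "GAGTTTGATYMTGGCTCAG",
    "28F-Borrellia": "GAGTTTGATCCTGGCTTAG",
    "28FChloroflex": "GAATTTGATCTTGGTTCAG",
    "28F-Bifdo": "GGGTTCGATTCTGGCTCAG",
}


def mismatches(a, b):
    """Count positions where two equal-length primers differ (plain character comparison)."""
    if len(a) != len(b):
        raise ValueError("primers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def identity(a, b):
    """Fraction of positions that are identical between two primers."""
    return (len(a) - mismatches(a, b)) / len(a)


# Mismatch count for every unordered pair of primers.
pairwise = {
    (n1, n2): mismatches(primers[n1], primers[n2])
    for n1, n2 in combinations(primers, 2)
}

for (n1, n2), d in pairwise.items():
    pct = identity(primers[n1], primers[n2])
    print(f"{n1} vs {n2}: {d} mismatches, {pct:.2%} identical")
```

Note that the most divergent pair (28FChloroflex vs 28F-Bifdo, 6 differences) works out to 13/19, or about 68.4%, which is below the 78.95% figure derived from 4 differences. These are primer-to-primer comparisons only; they say nothing about how well each primer matches the reference sequences.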
It appears that all of these primers bind to the same location, and only differ by a few bases. You could combine these 4 sequences into a pseudo-sequence using the IUPAC ambiguity codes like this:
An extreme case would result in something like this:
GRRTTYGATYMTGGYTYAG
^^ Warning: This might be too ambiguous and lead to spurious hits.
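For anyone who wants to build such a pseudo-sequence programmatically, here is a minimal sketch of a strict column-by-column merge using the standard IUPAC nucleotide codes. The function name `consensus` is hypothetical, and the script assumes all primers are the same length and already aligned, as they are here.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> single IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def consensus(seqs):
    """Merge equal-length sequences into one string that covers every base seen per column."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    merged = []
    for column in zip(*seqs):
        bases = set()
        for ch in column:
            bases.update(IUPAC[ch])  # expand ambiguity codes such as Y or M
        merged.append(CODE_FOR[frozenset(bases)])
    return "".join(merged)


primers = [
    "GAGTTTGATYMTGGCTCAG",  # 28F-YM
    "GAGTTTGATCCTGGCTTAG",  # 28F-Borrellia
    "GAATTTGATCTTGGTTCAG",  # 28FChloroflex
    "GGGTTCGATTCTGGCTCAG",  # 28F-Bifdo
]

merged = consensus(primers)
print(merged)
```

One detail worth noticing: at position 11 the strict merge yields H (A/C/T), because 28FChloroflex has a T there, so the result is slightly more permissive than a string that keeps M at that position.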
Since we can allow for a certain number of mismatches, let's try something like you suggested by slightly lowering the identity, or make a new sequence string (see below). I retained the initial ambiguous IUPAC bases and added additional ones where the common base had a stronger bond (i.e. a G or a C).
GARTTTGATYMTGGCTYAG
^^ This still might be too ambiguous, but you get the idea.
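To put a rough number on "too ambiguous", one option is to count how many concrete sequences a degenerate string expands to, and how many positions of each primer it still fails to cover. The sketch below does that under stated assumptions: `degeneracy` and `uncovered_positions` are hypothetical helpers, the labels in `candidates` are invented for this example, and the result says nothing about how extract-reads itself scores degenerate positions.

```python
from math import prod

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(pattern):
    """Number of concrete A/C/G/T sequences a degenerate pattern expands to."""
    return prod(len(IUPAC[ch]) for ch in pattern)


def uncovered_positions(pattern, primer):
    """Count positions where the primer allows a base that the pattern does not."""
    if len(pattern) != len(primer):
        raise ValueError("pattern and primer must have the same length")
    return sum(
        not set(IUPAC[q]) <= set(IUPAC[p])
        for p, q in zip(pattern, primer)
    )


primers = {
    "28F-YM": "GAGTTTGATYMTGGCTCAG",
    "28F-Borrellia": "GAGTTTGATCCTGGCTTAG",
    "28FChloroflex": "GAATTTGATCTTGGTTCAG",
    "28F-Bifdo": "GGGTTCGATTCTGGCTCAG",
}

# Candidate search strings; the labels are illustrative only.
candidates = {
    "28F-YM only": "GAGTTTGATYMTGGCTCAG",
    "moderate": "GARTTTGATYMTGGCTYAG",
    "extreme": "GRRTTYGATYMTGGYTYAG",
}

# For each candidate: (degeneracy, {primer name: uncovered positions}).
report = {
    label: (
        degeneracy(pattern),
        {name: uncovered_positions(pattern, seq) for name, seq in primers.items()},
    )
    for label, pattern in candidates.items()
}

for label, (fold, gaps) in report.items():
    print(f"{label}: {fold}-fold degenerate, uncovered positions per primer: {gaps}")
```

With these inputs the moderate string is 16-fold degenerate versus 128-fold for the extreme one, and it still leaves two uncovered positions each for 28FChloroflex and 28F-Bifdo, which is where a slightly lowered identity threshold would have to make up the difference.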
Another option, which I'd recommend, is to use only one of the primer sets, specifically the one that uses the 28F-YM primer, and then use the resulting extracted sequences as a reference pool for guiding the extraction of this region without the use of additional primer pairs. That is, follow the approach outlined here.