Hi @devonorourke & @Nicholas_Bokulich,
I have had similar issues with `I` characters when using vsearch and other similar tools. Keep in mind that `I` stands for inosine / deoxyinosine, which you can order as part of your oligos (e.g. from IDT). That is, it functions literally as an ambiguous `N` base. So, wherever you find an `I`, replace it with `N`. For example, take a look at the CO1 blocking primer we made in our swine diet paper.
As for the `U --> T` bit… In the past, when I put that set of scripts together, several downstream tools did not like `U`, or had problems when trying to map reads with `T` against reference data containing `U`. There were likely many other issues that I’ve long since forgotten, but it happened often enough that I just got into the habit of converting.
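As a rough sketch of the two substitutions described above (`I` to `N` and `U` to `T`), something like the following would do it in Python. This is not the original set of scripts. The function names `normalize_bases` and `normalize_fasta_lines` are made up for illustration, and it assumes nucleotide FASTA input (in protein sequences `I` is isoleucine, so do not run it on those).

```python
def normalize_bases(seq: str) -> str:
    """Replace inosine (I) with N and uracil (U) with T, preserving case."""
    # Hypothetical helper, not part of vsearch or any other tool.
    return seq.translate(str.maketrans("IiUu", "NnTt"))


def normalize_fasta_lines(lines):
    """Yield FASTA lines with sequence lines normalized and headers untouched."""
    for line in lines:
        line = line.rstrip("\n")
        # Header lines start with '>' and may legitimately contain I or U.
        yield line if line.startswith(">") else normalize_bases(line)


if __name__ == "__main__":
    example = [">primer_with_inosine\n", "GGIACIGGUTGAAC\n"]
    for out in normalize_fasta_lines(example):
        print(out)
```

Headers are skipped on purpose so that sequence IDs and taxonomy strings are not altered.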
I normally follow the approach @Nicholas_Bokulich outlined. However, I have been known to be far stricter when making a reference database for very short marker sequences. For example, various primer sets for the trnL (UAA) gene generate very short amplicons of ~60 bp. Over such short lengths, any ambiguous base in a reference can be problematic, because a query may map to multiple references more easily than usual. In that case, I removed any reference sequence that contained any IUPAC ambiguity codes within the amplicon region of interest. Normally, I am not so conservative and only remove sequences that contain stretches of several IUPAC ambiguity letters (e.g. a stretch of 4-7 IUPAC ambiguity letters).
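To make the two filtering modes concrete, here is a minimal sketch in Python. It is only an illustration, not the code I actually used. The names (`has_any_ambiguity`, `has_ambiguous_run`, `filter_references`) are hypothetical, records are assumed to be simple `(id, sequence)` tuples already trimmed to the amplicon region, and the default run length of 4 is just the low end of the 4-7 range mentioned above.

```python
import re

# IUPAC nucleotide ambiguity codes, i.e. everything except A, C, G, T/U.
AMBIGUOUS = "RYSWKMBDHVN"


def has_any_ambiguity(seq: str) -> bool:
    """Strict mode: True if the sequence contains any ambiguity code."""
    return any(base in AMBIGUOUS for base in seq.upper())


def has_ambiguous_run(seq: str, min_run: int = 4) -> bool:
    """Relaxed mode: True if there is a run of >= min_run ambiguity codes."""
    # min_run=4 is an assumed threshold; tune it to your marker gene.
    pattern = re.compile(f"[{AMBIGUOUS}]{{{min_run},}}")
    return pattern.search(seq.upper()) is not None


def filter_references(records, strict: bool = False, min_run: int = 4):
    """Return the (id, seq) records that pass the chosen ambiguity filter."""
    if strict:
        reject = has_any_ambiguity
    else:
        def reject(seq):
            return has_ambiguous_run(seq, min_run)
    return [(rec_id, seq) for rec_id, seq in records if not reject(seq)]


if __name__ == "__main__":
    refs = [("a", "ACGTACGT"), ("b", "ACGNACGT"), ("c", "ACNNNNGT")]
    print(filter_references(refs, strict=True))  # short markers, e.g. trnL
    print(filter_references(refs))               # longer markers
```

With `strict=True` only sequences free of ambiguity codes survive, which is what I would reach for with very short amplicons. The default mode only drops sequences with long ambiguous stretches.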
The general approach: consider the length and variability of the marker gene from which you are building your reference sequences.
I hope this helps. I’m happy to hear that old bit of code is still helpful for some folk.
-Best wishes
-Mike