Handling ambiguous nucleotides in database construction

SoilRotifer · February 26, 2019, 9:28pm

I have had similar issues with I characters when using vsearch and other similar tools. Keep in mind that I stands for the chemical Inosine / deoxyInosine, which you can order as part of your oligos (e.g. from IDT). That is, it functions literally as an ambiguous N base. So, wherever you find an I, replace it with N. For example, take a look at the CO1 blocking primer we made in our swine diet paper.

As for the U --> T bit... In the past, when I put that set of scripts together, several downstream tools did not like U, or had problems when trying to map reads with T against reference data containing U. There were likely many other issues that I've long since forgotten, but it happened often enough that I just got into the habit of converting.
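For anyone who wants to do both conversions (I to N, U to T) without the original scripts, here is a minimal Python sketch. It is not the code referred to above; the function names `normalize_sequence` and `normalize_fasta_lines` are made up for illustration, and it assumes a FASTA file where header lines start with `>` and every other line is sequence.

```python
def normalize_sequence(seq: str) -> str:
    """Convert inosine (I) to N and uracil (U) to T, preserving letter case."""
    # str.maketrans builds a one-to-one character mapping: I->N, i->n, U->T, u->t.
    return seq.translate(str.maketrans("IiUu", "NnTt"))


def normalize_fasta_lines(lines):
    """Yield FASTA lines with sequence lines normalized and headers untouched.

    Assumption: any line starting with '>' is a header; all others are sequence.
    """
    for line in lines:
        if line.startswith(">"):
            yield line  # never touch headers, they may legitimately contain I or U
        else:
            yield normalize_sequence(line)
```

To use it on a file, something like `out.writelines(normalize_fasta_lines(open("refs.fasta")))` should do, where `refs.fasta` is a placeholder path. Headers are skipped on purpose so that taxon names and accessions are not altered.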

I normally follow the approach @Nicholas_Bokulich outlined. However, I have been known to be far stricter when making a reference database for very short marker sequences, e.g. various primer sets for the trnL (UAA) gene generate very short amplicons (~60 bp). Any ambiguous bases over such short read lengths can be problematic when making your own reference database, as a query may map to multiple references more easily than usual. In this case, I removed any reference sequence that contained any IUPAC ambiguity codes within the amplicon region of interest. Normally, I am not so conservative and only remove sequences that contain stretches of several IUPAC ambiguity letters (e.g. a stretch of 4-7 IUPAC ambiguity letters).
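The two filtering levels described above could be sketched roughly like this in Python. This is an illustration only, not the code I actually used: the names `has_any_ambiguity`, `has_ambiguous_run` and `keep_reference` are hypothetical, the default `min_run=4` is simply the low end of the 4-7 range mentioned above, and the strict mode assumes the sequence has already been trimmed to the amplicon region and that any I has already been converted to N.

```python
import re

# IUPAC nucleotide ambiguity codes (everything other than A, C, G, T/U).
AMBIGUOUS = "RYSWKMBDHVN"


def has_any_ambiguity(seq: str) -> bool:
    """True if the sequence contains at least one IUPAC ambiguity code."""
    return re.search(f"[{AMBIGUOUS}]", seq, re.IGNORECASE) is not None


def has_ambiguous_run(seq: str, min_run: int = 4) -> bool:
    """True if the sequence contains min_run or more consecutive ambiguity codes."""
    pattern = f"[{AMBIGUOUS}]{{{min_run},}}"  # e.g. "[RYSWKMBDHVN]{4,}"
    return re.search(pattern, seq, re.IGNORECASE) is not None


def keep_reference(seq: str, strict: bool = False, min_run: int = 4) -> bool:
    """Decide whether to keep a reference sequence.

    strict=True  : drop the sequence if it has any ambiguity code at all
                   (the approach for very short amplicons).
    strict=False : drop it only if it has a run of >= min_run ambiguity codes.
    """
    if strict:
        return not has_any_ambiguity(seq)
    return not has_ambiguous_run(seq, min_run)
```

With these definitions, a sequence carrying a single N would be kept under the default setting but dropped with `strict=True`. Adjust `min_run` to suit the length and variability of your marker.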

The general approach: consider the length and variability of the marker gene from which you are building your reference sequences.

I hope this helps. I'm happy to hear that old bit of code is still helpful for some folk.

-Best wishes
-Mike