Handling ambiguous nucleotides in database construction

In looking through a terrific script detailing an approach that @SoilRotifer put together for creating a SILVA database, I was wondering what folks generally do with non-ATCG characters among the representative sequences.

I’ve noticed that a U --> T conversion is standard, but I was wondering what folks do with ambiguous characters like W, R, Y, etc. Are these sequences left alone? Are they deleted and gaps inserted?

I’m guessing these are left alone, but now appreciate that you can inflate your data base quite a bit if your sequences contain just a few randomly distributed ambiguous (ex. R, Y) or undefined (N) characters, or alternatively, one could deflate the database size quite a bit through clustering. I’m finding that the COI database I’m working with has many instances where this is the case.

Thanks for any comments!

Hi @devonorourke,
I personally would just leave those alone. Definitely do not delete them/insert gaps. The ambiguous bases should be interpreted appropriately, e.g., by scikit-bio (I think!), but gaps would cause issues if using a function that expects an unaligned fasta format.

But let’s see what @SoilRotifer suggests!

1 Like

Minor follow up -
I noticed this only because when I was dereplicating with Vsearch it threw an error indicating that there were a few I characters that were discarded. I wasn’t sure what that was doing in there and was trying to figure out a way to find which sequences contained that offending character… Which led me down the rabbit hole of wondering how accepted nonATCG characters were distributed…

I’m going to stick with your advice and just leave the proper ambiguous characters, but in the cases where there is an I, I think I have to delete it/them. It turns out there is just one record among the > 3 million records with an I so deleting it is probably a safe bet. About 90% of the records contain only ATCG characters anyway, so I don’t foresee the ambiguous character issue as a huge problem anyhow. Nevertheless, still curious what database curators do in these instances.

Thanks @Nicholas_Bokulich


Hi @devonorourke & @Nicholas_Bokulich,

I have had similar issues with I characters when using vsearch and other similar tools. Keep in mind that I stands for the chemical Inosine / deoxyInosine, which you can order as part of your oligos (e.g. from IDT). That is, it functions literally as an ambiguous N base. So, wherever you find and I replace with N. For example, take a look at the CO1 blocking primer we made in our swine diet paper.

As for the U --> T bit… In the past, when I put that set of scripts together, several down stream tools did not like U, or had problems when trying to map reads with T against reference data containing U. There were likely many other issues that I’ve long since forgotten, but it happened often enough that I just got into the habit of converting.

I normally follow the approach @Nicholas_Bokulich outlined. However, I have been known to be far more strict when making a reference database for very short marker sequences, e.g. various primer sets for the trnL (UAA) gene generate very short amplicons ~ 60 bp. Thus, any ambiguous bases over such short read lengths can be problematic when making your own reference database, as a query may more easily map to multiple references, at least more so than usual. In this case, I removed any reference sequence that contained any IUPAC ambiguity codes within this amplicon region of interest. Normally, I am not so conservative and only remove sequence data that contains stretches of several IUPAC ambiguity letters (e.g. a stretch of 4-7 IUPAC ambiguity letters).

The general approach: consider the length and variability of the marker gene from which you are building your reference sequences.

I hope this helps. I’m happy to hear that old bit of code is still helpful for some folk. :slight_smile:

-Best wishes

1 Like

@devonorourke, have you checked out the CO1 database / classifier from Terri Porter’s lab?


Thanks for the comments @SoilRotifer .
Didn’t know about that other COI database, but it looks like they keep all COI records, not just arthropods, which is what I’m after. I was also interested in pulling data directly from BOLD in a systematic manner rather than depending on others preformatted sets with their own assumptions, but I appreciate the tip - now I have other databases to compare classification results with!
Jon Palmer’s amptk pipeline also comes with a preformatted COI dataset, which contains it’s own distinct strategy for building a database. Seems like at least a handful of people out there are up to the challenge, albeit with their own preferences in how to curate the records.

Appreciate the feedback

1 Like

Awesome. Glad we could help you out on several fronts! I try to follow the CO1 based sequencing literature when I can. Keep us posted as to how things work out! :slight_smile: