Handling ambiguous nucleotides in database construction

Minor follow-up:
I noticed this only because, when I was dereplicating with VSEARCH, it threw an error indicating that a few I characters had been discarded. I wasn’t sure what an I was doing in there and was trying to figure out a way to find which sequences contained that offending character… which led me down the rabbit hole of wondering how the accepted non-ATCG characters were distributed…
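In case it helps anyone else chasing the same thing, here is a rough sketch of how one could survey a FASTA file for this. It is not what I actually ran, just a minimal plain-Python version. The names `read_fasta` and `survey` are made up for illustration, and I'm assuming that "proper" ambiguous characters means the standard IUPAC nucleotide codes.

```python
from collections import Counter

# Assumption: "proper" characters are the standard IUPAC nucleotide codes.
IUPAC = set("ACGTURYSWKMBDHVN")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Upper-case so soft-masked bases are counted the same way.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def survey(records):
    """Count non-ACGT characters and list records with non-IUPAC ones."""
    counts = Counter()   # how often each non-ACGT character occurs overall
    offenders = {}       # header -> sorted list of non-IUPAC characters
    for header, seq in records:
        counts.update(c for c in seq if c not in "ACGT")
        bad = set(seq) - IUPAC
        if bad:
            offenders[header] = sorted(bad)
    return counts, offenders
```

With a real file this would be called as something like `survey(read_fasta(open("seqs.fasta")))`, where `seqs.fasta` is a placeholder path. The counter gives the overall distribution of non-ATCG characters, and the dictionary points at the records holding anything outside the IUPAC set, such as an I.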

I’m going to stick with your advice and just leave the proper ambiguous characters, but in the cases where there is an I, I think I have to delete it/them. It turns out there is just one record among the >3 million records with an I, so deleting it is probably a safe bet. About 90% of the records contain only ATCG characters anyway, so I don’t foresee the ambiguous character issue being a huge problem. Nevertheless, I’m still curious what database curators do in these instances.
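For what it's worth, the deletion step could look roughly like the sketch below. This assumes that "delete" means dropping the whole record rather than stripping the single character, and that records are already loaded as (header, sequence) pairs. `drop_non_iupac` is a hypothetical helper name, not part of any tool.

```python
# Assumption: anything outside the standard IUPAC nucleotide codes is invalid.
IUPAC = set("ACGTURYSWKMBDHVN")


def drop_non_iupac(records):
    """Split (header, sequence) pairs into kept records and dropped headers."""
    kept, dropped = [], []
    for header, seq in records:
        if set(seq.upper()) <= IUPAC:
            # Only IUPAC codes present, so ambiguous bases like N or R stay.
            kept.append((header, seq))
        else:
            dropped.append(header)
    return kept, dropped
```

Returning the dropped headers as well makes it easy to confirm that only the one expected record was removed before writing anything back out.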

Thanks @Nicholas_Bokulich
