Sequence IDs for classifier contain both numerical and alphabetical characters

aalex · May 22, 2019, 5:02pm

Hello,

I noticed that in the linked forum post you brought up, the suggested solution assumes that sequence IDs are solely numerical. I have run into a similar problem, but my sequence IDs contain both numerical and alphabetical characters.

How can I circumvent this? And what are the character requirements for using the classifier?

Cheers!
Andrea

Nicholas_Bokulich · May 22, 2019, 6:13pm

Welcome to the forum @aalex!

See the awk script used in this tutorial:

As for the character requirements: the sequences must consist of uppercase valid nucleotide base codes (degenerate codes are accepted).
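To make the character requirement concrete, here is a minimal Python sketch that reports which characters in a sequence fall outside an allowed alphabet. The alphabet (the standard uppercase IUPAC DNA codes) and the function name `invalid_characters` are assumptions for illustration; this is not the check the classifier itself runs.

```python
# Assumed alphabet: uppercase IUPAC DNA codes, including degenerate codes.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def invalid_characters(sequence):
    """Return a sorted list of characters in `sequence` outside IUPAC_DNA."""
    return sorted(set(sequence) - IUPAC_DNA)


print(invalid_characters("ACGTNRY"))    # []
print(invalid_characters("ACGTIacg-"))  # ['-', 'I', 'a', 'c', 'g']
```

Lowercase letters, gap characters, and non-nucleotide letters such as "I" all show up as invalid under this assumed alphabet.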

aalex · May 22, 2019, 8:40pm

Thank you for responding to me so quickly!

I'm not sure the script provided helps with the problem I am encountering, though I think I may have miscommunicated. As far as I can tell, I am not getting the ValueError from any of my sequences.

This is the error I seem to have encountered, but "I" only occurs as a character in the sequence IDs, which are formatted as follows:

However, if I have understood the tutorial correctly, it is only meant to remove lowercase letters from the sequences themselves.

What I was thinking of doing is changing every character in the sequence IDs that is not in the list of acceptable characters to its lowercase counterpart. I am uncertain whether this would be sufficient to address the error and allow me to go on and train the classifier.

Sorry for all this!

Nicholas_Bokulich · May 22, 2019, 9:17pm

Yes, it looks like you are getting a different error message from the one in the topic you linked to. The issue in that topic was lowercase characters in the sequence. Your issue is an invalid character "I".

Those sequence IDs contain "1" (one) characters, not "I" (eye) characters.

The sequence IDs are not relevant here. Do not attempt to modify them; it will not fix your problem!

Instead, look for "I" characters in the sequences themselves and convert them, or maybe remove the affected sequence. Do you have amino acid sequences in your file? "I" is not a degenerate nucleotide base code as far as I know.
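One way to locate the offending records is to scan only the sequence lines of the FASTA file and ignore the header lines. The sketch below is a rough illustration in Python; the function name `find_invalid_records` and the allowed alphabet (uppercase IUPAC DNA codes) are assumptions, not part of any QIIME 2 tooling.

```python
def find_invalid_records(fasta_lines, allowed="ACGTRYSWKMBDHVN"):
    """Map each FASTA header to the invalid characters found in its sequence.

    Header lines (starting with '>') are never checked, so an "I" in a
    sequence ID is not reported.
    """
    allowed = set(allowed)
    bad = {}
    header = None
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:].strip()
        elif header is not None:
            extra = set(line) - allowed
            if extra:
                bad.setdefault(header, set()).update(extra)
    return {h: sorted(chars) for h, chars in bad.items()}


records = [">ID1", "ACGTNN", ">ID2", "ACGIT", "TTGA"]
print(find_invalid_records(records))  # {'ID2': ['I']}
```

For a real file, the same function can be given an open file handle, for example `find_invalid_records(open("seqs.fasta"))`, where `seqs.fasta` is a placeholder file name.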

Note: you also have gaps in your sequence(s), and apparently spaces between accessions? These gaps and spaces should be removed or they will likely cause other problems.
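Removing gaps and spaces from the sequence lines could look something like the following Python sketch. It assumes that gaps are written as "-" or "." and leaves header lines untouched; the function name `strip_gaps` is hypothetical, and whether the headers also need cleaning depends on the file.

```python
def strip_gaps(fasta_lines, gap_chars="-."):
    """Return FASTA lines with gaps and whitespace removed from sequences."""
    # Translation table that deletes gap characters, spaces, and tabs.
    drop = str.maketrans("", "", gap_chars + " \t")
    cleaned = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            cleaned.append(line)          # headers are kept as-is
        else:
            sequence = line.translate(drop)
            if sequence:                  # skip lines that become empty
                cleaned.append(sequence)
    return cleaned


print(strip_gaps([">seq1", "AC-G T", "..AC"]))  # ['>seq1', 'ACGT', 'AC']
```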

Let us know if that helps!
