Classifying Sanger Sequences of variable length

sformel · September 25, 2018, 9:10pm

This is actually a followup on this discussion but I logged in 30 minutes too late to add it to that thread…c’est la vie.

I’ve been working on using Q2 to assign taxonomy to Sanger sequences of bacterial cultures. I’m getting results, but still trying to understand how the feature-classifier sklearn is interpreting my sequences. I’ve never worked with any machine-learning based software before and I’m having trouble understanding the process.

I’m classifying against the pre-trained SILVA132 full-length database, and many sequences get good assignments. But my sequence length varies because I trim each one individually depending on its quality. I’ve noticed that when I have long and short sequences together (1000 to 1400 bases), the classifier can’t find an assignment for the short sequences. However, if I group the short sequences by themselves, the classifier has no problem finding good assignments for them.

So I have two questions:

1. Is it a problem (i.e. does it produce low-quality taxonomic assignments) to break up my sequences by size, classify them separately, and then put them back together for analysis as a group?

2. Why does the classifier fail to find assignments for short sequences when they are grouped with long sequences?

Thanks!
Steve

Nicholas_Bokulich · September 25, 2018, 9:28pm

Hi @sformel,
Great questions!

No, definitely not a problem. But I think I might have a better solution (below).

I believe what is happening here is that when you have short + long sequences, the read orientation auto-detector is getting confused. You can manually set the read orientation to prevent this from occurring, if you know the orientation of your reads relative to the reference. If that does not fix things, let us know!

The classifier should not behave randomly and should assign the same taxonomy to your sequences whether they are queried alone or in a group — but we have seen similar reports lately of sequences that are "randomly" left unassigned, particularly when mixed with sets of other sequences. In 100% of cases it has been the orientation auto-detector getting confused.

(Basically, how this classifier works is: it figures out the orientation of your reads relative to the reference, reverse complements them if needed, chops them up into k-mers, and compares the k-mer frequencies to the reference database's k-mer frequencies to predict the most likely taxonomic affiliation.)
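To make those steps concrete, here is a toy Python sketch of the same idea: try both orientations, reverse complement if that matches better, then pick the reference with the most shared k-mers. This is only an illustration. The function names, the two made-up reference sequences, and `k=4` are assumptions, and the real classifier uses a trained scikit-learn model rather than this simple overlap score.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def best_match(read, ref_profiles, k):
    """Return (taxon, score) for the reference sharing the most k-mers."""
    profile = kmer_counts(read, k)
    scores = {
        taxon: sum((profile & ref).values())  # multiset intersection size
        for taxon, ref in ref_profiles.items()
    }
    taxon = max(scores, key=scores.get)
    return taxon, scores[taxon]


def classify(read, references, k=4):
    """Toy classifier: detect orientation, then pick the closest reference."""
    ref_profiles = {t: kmer_counts(s, k) for t, s in references.items()}
    fwd = best_match(read, ref_profiles, k)
    rev = best_match(reverse_complement(read), ref_profiles, k)
    # Keep whichever orientation matches the references better.
    if rev[1] > fwd[1]:
        return rev[0], "reverse-complement"
    return fwd[0], "same"


# Hypothetical reference sequences, invented for this example.
references = {
    "taxon_A": "ACGTTGCAAGGCTTACCGATAGGCTAACGT",
    "taxon_B": "TTTGGGCCCAAATTTCCCGGGAAATCTCTC",
}
read = references["taxon_A"][5:25]
print(classify(read, references))
print(classify(reverse_complement(read), references))
```

If the orientation guess goes wrong, the k-mers of the read no longer line up with the reference k-mers at all, which is why a confused auto-detector can leave an otherwise good sequence unassigned.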

Let us know if that fixes things!

sformel · October 12, 2018, 6:35pm

Sorry for taking so long to follow up. This is a side project so it took me a bit to sort it out. You were right, sequence orientation was the culprit. In my case it wasn’t the mixture of sequence lengths, but rather a mixed orientation of sequences. Once everything was oriented the same way and I manually set read orientation to “same”, everything was classified without problem.
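For anyone with a similar mixed-orientation set, here is a rough Python sketch of one way to put all reads in the same orientation before classifying. It is not what I actually ran, just an illustration: `orient_like_reference`, the made-up reference sequence, and `k=4` are all assumptions, and for real data you would compare against a real 16S reference with a longer k-mer size.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """Return the set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_like_reference(reads, reference, k=4):
    """Flip each read whose reverse complement matches the reference better."""
    ref_kmers = kmer_set(reference, k)
    oriented = {}
    for name, seq in reads.items():
        rc = reverse_complement(seq)
        same = len(kmer_set(seq, k) & ref_kmers)
        flipped = len(kmer_set(rc, k) & ref_kmers)
        # Ties keep the read as it was supplied.
        oriented[name] = rc if flipped > same else seq
    return oriented


# Hypothetical reference and reads, invented for this example.
reference = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"
reads = {
    "isolate_1": reference[2:22],                      # already forward
    "isolate_2": reverse_complement(reference[8:28]),  # reversed read
}
oriented = orient_like_reference(reads, reference)
for name, seq in oriented.items():
    print(name, seq)
```

Once every read is in the reference orientation, setting the read orientation to “same” (the `--p-read-orientation` option of `classify-sklearn`, as far as I know) skips the auto-detection step entirely.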

Thanks for all your help!
