Unidentified sequences in the UNITE database — why can classify-sklearn classify these?

Great questions @chaibenl!

No. This is how those sequences are actually annotated in the reference database. These are unidentified fungi that were added to the database without a taxonomic annotation, which is not particularly useful. This is distinct from receiving a classification of k__Fungi, which indicates that the query sequence could not be classified beyond kingdom level (and is actually quite likely to be junk/non-target DNA).

No! They are merely noise, and quite likely to reduce classification accuracy. I have removed sequences like this from databases in the past.
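
If you want to try that kind of cleanup yourself, here is a minimal Python sketch of the idea. It is not a QIIME 2 command, and it rests on assumptions you should check against your own files: that each taxonomy string is semicolon-delimited, that ranks carry a prefix such as `p__`, and that the uninformative label is literally `unidentified`. The function names `is_unidentified` and `filter_taxonomy` are made up for illustration.

```python
def is_unidentified(taxonomy, keep_ranks=1):
    """Return True if nothing past the first `keep_ranks` ranks has a real name.

    Assumes a semicolon-delimited string such as
    "k__Fungi;p__unidentified;c__unidentified" (format is an assumption).
    """
    ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
    deeper = ranks[keep_ranks:]
    # Drop the rank prefix (e.g. "p__") so only the label is compared.
    labels = [r.split("__", 1)[-1] for r in deeper]
    # An entry with no deeper ranks at all is also treated as uninformative.
    return all(label in ("", "unidentified") for label in labels)


def filter_taxonomy(rows, keep_ranks=1):
    """Keep only (seq_id, taxonomy) pairs that carry some annotation."""
    return [(seq_id, tax) for seq_id, tax in rows
            if not is_unidentified(tax, keep_ranks)]


# Toy records, invented purely to show the behaviour.
rows = [
    ("seq1", "k__Fungi;p__Ascomycota;c__unidentified"),
    ("seq2", "k__Fungi;p__unidentified;c__unidentified"),
    ("seq3", "k__Fungi;p__Basidiomycota;c__Agaricomycetes"),
]
kept = filter_taxonomy(rows)
kept_ids = [seq_id for seq_id, _ in kept]
```

With these toy records only `seq2` is dropped, because it has no named rank below kingdom. The matching reference sequences would then need to be subset to the kept IDs before retraining the classifier.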

Whoever deposited the sequences did not or could not identify the clade they came from, but classify-sklearn is able to do a better job of predicting their affiliation. The key point is that unidentified ≠ unclassifiable!
