I have another separate problem:
I extracted 626 rep sequences from the UNITE ITS (ver7_99, 2017) reference set “sh_taxonomy_qiime_ver7_99_01.12.2017_dev.txt” that are labeled “k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified”.
But when I classified these sequences with a Naive Bayes classifier trained on the same reference set that contains them, many were assigned with high confidence (>0.7) to lineages other than this label.
In fact, 130 of these 626 rep sequences were assigned at species-level resolution.
My three related questions are:
What could be the cause of these discrepancies?
Isn’t “k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified” equivalent to “k__Fungi”?
Does the inclusion of these types of sequences in the training set add any value to the trained classifier, or are they merely noise?
No. This is how those sequences are actually annotated in the reference database. These are unidentified fungi that were added to the database without annotation, which is not particularly useful. This is distinct from receiving a classification of k__Fungi, which indicates that the classifier could not assign the sequence to any rank below kingdom (and that the sequence is actually quite likely to be junk/non-target DNA).
No! Merely noise, and quite likely to reduce classification accuracy. I have removed sequences like this from databases in the past.
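For anyone wanting to do the same, here is a rough sketch of that kind of filtering. It assumes the taxonomy file is tab-separated with a sequence ID and a lineage string on each line; the function names and the sample IDs are made up for illustration and are not part of QIIME 2 or UNITE.

```python
def is_fully_unidentified(lineage: str) -> bool:
    """Return True when every rank below kingdom is 'unidentified'."""
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    below_kingdom = ranks[1:]
    # A bare kingdom label (no lower ranks listed) is not treated as a match.
    return bool(below_kingdom) and all(
        r.split("__", 1)[-1] == "unidentified" for r in below_kingdom
    )


def filter_taxonomy(lines):
    """Yield the 'id<TAB>lineage' lines that carry some annotation below kingdom."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _seq_id, lineage = line.split("\t", 1)
        if not is_fully_unidentified(lineage):
            yield line


# Hypothetical two-line example (IDs are placeholders).
sample = [
    "SH_A\tk__Fungi;p__unidentified;c__unidentified;o__unidentified;"
    "f__unidentified;g__unidentified;s__unidentified",
    "SH_B\tk__Fungi;p__Ascomycota;c__unidentified;o__unidentified;"
    "f__unidentified;g__unidentified;s__unidentified",
]
kept = list(filter_taxonomy(sample))
print(len(kept))
```

The matching sequences would need to be dropped from the reference FASTA as well, so that the sequence and taxonomy files stay in sync before retraining the classifier.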
Whoever deposited the sequences did not or could not identify the clade they came from. But classify-sklearn is able to do a better job of predicting their affiliation. The discrepancy is that unidentified ≠ unclassifiable!
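To see how deep the classifier actually went for each of those sequences, something like the sketch below could be used. It assumes you have already pulled (taxon, confidence) pairs out of the classifier's output; the helper names and the example rows are hypothetical, and the 0.7 cutoff simply mirrors the threshold mentioned in the question.

```python
def deepest_identified_rank(taxon: str) -> str:
    """Return the prefix (k, p, c, o, f, g, s) of the deepest rank that has a real name."""
    deepest = ""
    for rank in taxon.split(";"):
        prefix, _, name = rank.strip().partition("__")
        if name and name != "unidentified":
            deepest = prefix
    return deepest


def count_species_level(rows, min_confidence=0.7):
    """Count assignments that reach species level above the confidence cutoff."""
    return sum(
        1
        for taxon, confidence in rows
        if confidence > min_confidence and deepest_identified_rank(taxon) == "s"
    )


# Hypothetical rows for illustration only.
rows = [
    ("k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;"
     "f__Aspergillaceae;g__Aspergillus;s__Aspergillus_niger", 0.95),
    ("k__Fungi;p__Ascomycota", 0.99),
    ("k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;"
     "f__Agaricaceae;g__Agaricus;s__Agaricus_bisporus", 0.50),
]
n_species = count_species_level(rows)
print(n_species)
```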