Custom OTU Clustering, specificity

osman · March 17, 2020, 2:52pm

Hello,
I am working on a custom OTU clustering algorithm and trying to measure its sensitivity and specificity.

Should I calculate the True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) values for each OTU and take the average, or should I calculate them once over all OTUs?

If so, how can I calculate True Negatives, given that the count is different for each OTU cluster?
My evaluation is:
If a sequence is correctly clustered, it is a TP.
If a sequence is not correctly clustered, it is an FP.
If a sequence could not be clustered and was left alone, it is an FN.

Now, for True Negatives, I can calculate them for each OTU:
TN = N - (TP + FP + FN)
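To spell out what I mean, here is a sketch of the per-OTU quantities using the standard definitions of sensitivity and specificity. The subscript i (one OTU), K (the number of OTUs), and the reading of N as the total number of sequences are notation I am assuming for this post:

```latex
\begin{align}
\mathrm{TN}_i &= N - (\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i) \\
\mathrm{sensitivity}_i &= \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i} \\
\mathrm{specificity}_i &= \frac{\mathrm{TN}_i}{\mathrm{TN}_i + \mathrm{FP}_i} \\
\overline{\mathrm{sensitivity}} &= \frac{1}{K} \sum_{i=1}^{K} \mathrm{sensitivity}_i,
\qquad
\overline{\mathrm{specificity}} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{specificity}_i
\end{align}
```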

Should I simply calculate sensitivity and specificity for each OTU alone and then take the average?

I would really appreciate your help in understanding this concept.

I hope someone can help me.

Thank you

colinbrislawn · March 17, 2020, 3:36pm

Good morning, Osman,

Cool!

It sounds like you are asking about different parts of the confusion matrix. If possible, I would report on all 4 combinations (TP, FP, TN, FN), then people can choose whatever metric they like best.

For example, I like balanced accuracy, and I could calculate that from (TP, FP, TN, FN).
The authors of OptiClust report the Matthews correlation coefficient.
The vsearch devs graphed TPR vs FDR as Figure 1, then reported Rand Index, recall, and precision in Figures 2 and 3.
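As a minimal sketch of how the first two of those fall out of the four counts: the function names below are made up for illustration and are not taken from either paper or tool, and the formulas are just the textbook definitions.

```python
import math


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Both take plain integer counts, so anyone with the reported (TP, FP, TN, FN) can recompute the metric they prefer.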

How do you define that? Does your custom algorithm consider something that others do not?

Colin

osman · March 17, 2020, 3:46pm

Basically, I used the Greengenes reference database and picked samples from some subgroups at the genus level.

Are the True Negative counts different for each OTU because each OTU has a different number of FPs and FNs?

In the end, let me ask it this way:

If I calculate sensitivity and specificity for each OTU and take the average, does that make sense?

osman · March 17, 2020, 3:46pm

Sorry, I should also mention that I am working on closed-reference clustering.

colinbrislawn · March 17, 2020, 6:24pm

I think I'm beginning to understand. So you take sequences from Greengenes, then cluster them against Greengenes, and it's a True Positive if it aligns to the same sequence? Or it's a TP if it aligns to the same genus?

Sure. The average of sensitivity and specificity is balanced accuracy. That's my favorite too.

But I'm not quite sure what you are measuring...

Colin

osman · March 17, 2020, 7:19pm

The selected sequences are excluded from the reference database.

My TP is when the sequence is in the same genus.

I am trying to measure how well my clustering algorithm agrees with the ground truth.

So, suppose there are 100 clusters in the ground truth and my algorithm produces some number of OTUs. I will calculate sensitivity and specificity for each OTU and take the average. I hope this measurement makes sense?

And finally, I am disregarding the specificity and sensitivity of OTUs with only 1 sequence. I hope this is also OK.
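Roughly, the calculation I have in mind looks like the sketch below. The names `OtuCounts` and `summarize` are hypothetical, the counts are assumed to be already tallied per OTU, and N is assumed to be the total number of sequences. It also computes a pooled version (counts summed over OTUs before dividing) so the two ways of averaging can be compared.

```python
from dataclasses import dataclass


@dataclass
class OtuCounts:
    tp: int  # sequences in the OTU whose genus matches
    fp: int  # sequences in the OTU whose genus does not match
    fn: int  # sequences counted as missed for this OTU

    @property
    def size(self):
        # Number of sequences actually placed in the OTU.
        return self.tp + self.fp


def sens_spec(tp, fp, fn, tn):
    # Assumes tp + fn > 0 and tn + fp > 0; raises ZeroDivisionError otherwise.
    return tp / (tp + fn), tn / (tn + fp)


def summarize(otus, n_total, min_size=2):
    """Return (macro average, pooled value, number of OTUs dropped)."""
    # Drop OTUs below min_size (singletons by default).
    kept = [c for c in otus if c.size >= min_size]
    if not kept:
        raise ValueError("no OTUs left after filtering")

    # Per-OTU values, with TN = N - (TP + FP + FN) for each OTU.
    per_otu = [
        sens_spec(c.tp, c.fp, c.fn, n_total - (c.tp + c.fp + c.fn))
        for c in kept
    ]
    macro = (
        sum(s for s, _ in per_otu) / len(per_otu),
        sum(p for _, p in per_otu) / len(per_otu),
    )

    # Pooled: sum the counts over the kept OTUs first, then divide once.
    tp = sum(c.tp for c in kept)
    fp = sum(c.fp for c in kept)
    fn = sum(c.fn for c in kept)
    tn = len(kept) * n_total - (tp + fp + fn)
    pooled = sens_spec(tp, fp, fn, tn)

    return macro, pooled, len(otus) - len(kept)
```

The macro average weights every OTU equally regardless of size, while the pooled value is dominated by the large OTUs, so the two can differ noticeably.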

colinbrislawn · March 17, 2020, 8:41pm

Ah OK! This sounds like "leave-one-out cross validation". That was used to test the RDP classifier and SINTAX.

It's a good method! I've never done this benchmark before, but I would be interested in seeing your results.

Based on what you have described here, it sounds like you are testing both OTU clustering, and also taxonomy assignment, is that correct? Which database are you using as ground truth?

Colin

osman · March 17, 2020, 9:25pm

I randomly selected 100 genera from Greengenes, each with a minimum of 50 samples in its genus group.

I then excluded the selected sequences from Greengenes.
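The selection and exclusion steps above can be sketched roughly as follows. This is only one possible reading of the procedure: the function name `split_holdout`, the `seq_to_genus` mapping, and especially the `n_queries_per_genus` parameter (how many sequences are pulled out of each chosen genus) are assumptions for illustration, not details stated here.

```python
import random
from collections import defaultdict


def split_holdout(seq_to_genus, n_genera, min_per_genus,
                  n_queries_per_genus, seed=0):
    """Split a {sequence_id: genus} mapping into held-out queries and a reference."""
    # Group sequence IDs by genus.
    by_genus = defaultdict(list)
    for seq_id, genus in seq_to_genus.items():
        by_genus[genus].append(seq_id)

    # Only genera with enough sequences are eligible.
    eligible = sorted(g for g, ids in by_genus.items()
                      if len(ids) >= min_per_genus)
    if len(eligible) < n_genera:
        raise ValueError("not enough genera meet the minimum size")

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(eligible, n_genera)

    # Assumption: hold out a fixed number of sequences per chosen genus.
    queries = {}
    for genus in chosen:
        for seq_id in rng.sample(sorted(by_genus[genus]), n_queries_per_genus):
            queries[seq_id] = genus  # ground-truth genus for later scoring

    # Everything not held out stays in the reference.
    reference = {s: g for s, g in seq_to_genus.items() if s not in queries}
    return queries, reference
```

The returned `queries` mapping keeps the genus of each held-out sequence, which is what the taxonomy is used for when scoring.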

The remaining Greengenes database is used as the closed reference. I am still learning. I am using the taxonomy only as the ground truth. I am also wondering about the difference between closed-reference OTU picking and taxonomy assignment :)))

I really appreciate your time and help. There is more that I have not described here, but I will share it in this forum too if it turns out to be something.