I need to estimate how much the theoretical and the observed bacterial counts differ. (In fact, I need to understand how well the practical results correspond to the theoretical mixture.)
In general I have:
t = (t1, ..., tN) - theoretical vector, sum ti = 1, 3 < N < 100
s = (s1, ..., sM) - calculated read counts with assigned taxonomy; usually M > N.
s could also be normalized (sum sj = 1); that doesn't matter.
All data are represented as taxa at the species level.
So I need the best estimate of
d = dist(t, s)
How would you recommend approaching this
when some, or even most, of the theoretical taxa are missing from the taxonomy? (For example, it is known theoretically that some t_i is the species Anaerocellum thermophilum, but that species is not in the taxonomy I am using in the practical analysis, e.g. with QIIME 2.) In the worst case, imagine that 90% of the theoretical bacteria are completely missing from the taxonomy.
Hi @biojack ,
I recommend reading the papers that I suggested here. Especially the first, which describes a few metrics for this specific problem, both distance metrics and qualitative (presence-absence) based tests:
We also have a plugin to do this exact evaluation already (q2-quality-control). See the docs on the QIIME 2 website for some usage examples (there are also some tutorials in the "community tutorials" section of this forum with examples, e.g., using fungal data).
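Independent of the plugin, the comparison d = dist(t, s) from the question can be sketched by hand. The example below is only a minimal illustration, not what q2-quality-control does internally: it assumes Bray-Curtis dissimilarity as the distance (one common choice, not one prescribed in this thread), and the taxon names and numbers other than Anaerocellum thermophilum are made up. Both vectors are aligned over the union of taxa, so an expected species that is missing from the reference taxonomy simply has an observed abundance of zero.

```python
# Hypothetical data: expected relative abundances (t) and observed read counts (s).
expected = {
    "Escherichia coli": 0.5,
    "Bacillus subtilis": 0.3,
    "Anaerocellum thermophilum": 0.2,  # assumed absent from the reference taxonomy
}
observed_counts = {
    "Escherichia coli": 600,
    "Bacillus subtilis": 300,
    "Bacillus cereus": 100,  # a label that was not expected
}


def normalize(abundances):
    """Scale values so that they sum to 1."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("abundances must have a positive sum")
    return {taxon: value / total for taxon, value in abundances.items()}


def bray_curtis(t, s):
    """Bray-Curtis dissimilarity over the union of taxa in t and s."""
    t, s = normalize(t), normalize(s)
    taxa = set(t) | set(s)  # taxa missing on one side count as 0 there
    numerator = sum(abs(t.get(x, 0.0) - s.get(x, 0.0)) for x in taxa)
    denominator = sum(t.get(x, 0.0) + s.get(x, 0.0) for x in taxa)
    return numerator / denominator


d = bray_curtis(expected, observed_counts)
print(f"d = {d:.3f}")
```

With this setup, d grows toward 1 as more of the expected mixture is missing from the taxonomy, so in the 90% worst case the distance mostly reflects the database gap rather than the classifier.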
I think you maybe mentioned in a previous thread that you want to do this test to compare different reference databases? Is that correct? In that case we have some additional methods for such database evaluations in RESCRIPt, and a benchmark of several different databases in this paper:
Not exactly. The question here is not to compare two libraries, but to compare the theoretical mixture percentages with the practical results for one particular library. Your suggestion could probably be applied here too, but I'm not sure.
@Nicholas_Bokulich hi, I should ask one more minor question, to avoid confusing readers of this topic.
In the TAR (taxon accuracy rate) metric from the article mentioned above, does an FP (false positive) count only a wrong species at the same level, or also a misclassification at the level above? I mean if there species
The methods section of that article describes all of those metrics in more detail, and how they are calculated at different ranks. TAR/TDR are semi-quantitative (presence-absence) and Precision/Recall are fully quantitative (calculated per read).
In cases of underclassification (what you describe), this would be a false negative at the species level (the expected species is not seen) but it would not be a false positive. As you suggest, a FP would be when the wrong label is given (i.e., misclassification). This is the difference between under- and misclassification, and also between FP and FN.
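As a rough illustration of that distinction, here is a minimal sketch of presence-absence scoring at the species level. It is not the plugin's or the paper's implementation. It assumes TAR = TP / (TP + FP) and TDR = TP / (TP + FN), simplified "g__...;s__..." labels, and that an observed label without a species name is dropped before scoring; the function names and all labels except Anaerocellum thermophilum are hypothetical.

```python
def species_level(labels):
    """Keep only labels whose last rank is a named species (assumed 's__' prefix)."""
    kept = set()
    for label in labels:
        last_rank = label.split(";")[-1].strip()
        if last_rank.startswith("s__") and len(last_rank) > 3:
            kept.add(label)
    return kept


def taxon_rates(expected, observed):
    """Return (TAR, TDR) from presence-absence of species-level labels."""
    expected, observed = species_level(expected), species_level(observed)
    tp = len(expected & observed)  # expected and seen
    fp = len(observed - expected)  # seen but not expected (misclassification)
    fn = len(expected - observed)  # expected but not seen
    tar = tp / (tp + fp) if (tp + fp) else 0.0
    tdr = tp / (tp + fn) if (tp + fn) else 0.0
    return tar, tdr


expected = [
    "g__Escherichia;s__coli",
    "g__Bacillus;s__subtilis",
    "g__Anaerocellum;s__thermophilum",
]
observed = [
    "g__Escherichia;s__coli",        # correct: TP
    "g__Bacillus",                   # underclassified: dropped, so B. subtilis is an FN
    "g__Clostridium;s__butyricum",   # wrong species label: FP
]

tar, tdr = taxon_rates(expected, observed)
print(f"TAR = {tar:.2f}, TDR = {tdr:.2f}")
```

In this toy case the genus-only label lowers TDR (the expected species goes undetected) but leaves TAR untouched, while the wrong species label lowers TAR.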
That answers my question about TAR. Thank you!
I understand this, but I can't understand the difference. In Mock-20 the mixture distribution is uniform, so for each bacterium in the composition there should be enough reads. So IMO there should be no difference between TDR and Recall (if Recall equals 1, then TDR should equal 1 too). But there is one. It looks like I am misunderstanding something.