Comparison of theoretical bacteria distribution and practical counts

Hi

I need to estimate how much the theoretical and practical bacteria counts differ (in fact, I need to understand how well the practical results correspond to the theoretical mixture).

In general I have:

t = (t1, ..., tN) - theoretical vector, sum ti = 1, 3 < N < 100
s = (s1, ..., sM) - calculated read counts with assigned taxonomy; usually M > N

s could also be normalized (sum sj = 1); that doesn't matter.

All data are represented as taxa down to species level.

So I need the best estimate of

d = dist(t, s)

How would you recommend solving this:

  1. in general?
  2. in the case where some, or a large part, of the theoretical taxa are missing from the taxonomy (for example, some t_i is known to be the species Anaerocellum thermophilum, but that species is absent from the taxonomy I am using in the practical analysis, e.g. in QIIME 2)? In the worst case, imagine that 90% of the theoretical bacteria are entirely missing from the taxonomy.

Thank you very much for your attention.

Hi @biojack ,
I recommend reading the papers that I suggested here. Especially the first, which describes a few metrics for this specific problem, both distance metrics and qualitative (presence-absence) based tests:

We also have a plugin to do this exact evaluation already (q2-quality-control). See the docs on the QIIME 2 website for some usage examples (there are also some tutorials in the "community tutorials" section of this forum with examples, e.g., using fungal data).
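As a rough sketch of what such a comparison involves (this is not the plugin's code, and the taxon names and numbers below are made up), the expected and observed vectors can be aligned on the union of their taxa and compared with a distance such as Bray-Curtis. Taxa that appear on only one side simply get an abundance of 0 on the other, which also covers the M > N case:

```python
def normalize(abundances):
    """Scale a {taxon: value} mapping so the values sum to 1."""
    total = sum(abundances.values())
    return {taxon: value / total for taxon, value in abundances.items()}


def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two {taxon: abundance} mappings.

    Both inputs are normalized first, so raw read counts and relative
    abundances can be mixed. Taxa missing from one side count as 0 there.
    """
    e = normalize(expected)
    o = normalize(observed)
    taxa = set(e) | set(o)
    numerator = sum(abs(e.get(t, 0.0) - o.get(t, 0.0)) for t in taxa)
    denominator = sum(e.get(t, 0.0) + o.get(t, 0.0) for t in taxa)
    return numerator / denominator


# Hypothetical data: theoretical proportions and observed read counts.
expected = {"sp_A": 0.5, "sp_B": 0.3, "sp_C": 0.2}
observed = {"sp_A": 600, "sp_B": 300, "sp_D": 100}

d = bray_curtis(expected, observed)
print(d)
```

A value of 0 means identical compositions and 1 means no shared taxa. If most expected species are absent from the reference taxonomy, this distance will be dominated by those missing taxa, so it may make sense to also compare at a higher rank (e.g. genus) where the labels can still match.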

I think you may have mentioned in a previous thread that you want to do this test to compare different reference databases? Is that correct? In that case we have some additional methods for such database evaluations in RESCRIPt, and a benchmark of several different databases in this paper:

Good luck!


Not exactly. The question here is not to compare two libraries but to compare the theoretical mixture percentages with the practical results for a particular library. Your suggestion could probably also be applied here, but I'm not sure.

Before closing the topic, I should separately point out the article that is most useful for this (if you just follow the link above, it goes to the RDP classifier article, which is not the most useful one).

So


@Nicholas_Bokulich hi, I should ask one more minor question to avoid confusing readers of this topic.

For the TAR (taxon accuracy rate) metric from the article mentioned above, do false positives (FP) include only wrong species at the same level, or also classifications that stop at a level above? I mean, if the expected species is

Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter;Acinetobacter_baumannii_ATCC_17978

and only a genus-level classification is obtained (instead of species)

Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter

would it be a false positive, or would it be excluded from the TAR calculation? And would a wrong genus be counted as a FP, for example

Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella

There is also an optional question.

The article also has Precision/Recall metrics, which resemble the TAR/TDR metrics but are calculated in their own way. The most confusing part is that in the GitHub supplementary material those metrics give much better results than TAR/TDR: tax-credit-data/evaluate-classification-accuracy-naive-bayes-only.ipynb at master · caporaso-lab/tax-credit-data · GitHub. It would be great to understand the reason for this effect.


Hi @biojack ,

The methods section of that article describes all of those metrics in more detail, and how they are calculated at different ranks. TAR/TDR are semi-quantitative (presence-absence) and Precision/Recall are fully quantitative (calculated per read).
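To illustrate why a presence-absence metric and a per-read metric can diverge, here is a toy sketch. The definitions are deliberately simplified and are my assumption for illustration, not the exact formulas from the article or the tax-credit code, and the read data are invented. Only precision is shown on the per-read side:

```python
# Hypothetical per-read results as (true taxon, predicted taxon) pairs.
# 98 reads are classified correctly; 2 reads go to spurious taxa.
reads = (
    [("sp_A", "sp_A")] * 49
    + [("sp_B", "sp_B")] * 49
    + [("sp_A", "sp_X"), ("sp_B", "sp_Y")]
)

expected = {true for true, _ in reads}   # taxa that should be present
observed = {pred for _, pred in reads}   # taxa that were reported

# Presence-absence view: every taxon counts once, whatever its abundance.
tar = len(observed & expected) / len(observed)  # fraction of observed taxa that were expected
tdr = len(observed & expected) / len(expected)  # fraction of expected taxa that were observed

# Per-read view: every read counts once (simplified definition).
tp = sum(1 for true, pred in reads if pred == true)
fp = sum(1 for true, pred in reads if pred != true)
precision = tp / (tp + fp)

print(tar, tdr, precision)
```

In this toy case two stray reads create two spurious taxa, which halves TAR while per-read precision barely moves.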

In cases of underclassification (what you describe), this would be a false negative at the species level (the expected species is not seen) but it would not be a false positive. As you suggest, a FP would be when the wrong label is given (i.e., misclassification). This is the difference between under- and misclassification, and also between FP and FN.
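A minimal sketch of that distinction, assuming lineages are semicolon-delimited strings and that an observed label counts as underclassified when it is a proper prefix of an expected lineage. This is one way to operationalize the description above, not the code used in the article; the function names and the second expected lineage are hypothetical placeholders:

```python
def split(lineage):
    """Turn 'a;b;c' into ('a', 'b', 'c')."""
    return tuple(part.strip() for part in lineage.split(";"))


def categorize(observed, expected_lineages):
    """Label one observed lineage as match, underclassified, or misclassified."""
    obs = split(observed)
    expected = [split(e) for e in expected_lineages]
    if obs in expected:
        return "match"
    # Proper prefix of an expected lineage: right direction, stopped early.
    if any(len(obs) < len(e) and e[:len(obs)] == obs for e in expected):
        return "underclassified"
    return "misclassified"


prefix = "Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae"
expected = [
    prefix + ";Acinetobacter;Acinetobacter_baumannii_ATCC_17978",
    "Phylum_X;Class_X;Order_X;Family_X;Genus_X;Species_X",  # placeholder lineage
]
observed = [
    prefix + ";Acinetobacter",                               # genus only
    prefix + ";Moraxella",                                   # wrong genus
    "Phylum_X;Class_X;Order_X;Family_X;Genus_X;Species_X",  # exact match
]

labels = [categorize(o, expected) for o in observed]

tp = sum(1 for e in expected if e in observed)      # expected and seen
fn = len(expected) - tp                             # expected but not seen
fp = labels.count("misclassified")                  # wrong label given
tar = tp / (tp + fp)
tdr = tp / (tp + fn)

print(labels, tar, tdr)
```

Here the genus-only Acinetobacter label leaves the expected species undetected (a false negative that lowers TDR) but is not counted as a false positive, while the Moraxella label is a false positive that lowers TAR.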


Hi @Nicholas_Bokulich

That answers my question about TAR. Thank you!

I understood this, but I can't understand this difference. Mock-20 has a uniform mixture distribution, so there should be enough reads obtained for each bacterium in the composition. So IMO there should be no difference between TDR and Recall (if Recall equals 1, then TDR should equal 1 too). But there is one. It looks like I am misunderstanding something.