How can the Gini index formula used in QIIME2 be simplified? For example, written as an R script.

yangyue · October 31, 2024, 10:19am

The Gini index result from QIIME2 is puzzling!
I calculated the Gini index in QIIME2 with the following commands, for example:

qiime diversity alpha \
  --i-table table1_1m2-featuretable.qza \
  --p-metric gini_index \
  --o-alpha-diversity table1_1m2-featuretable-gini_index.qza
qiime metadata tabulate \
  --m-input-file table1_1m2-featuretable-gini_index.qza \
  --o-visualization table1_1m2-featuretable-all.qzv
qiime tools view table1_1m2-featuretable-all.qzv

I've read the skbio.diversity.alpha.gini_index page in the scikit-bio 0.6.2 documentation, and I have looked at the related source file, scikit-bio/skbio/diversity/alpha/_gini.py at tag 0.6.2 in the scikit-bio/scikit-bio GitHub repository. The code looks simple, but I can't follow it. Could it be written as an R script?
I've tried to calculate the Gini index in R. Below are the four methods I ran; their results differ from one another, and none of them matches the value calculated by QIIME2 (the data set is called D1):

# Using two packages
library("ineq")
library("DescTools")

# Method 1:
Gini_D <- matrix(nrow = 1, ncol = nrow(D1))
for (i in c(1:nrow(D1))) {
  Gini_D[i] <- ineq::Gini(factor(as.numeric(D1[i,])))
  rownames(Gini_D)<-"Gini_D"
}
Gini_D

# Method 2:
Gini_D <- matrix(nrow = 1, ncol = nrow(D1))
for (i in c(1:nrow(D1))) {
  Gini_D[i] <- DescTools::Gini(as.numeric(D1[i,]))
  rownames(Gini_D)<-"Gini_D"
}
Gini_D

# Method 3:
Gini_D <- matrix(nrow = 1, ncol = nrow(D1))
for (i in c(1:nrow(D1))) {
  Gini_D[i] <- DescTools::GiniSimpson(factor(as.numeric(D1[i,])))
  rownames(Gini_D)<-"Gini_D"
}
Gini_D

# Method 4:
Gini_D <- matrix(nrow = 1, ncol = nrow(D1))
for (i in c(1:nrow(D1))) {
  Gini_D[i] <- DescTools::GiniDeltas(factor(as.numeric(D1[i,])))
  rownames(Gini_D)<-"Gini_D"
}
Gini_D

colinbrislawn · November 1, 2024, 3:28am

In case you have not found this, here's the docs for DescTools:

Calculate the Gini-Simpson coefficient, the Gini variant proposed by Deltas and the Hunter-Gaston Index.

These variants are supposed to be different.

I love a benchmark, but don't forget the lit review!

Check the docs for ineq::Gini() and DescTools::Gini and report back!

yangyue · November 1, 2024, 8:36am

Thank you! It has been resolved.
The code that resolved it is below. For each calculation I prepended an origin point (0,0) to the Lorenz curve and computed area B (the area under the curve) with trapezoids. D1 is the data sheet, with one sample per row.
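In symbols, the calculation can be sketched as follows. The notation is introduced here for illustration and does not come from the QIIME2 or scikit-bio documentation: $x_{(1)} \le \dots \le x_{(n)}$ are the sorted values of one row of D1, $n$ is `ncol(D1)`, and $y_k$ is the relative cumulative sum.

$$
y_k = \frac{\sum_{j=1}^{k} x_{(j)}}{\sum_{j=1}^{n} x_{(j)}}, \qquad
B = \frac{1}{2n}\Bigl(0 + 2\sum_{k=1}^{n-1} y_k + y_n\Bigr), \qquad
A = \tfrac{1}{2} - B
$$

$$
G = \frac{A}{1/2} - \frac{1}{n} = 1 - 2B - \frac{1}{n} = 1 - \frac{2}{n}\sum_{k=1}^{n} y_k
$$

The last equality uses $y_n = 1$. The final form is a rectangle-rule area under the Lorenz curve, which appears to be what the default `method="rectangles"` in `_gini.py` computes; that would explain why the extra $-1/n$ term is needed to match QIIME2.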

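n <- ncol(D1)            # number of columns (features) per sample
interval <- 1 / n        # width of each step on the x axis
area_total <- 1 * 1 / 2  # area under the line of perfect equality
Gini_D <- matrix(nrow = 1, ncol = nrow(D1))
under_areaB_array <- matrix(nrow = 1, ncol = nrow(D1))
upper_areaA_array <- matrix(nrow = 1, ncol = nrow(D1))
for (i in seq_len(nrow(D1))) {
  orderedD1 <- sort(as.numeric(D1[i, ]))
  orderedD1_accum <- cumsum(orderedD1)
  # Lorenz curve heights, with the origin (0,0) prepended
  orderedD1_accum_rel <- c(0, orderedD1_accum / sum(orderedD1))
  # Trapezoid rule: interior points count twice, the two end points once.
  # Indexing by n replaces the hard-coded c(2,3,4,5) and [6] (valid only for ncol(D1) == 5).
  interior <- orderedD1_accum_rel[seq_len(n - 1) + 1]
  under_areaB_array[i] <- 0.5 * interval * (0 + 2 * sum(interior) + orderedD1_accum_rel[n + 1])
  upper_areaA_array[i] <- area_total - under_areaB_array[i]
  # Subtracting the interval (1/n) is what makes the value match QIIME2
  Gini_D[i] <- upper_areaA_array[i] / area_total - interval
}
rownames(Gini_D) <- "Gini_D"
Gini_D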

Now the result matches the QIIME2 output!
Suggestions are welcome. Thank you!
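
The loop above can also be condensed into a short function that works for any number of columns. This is a minimal sketch, not the scikit-bio implementation itself: `gini_rectangles` and `D_toy` are hypothetical names made up for illustration, the rectangle rule and the clamp at zero are based on a reading of `_gini.py`, and rows are assumed to be samples, as in D1.

```r
# Hypothetical helper (not part of QIIME2 or scikit-bio): Gini index of one
# sample, using the rectangle-rule area under the Lorenz curve.
gini_rectangles <- function(counts) {
  x <- sort(as.numeric(counts))  # counts in ascending order
  n <- length(x)
  y <- cumsum(x) / sum(x)        # Lorenz curve heights at 1/n, 2/n, ..., 1
  B <- sum(y) / n                # area under the curve (rectangle rule)
  max(0, 1 - 2 * B)              # assumption: negative values are clamped to 0
}

# Toy data invented for illustration: rows are samples, columns are features
D_toy <- rbind(s1 = c(1, 2, 3, 4, 5),
               s2 = c(10, 0, 0, 0, 0),
               s3 = c(2, 2, 2, 2, 2))

# One Gini value per row (sample)
gini_toy <- apply(D_toy, 1, gini_rectangles)
print(gini_toy)
```

For the toy rows this gives 1/15 for s1, 0.6 for s2, and 0 for s3. The perfectly even sample s3 shows why the clamp matters: without it the value would be -1/n = -0.2, which is also what the loop above returns for such a row. Whether zero-count features are kept in the vector changes n and therefore the result, so the table passed to R should have the same columns as the table given to QIIME2.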