I agree with @Mehrbod_Estaki that your ability to handle uneven sample sizes depends less on your metric and more on your statistical test (although I often find my test and my metric are related, since some metrics are more normal than others). My rule of thumb tends to be a minimum subgroup of about 10% of the samples. It's not a great rule of thumb, but it's what's worked for me in the past.
…But, if we’re talking about metrics and depth, I like richness metrics. (My preference is observed features, because I tend to work with methods that either don’t leave singletons in my data or don’t have meaningful singletons.) I’ve worked on multiple studies where the issue is richness and the stochastic loss of organisms is the signal. Shannon is definitely more robust to depth issues because it down-weights those lower-abundance organisms, and it tends to saturate pretty quickly. But… because it down-weights those organisms, it can make for a smaller effect size. (In a recent paper I published, Shannon had half the explanatory power of observed features.)
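To make that down-weighting concrete, here's a minimal sketch with made-up counts (the two "communities" and their abundances are entirely hypothetical): dropping a tail of rare organisms collapses observed richness but barely moves Shannon.

```python
import numpy as np

def observed_features(counts):
    """Richness: number of features with nonzero counts."""
    return int(np.sum(np.asarray(counts) > 0))

def shannon(counts, base=2):
    """Shannon diversity (log base 2, as in QIIME 2's default)."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(base))

# Hypothetical communities: same dominant taxa, but the second has
# lost its 20 rare organisms (e.g. stochastic loss at low depth).
full = np.array([500, 300, 100] + [1] * 20)
depleted = np.array([500, 300, 100])

print(observed_features(full), observed_features(depleted))  # 23 vs 3
print(shannon(full), shannon(depleted))
```

Richness drops ~87% here, while Shannon drops only modestly, which is exactly why Shannon can show a smaller effect size when rare-taxon loss *is* the signal.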
My solution recently has been to use multivariate regression on my richness metrics, adjusting for sequencing depth (or log sequencing depth) as a covariate. Richness metrics tend to be normal, or close enough to fudge it. (Thanks, central limit theorem!) It’s not always perfect, particularly if I have outliers, but it can help decrease some of those depth-related effects without actually requiring me to figure out how to propagate error across multivariate tests. (I took calc and did propagation-of-error analysis in college, but it’s not something I want to do again every week.)
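A quick sketch of what I mean, on simulated data (all effect sizes and depths here are invented for illustration): fit richness against group membership plus log depth, and read the group coefficient off the depth-adjusted model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Simulated study: richness differs by group, but deeper sequencing
# also inflates richness. True effects: group = +15, log-depth = +8.
group = rng.integers(0, 2, n)                      # 0 = control, 1 = treatment
log_depth = np.log(rng.uniform(5_000, 50_000, n))  # reads per sample
richness = 50 + 15 * group + 8 * log_depth + rng.normal(0, 5, n)

# Ordinary least squares: richness ~ 1 + group + log(depth)
X = np.column_stack([np.ones(n), group, log_depth])
beta, *_ = np.linalg.lstsq(X, richness, rcond=None)
print(beta)  # group and log-depth coefficients recovered approximately
```

In practice I'd do this with a proper stats package so I get standard errors and p-values for free, but the idea is the same: the depth term soaks up the depth-related variation so the group effect isn't confounded with it.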
As far as literature on sequencing depth and where different metrics place their emphasis, you might look for Hill numbers. I don’t have a paper off the top of my head, but they more formally address what Bod and I are trying to say while waving our hands.
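For anyone who hasn't run into them: Hill numbers are a one-parameter family, \(^qD = (\sum_i p_i^q)^{1/(1-q)}\), where the order q controls how much rare taxa count. A small sketch (the example counts are made up):

```python
import numpy as np

def hill_number(counts, q):
    """Hill number of order q: (sum p_i^q)^(1/(1-q)).

    q=0  -> observed richness (every taxon counts equally),
    q->1 -> exp(Shannon, natural log),
    q=2  -> inverse Simpson. Larger q down-weights rare taxa more.
    """
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    if np.isclose(q, 1):
        return float(np.exp(-(p * np.log(p)).sum()))  # limit as q -> 1
    return float((p ** q).sum() ** (1 / (1 - q)))

counts = [500, 300, 100] + [1] * 20  # hypothetical community with rare tail
for q in (0, 1, 2):
    print(q, hill_number(counts, q))
```

The effective number of taxa falls quickly as q increases for a community with a long rare tail, which is the formal version of "richness sees depth-driven loss of rare organisms; Shannon and Simpson mostly don't."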