Why is using rarefaction curves to help determine sampling depth prior to producing core diversity metrics not considered p-hacking/data-peeking?

Hi, I've read through several of the forum posts/tutorials on producing a rarefaction curve, and using this to help determine the minimum sampling depth. While I understand this on a practical level - you need to normalise the data, ideally finding a balance between avoiding under-sampling diversity and not losing too many samples - when comparing alpha rarefaction curves split by the groups you want to test (e.g. the attached image), why is it acceptable to do this before calculating diversity metrics?

As the curve shows estimates of diversity at each depth, and we can see approximately how the groups we're testing behave at each point, why is this not risky in terms of p-hacking/data-peeking? For instance, in the attached image, we can see that choosing a sampling depth of ~10,000 versus ~15,000 substantially changes the difference between the groups.

I've struggled to find any literature on this.

If anyone has any insight I would really appreciate it, and apologies if this is a very basic question!

Both methods, alpha rarefaction and core-metrics at a given sampling depth, randomly select reads from the original samples. The "best" sampling depth from the rarefaction curve will not necessarily produce the most significant difference between groups, since in both cases different, randomly selected pools of features are analyzed.
Rerunning core-metrics with exactly the same depth will produce slightly different results. That is why I am always skeptical about p-values such as 0.04 and 0.06.
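
To illustrate the point about random subsampling, here is a minimal sketch in plain Python. It is not how QIIME 2 implements rarefaction; the `rarefy` and `observed_features` helpers and the example counts are made up for illustration. It rarefies one hypothetical sample to the same depth ten times and prints the observed richness from each run.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from a list of per-feature counts."""
    # Expand the count vector into one entry per read, labelled by feature index.
    reads = [feature for feature, n in enumerate(counts) for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the sample's total read count")
    picked = Counter(rng.sample(reads, depth))
    return [picked.get(feature, 0) for feature in range(len(counts))]


def observed_features(counts):
    """Richness: number of features with at least one read."""
    return sum(1 for n in counts if n > 0)


# Hypothetical sample: 20 abundant features (500 reads each)
# and 200 rare features (3 reads each), 10,600 reads in total.
sample = [500] * 20 + [3] * 200
rng = random.Random(0)  # seeded only so the sketch is reproducible

# Rarefy the same sample ten times at the same depth and record richness each time.
richness = [observed_features(rarefy(sample, 5000, rng)) for _ in range(10)]
print(richness)
```

Because each run draws a different random subset of reads, the rare features that survive differ from run to run, so the richness values will generally not all be identical even though the input and the depth never change. The same applies to any test statistic computed downstream of the rarefied table.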