Uneven sample size, a need for rarefaction tool

joannakolodz · November 13, 2024, 12:54pm

Hi!
In my study, I have five populations with highly uneven sample sizes (A: 47, B: 10, C: 90, D: 25, E: 7). I'm looking for a tool that can perform rarefaction and handle these differences in sample size, as I'm not sure the results of alpha and beta diversity analyses are valid without this step.
Is there a way to do this in QIIME 2?
I'll be grateful for any suggestions.

Best regards,
Joanna

jwdebelius · November 13, 2024, 4:05pm

Hi @joannakolodz,

Are those the number of sequences in a given sample, or are those independent samples with some depth (thousands of sequences)? Because the recommendation very much depends on your answer!

Best,
Justine

joannakolodz · November 14, 2024, 7:13am

Hi Justine,
Those are the numbers of individuals (independent samples) in each population.

Best,
Joanna

jwdebelius · November 14, 2024, 10:26pm

Hi @joannakolodz,

I think uneven sample sizes are a challenging problem, and I don't think there's a standard approach for handling them. I have a few thoughts, and hopefully this isn't too long.

To answer your most direct question, within QIIME 2, you could use the subsample-ids action (thanks @gregcaporaso!) to subsample your data. If I were taking that approach in your case, I would probably divide my data into groups and then subsample by group.

You could also use your favorite program to generate a list of samples by subsampling and then filter the table/distance matrix based on that subsample. I'm not actually sure if it's possible to filter alpha diversity at the moment, so you may have to see whether you can run the stats in QIIME 2 when your metadata is a subset of your alpha diversity.
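
As a minimal sketch of the "generate a list of samples by subsampling" idea, the Python below draws the same number of sample IDs from each group. The function name `subsample_ids_by_group` and the sample IDs are made up for illustration; only the group sizes come from the question above. It is not a QIIME 2 API.

```python
import random
from collections import defaultdict


def subsample_ids_by_group(sample_to_group, n_per_group, seed=None):
    """Randomly keep n_per_group sample IDs from every group (hypothetical helper)."""
    rng = random.Random(seed)
    ids_by_group = defaultdict(list)
    for sample_id, group in sample_to_group.items():
        ids_by_group[group].append(sample_id)

    kept = []
    for group in sorted(ids_by_group):
        ids = sorted(ids_by_group[group])  # sorted so a given seed is reproducible
        if len(ids) < n_per_group:
            raise ValueError(f"group {group!r} has only {len(ids)} samples")
        kept.extend(rng.sample(ids, n_per_group))
    return sorted(kept)


# Made-up sample IDs; the group sizes are the ones from the question.
sizes = {"A": 47, "B": 10, "C": 90, "D": 25, "E": 7}
sample_to_group = {
    f"{group}{i:02d}": group
    for group, n in sizes.items()
    for i in range(1, n + 1)
}

kept = subsample_ids_by_group(sample_to_group, n_per_group=min(sizes.values()), seed=42)

# One-column text with a "sample-id" header, ready to be saved as a metadata file.
ids_to_keep_tsv = "\n".join(["sample-id"] + kept) + "\n"
```

Saved to a file, a list like this could be passed as the metadata when filtering, for example with `qiime feature-table filter-samples` or `qiime diversity filter-distance-matrix`, assuming the IDs match the ones in your table.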

The biggest question is a philosophical one: is this the best approach? I can see three potential solutions to tackle your problem.

Rarefy to even group sizes
Exclude small groups
Collapse small groups if they're semantically similar

I tried to think through the three approaches in terms of different factors and what you need to consider:

	Rarefy to even group sizes	Exclude small groups	Collapse similar groups
Approach	Subsample groups A, B, C, and D to 7 samples each. Repeat???	Only analyze groups A, C, and maybe D	If D and E are similar groups, combine D and E into one category
Pros	Even sample size is needed for PERMANOVA and PERMDISP; even sample size may also improve performance in alpha diversity tests^a; potentially easier to understand	Retain most of your data; less loss of power for the comparisons you do perform	Retain all your samples; still able to test multiple groups of interest
Cons	You lose a lot of power by going from ~175 samples to ~35 samples. That's going to affect everything downstream.	You lose the small groups and can't address the comparisons that involve them.	You may add noise by combining small groups. You're assuming that the things you combine are semantically similar, which may not always be true.
Assumption to limit bias	Groups you're subsampling are relatively homogeneous, so any subsample you take is representative	Biases/generalizability might be limited by group loss; less information to answer your hypothesis	The groups you combine are "similar enough" to make the combination work
Potential approach to mitigate bias	Show that the subsamples are similar or maybe(???) find a way to pool the results from subsampling.^b	No great solutions, but you also can't test what you don't have enough information to test	Check/demonstrate that your groups are still similar to each other, maybe with descriptive analysis?

^a Linear regression techniques may be more robust to differences in group sizes; if you have a statistician handy, consult them.
^b There are statistical techniques that allow pooling of uncertainty across multiple estimates, but I don't know if they've been applied to rarefaction. I don't know of a published method for pooling with beta diversity. Again, a consult with a professional statistician may be best.

You could also combine some of these, for example, exclude E, combine A & B, etc, depending on your experiment, group definition, and goals.
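
To make the "exclude and/or combine" options concrete, here is a small sketch that relabels a sample-to-group mapping before analysis. The helper name `regroup`, the merged label "A+B", and the sample IDs are all hypothetical; whether any two groups are similar enough to combine is a question for your study design, not something the code checks.

```python
from collections import Counter


def regroup(sample_to_group, merge=None, exclude=()):
    """Drop samples from excluded groups and relabel merged groups (hypothetical helper).

    merge maps an original group label to its new, combined label.
    exclude is checked against the original labels.
    """
    merge = merge or {}
    regrouped = {}
    for sample_id, group in sample_to_group.items():
        if group in exclude:
            continue
        regrouped[sample_id] = merge.get(group, group)
    return regrouped


# Made-up sample IDs; the group sizes are the ones from the question.
sizes = {"A": 47, "B": 10, "C": 90, "D": 25, "E": 7}
sample_to_group = {
    f"{group}{i:02d}": group
    for group, n in sizes.items()
    for i in range(1, n + 1)
}

# Example from the text: exclude E, combine A and B.
regrouped = regroup(sample_to_group, merge={"A": "A+B", "B": "A+B"}, exclude={"E"})
new_sizes = Counter(regrouped.values())
```

The resulting mapping could be written out as a new metadata column, which keeps the original grouping intact alongside the collapsed one.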

All that said, I think if you choose to rarefy to 7 samples, you need to be very careful with power and with within-group variation.

Best of luck,
Justine

joannakolodz · November 15, 2024, 2:12pm

Dear @jwdebelius, thank you so much for your response and a huge help! I'll reconsider and check which solution seems to be the best for my study design.

sjkimble · February 10, 2025, 8:24pm

Justine:
This is great, thanks! We are thinking about the same issue that Joanna is having, and we think we want a "Rarefy to even group sizes" approach. However, given the large amount of inter-individual variation we see in our (turtle) microbial community makeup, we don't think that the "Assumption to limit bias" holds for our samples. We think that just eyeballing the taxonomic barplots demonstrates this for our data. So we think we'd like a framework in QIIME 2 that would basically 1) perform the subsample-ids action repeatedly (say, 999 times) on the groups that need subsampling (e.g., A, B, C, and D in Joanna's case), 2) pool the subsampled table with the table that was not rarefied (e.g., E in Joanna's case), 3) run analyses for each repetition (e.g., alpha-group-significance), and 4) offer some way to evaluate the proportion of repetitions that result in a significant p-value or FDR. Does this make sense? Surely many studies suffer from imbalanced sample sizes?
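
A rough sketch of that loop outside QIIME 2, using only the Python standard library, might look like the following. Everything here is an assumption for illustration: the function names, the alpha diversity values, and the test itself. A simple permutation test on group means stands in for the Kruskal-Wallis test that alpha-group-significance runs, and the sketch only counts significant repetitions; it does not address how to pool those results into one valid inference.

```python
import random
from statistics import mean


def between_group_spread(groups):
    """Size-weighted sum of squared deviations of group means from the grand mean."""
    grand = mean(v for values in groups.values() for v in values)
    return sum(len(values) * (mean(values) - grand) ** 2 for values in groups.values())


def permutation_p_value(groups, n_perm, rng):
    """Permutation p-value for the between-group spread (stand-in test)."""
    observed = between_group_spread(groups)
    labels = list(groups)
    sizes = [len(groups[g]) for g in labels]
    pooled = [v for g in labels for v in groups[g]]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = {}, 0
        for g, size in zip(labels, sizes):
            shuffled[g] = pooled[start:start + size]
            start += size
        if between_group_spread(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def fraction_significant(alpha_by_group, n_per_group, n_reps=999, n_perm=999,
                         alpha_level=0.05, seed=0):
    """Repeat: subsample large groups, test, and count significant repetitions."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_reps):
        # Groups at or below n_per_group are kept whole; larger ones are subsampled.
        subsample = {
            g: rng.sample(values, min(n_per_group, len(values)))
            for g, values in alpha_by_group.items()
        }
        if permutation_p_value(subsample, n_perm, rng) < alpha_level:
            significant += 1
    return significant / n_reps


# Made-up alpha diversity values for two groups of unequal size.
toy = {
    "large": [3.0 + 0.05 * i for i in range(40)],
    "small": [3.5 + 0.05 * i for i in range(7)],
}
share = fraction_significant(toy, n_per_group=7, n_reps=20, n_perm=199, seed=1)
```

The repetition and permutation counts are kept small here so the example runs quickly; 999 of each, as in the description above, means roughly a million shuffles.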
Thanks,
Steve

jwdebelius · February 10, 2025, 10:06pm

Hi @sjkimble,

Welcome to the forum!

I agree that what you're proposing would be fantastic. However, it doesn't exist at the moment, nor is it trivial to implement in QIIME 2. (Think on the scale of a dissertation chapter to a whole dissertation, depending on the depth and breadth of the problem involved.)

If you have one handy, I'd find a statistician who is familiar with methods to pool variance both within and across samples and see if they can make recommendations for you.

Best,
Justine

sjkimble · February 11, 2025, 7:40am

Okay, thanks, Justine.