Hi!
In my study, I have five populations with highly uneven sample sizes (A: 47, B: 10, C: 90, D: 25, E: 7). I'm looking for a tool that can perform rarefaction and handle these differences in sample size, since I'm not sure the results of alpha and beta diversity analyses are valid without this step.
Is there a way to do this in QIIME 2?
I'll be grateful for any suggestions.
Are those the numbers of sequences in a given sample, or are those counts of independent samples, each with some depth (thousands of sequences)? Because the recommendation very much depends on your answer!
I think uneven sample sizes are a challenging problem, and I don't think there's a standard approach for dealing with them. I have a few thoughts, and hopefully this isn't too long.
To answer your most direct question, within QIIME 2, you could use the subsample-ids action (thanks @gregcaporaso!) to subsample your data. If I were taking that approach in your case, I would probably divide my data into groups and then subsample by group.
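As a rough illustration (not the only way to wire this up), here's a sketch of "subsample by group" with the QIIME 2 Python API. The `population` metadata column and the file names are assumptions for the example, and you should double-check the `subsample-ids` signature on your QIIME 2 version (`qiime feature-table subsample-ids --help`):

```python
# Sketch: filter to each group, randomly keep 7 samples, and merge back.
# Assumes a hypothetical "population" metadata column; verify action
# signatures against your installed q2-feature-table version.
from qiime2 import Artifact, Metadata
from qiime2.plugins import feature_table

table = Artifact.load('table.qza')
metadata = Metadata.load('metadata.tsv')

pieces = []
for group in ['A', 'B', 'C', 'D', 'E']:
    # keep only this population's samples
    filtered = feature_table.actions.filter_samples(
        table=table, metadata=metadata,
        where=f"[population]='{group}'").filtered_table
    # randomly draw 7 samples (E has exactly 7, so it passes through whole)
    sampled = feature_table.actions.subsample_ids(
        table=filtered, subsampling_depth=7,
        axis='sample').sampled_table
    pieces.append(sampled)

# stitch the per-group subsamples back into one table
merged = feature_table.actions.merge(tables=pieces).merged_table
merged.save('subsampled-table.qza')
```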
You could also use your favorite program to generate a subsampled list of sample IDs and then filter the table/distance matrix down to that subset. I'm not actually sure whether it's possible to filter an alpha diversity vector at the moment, so you may have to check whether the stats still run in QIIME 2 when your metadata covers only a subset of the samples in your alpha diversity artifact.
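For example, here's a minimal sketch of that ID-based route with pandas, again assuming a hypothetical `metadata.tsv` with a `population` column. The resulting file can be passed to `qiime feature-table filter-samples` and `qiime diversity filter-distance-matrix` via `--m-metadata-file`:

```python
# Sketch: draw 7 sample IDs per population and write them as a minimal
# metadata file that QIIME 2 filtering actions can consume.
import pandas as pd

metadata = pd.read_csv('metadata.tsv', sep='\t', index_col=0)

# 7 samples per population; E has exactly 7, so every draw keeps all of E
kept = metadata.groupby('population').sample(n=7, random_state=42)

# QIIME 2 recognizes 'sample-id' as an ID column header
kept.index.to_series(name='sample-id').to_csv(
    'keep-ids.tsv', sep='\t', index=False)
```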
The biggest question is a philosophical one: is this the best approach? I can see three potential solutions to tackle your problem:

1. Rarefy to even group sizes
2. Exclude small groups
3. Collapse small groups if they're semantically similar
I've tried to think through the three approaches in terms of the different factors you'd need to consider:
| | Rarefy to even group sizes | Exclude small groups | Collapse similar groups |
| --- | --- | --- | --- |
| **Approach** | Subsample groups A, B, C, and D to 7 samples each. Repeat? | Only analyze groups A, C, and maybe D. | If D and E are similar groups, combine them into one category. |
| **Pros** | Even group sizes are needed for PERMANOVA and PERMDISP. Even group sizes may also improve performance in alpha diversity tests.ᵃ Potentially easier to understand. | Retain most of your data. Less loss of power for the comparisons you do perform. | Retain all your samples. Still able to test multiple groups of interest. |
| **Cons** | You lose a lot of power by going from 179 samples to 35. That's going to affect everything downstream. | You lose the small groups and can't address those comparisons. | You may add noise by combining small groups: you're assuming the things you combine are semantically similar, which may not always be true. |
| **Assumption to limit bias** | The groups you're subsampling are relatively homogeneous, so any subsample you take is representative. | Generalizability may be limited by the loss of groups; there's less information to answer your hypothesis. | The groups you combine are "similar enough" to make the combination work. |
| **Potential approach to mitigate bias** | Show that the subsamples are similar, or maybe (?) find a way to pool the results from subsampling.ᵇ | No great solutions, but you also can't test what you don't have enough information to test. | Check/demonstrate that your combined groups are still similar to each other, maybe with a descriptive analysis. |

ᵃ Linear regression techniques may be more robust to differences in group sizes; if you have access to a statistician, a consult is worthwhile.
ᵇ There are statistical techniques for pooling uncertainty across multiple estimates, but I don't know whether they've been applied to rarefaction, and I don't know of a published method for pooling beta diversity results. Again, a consult with a professional statistician may be best. (A crude stability check is sketched below.)
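On footnote b: this isn't a formal pooling method, but as a crude sanity check you could repeat the 7-per-group draw many times and look at how stable the PERMANOVA result is across draws. A sketch with scikit-bio, assuming a distance matrix exported from QIIME 2 (lsmat/TSV format) and the same hypothetical `population` column:

```python
# Crude stability check, not formal pooling: repeat the 7-per-group
# subsample and look at the spread of PERMANOVA statistics across draws.
import pandas as pd
from skbio import DistanceMatrix
from skbio.stats.distance import permanova

dm = DistanceMatrix.read('distance-matrix.tsv')  # e.g. exported UniFrac
metadata = pd.read_csv('metadata.tsv', sep='\t', index_col=0)

records = []
for seed in range(100):
    # a fresh 7-per-group draw each iteration
    kept = metadata.groupby('population').sample(n=7, random_state=seed)
    sub_dm = dm.filter(kept.index)               # subset the distance matrix
    result = permanova(sub_dm, kept, column='population', permutations=999)
    records.append((result['test statistic'], result['p-value']))

summary = pd.DataFrame(records, columns=['pseudo-F', 'p-value'])
print(summary.describe())                        # how stable are the results?
```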
You could also combine some of these approaches (for example, exclude E and combine A & B), depending on your experiment, group definitions, and goals.
All that said, if you choose to rarefy to 7 samples per group, I think you need to be very careful about power and about within-group variation.
Dear @jwdebelius, thank you so much for your response; it's a huge help! I'll reconsider and check which solution seems best for my study design.