Unbalanced sampling design and PERMANOVA


I have the same issue. My project includes samples from different plant genotypes, but the wild-type samples yielded so few reads that they are excluded from the analysis when I choose a good sampling depth. When I tried a shallow depth, I lost much of the data/information about the microbiome, yet I could keep only one sample per wild-type genotype, and the PERMANOVA test resulted in the error posted above. I believe that in my case I would not be able to calculate beta diversity, correct?
One more question: in my current project, I do not have an equal number of samples or replicates across genotypes. I believe I have to drop the extra samples/replicates for a fair comparison, don't I? Any recommendations for proper calculations/analyses?


Any recommendations from the community?


Please be patient

Hi @Eman, please re-read the Code of Conduct's section on Patience. It's been a busy couple of days on the forum, but this post hasn't been forgotten. :slight_smile:

Please don't pile on to old questions

When you run into an issue someone else has posted about before, please open a new topic linking to the research you've done, instead of adding to someone else's question. No big deal - it just helps keep things tidy around here.

Please ask one question per topic

This isn't a User Support question, and isn't directly related to this topic. Please open this as a separate question in the General Discussion category.

Your Question

Why do you think this is the case? No need to answer me here, but it's a question you should be asking in case it's related to technical biases. Depending on your situation, solving this directly might provide a better solution than anything you could do in analysis.

Choosing the best sampling depth for your study can be a delicate balance. Depending on how low your WT sequencing depth is, it might still be better to keep those samples than to keep more reads. Your goal, after all, is probably to study a representative selection from the population of WT organisms. Performing statistical analysis based on only one sample seems like a good way to throw out any statistical power you might have had. If you decide to go this route anyway, plan to justify your decision in the paper.
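To make that trade-off concrete, here is a minimal Python sketch that counts how many samples per genotype would survive at a few candidate sampling depths. The sample IDs, read totals, and genotype labels are all made up for illustration; in practice you would take the per-sample totals from your own feature table summary.

```python
# Hypothetical per-sample read totals and genotype labels (not from this thread).
read_totals = {
    "wt_1": 900, "wt_2": 1500, "wt_3": 4200,
    "mut_1": 12000, "mut_2": 15000, "mut_3": 9800,
}
genotype = {
    "wt_1": "WT", "wt_2": "WT", "wt_3": "WT",
    "mut_1": "mutant", "mut_2": "mutant", "mut_3": "mutant",
}


def samples_retained(depth):
    """Count, per genotype, the samples that have at least `depth` reads."""
    kept = {g: 0 for g in genotype.values()}
    for sample, total in read_totals.items():
        if total >= depth:
            kept[genotype[sample]] += 1
    return kept


# Compare a few candidate depths side by side.
for depth in (900, 1500, 4200, 9800):
    print(depth, samples_retained(depth))
```

A depth that leaves a group with zero or one sample is a warning sign for any group comparison, whatever it does for the number of reads kept per sample.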

It is possible to calculate the difference between one WT sample and your other samples, but you run the risk of asking the wrong question. That is, your question changes from "What is the difference between WT and intervention communities?" to "What is the difference between this WT community, and the intervention communities?" This may limit what you can conclude from the study.

You're banging your head against one of the fundamental problems with rarefying data here (i.e. loss of data). If it's not possible for you to improve the disparity in sampling depth between your WT and experimental groups, and you lose too much data with a low sampling depth, you might be able to make progress using different normalization techniques. Using relative frequencies instead of count data might allow you to ask some diversity questions, but looking at relative abundance of taxa may require more advanced techniques. This paper may help you avoid some of the pitfalls, but this is solidly beyond my expertise. If it's also outside of your comfort zone, you might consider collaborating with a bioinformatician or statistician who has experience with these approaches.
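As a rough illustration of the relative-frequency idea, the Python sketch below converts raw counts to proportions and compares two samples with Bray-Curtis dissimilarity. The count vectors are invented, and the helper functions are written out by hand for this example rather than taken from any QIIME 2 plugin. It only shows the effect of unequal depth on one metric; it does not address compositionality or the undersampling of rare taxa.

```python
def relative_frequencies(counts):
    """Convert raw feature counts for one sample into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


# Two hypothetical samples with the same composition but a 10x depth difference.
shallow = [10, 30, 60]
deep = [100, 300, 600]

raw_distance = bray_curtis(shallow, deep)
relative_distance = bray_curtis(
    relative_frequencies(shallow), relative_frequencies(deep)
)
print(raw_distance, relative_distance)
```

On the raw counts the two samples look very different purely because of depth, while on proportions they are identical. Real low-depth samples will also miss rare features, which proportions alone cannot fix.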

Good luck!


Thanks, Chris, and sorry for my delayed reply. I have already created a new post, as you recommended.

Regarding your reply to my comment "the reads generated from the wildtypes are very low":
I do not think it is a technical bias. I calculated Good's coverage as follows:

qiime diversity alpha \
  --i-table table-no-mitochondria-no-chloroplast.qza \
  --p-metric goods_coverage \
  --o-alpha-diversity goods_coverage_vector.qza

qiime metadata tabulate \
  --m-input-file goods_coverage_vector.qza \
  --o-visualization goods_coverage_vector.qzv

That results in a Good's coverage of 1 for all samples, so I do not think it is a technical issue/bias.
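For reference, Good's coverage is 1 minus the number of singleton features divided by the total count in the sample. The Python sketch below works through that formula on invented count vectors. One caveat, stated as an assumption because the thread does not say how the table was produced: if the upstream pipeline already discards singletons, the value will be 1 for every sample regardless of depth.

```python
def goods_coverage(counts):
    """Good's coverage: 1 - (number of singleton features / total count)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    singletons = sum(1 for c in counts if c == 1)
    return 1 - singletons / total


# Hypothetical samples: one with no singletons, one with two singletons.
no_singletons = [5, 3, 2]
with_singletons = [1, 1, 8]

print(goods_coverage(no_singletons))
print(goods_coverage(with_singletons))
```

A sample with only ten reads and no singletons still scores 1 here, so the metric by itself says little about whether the low read counts have a technical cause.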

For the diversity calculation, I agree with you: I should try other normalization techniques to avoid data loss, especially since I have many samples with very low read counts, and dropping them would remove important genotypes from the analysis.


