Outliers in beta diversity analyses

Hi everyone, I have a question about the results I obtained from running diversity analyses.

I ran "qiime diversity beta-group-significance" for all the variables I have in my sample metadata file.

I am not quite sure about the results I've obtained: even though beta-group-significance gives p-values < 0.05, the boxplots have a lot of outliers. Basically, the boxplots appear as flat lines and the outliers form a vertical line.

I was wondering if it is normal to obtain such a result.

Also, about my experimental design: I am analysing ITS1 files coming from three species (e.g. species 1, species 2, species 3).
The number of samples from each species differs considerably:
-Species 1: 219
-Species 2: 192
-Species 3: 100
511 samples in total.

Do you think I obtained these results because the number of samples is not equal across species? Should I include only 100 samples per species in the analyses, so that each species has the same number of samples?

Thank you so much for your help!

Hello Edoardo,

Welcome to the forums! :qiime2:

Would you be willing to post the result file so we could take a look? (I understand if not, as I have also worked with sensitive / private data.)

This is very helpful! I have a question about the terminology.

When you say 'species', is each one a single fungal isolate/axenic culture?
When you say 511 'samples', did you perform PCR 511 times and sequence 511 samples with different barcodes, or do you have 511 genomic sequences spread across three taxa?


Hi Colin,
thank you so much for your reply.

Yes, I attach a screenshot of one of the results I've obtained with beta-group-significance:

PERMANOVA results
method name: PERMANOVA
test statistic name: pseudo-F
sample size: 235
number of groups: 3
test statistic: 1.070818
p-value: 0.001
number of permutations: 999

Each boxplot corresponds to one of Plant A, Plant B, and Plant C.

Thank you for the question, and I apologize for not being clear. The term "species" I used earlier refers to the samples from which DNA was extracted. These samples were collected from three different plant species. So, to clarify, if I replace the term "species" with the term "plant," my sampling setup is as follows:

  • Plant A: 219
  • Plant B: 192
  • Plant C: 100

In my analyses, I included a total of 511 samples, which are tissues obtained from different plants. As a result, I have imported 1022 tar.gz files, consisting of 511 forward files and 511 reverse files.

Thank you so much for your help.
Best,
E.


Thank you for telling me more.

This is very good to hear! Sometimes people only have a few samples and don't have enough statistical power. I'm glad you have hundreds of samples.

Keep all the data you have! Having more samples is better, even if the groups are uneven.


The alpha diversity tests measure each sample independently. This means your Plant C: 100 samples would yield 100 alpha diversity values.

The beta diversity tests measure pairwise comparisons between samples, so comparing each Plant C sample against every other Plant C sample yields 100*99/2 = 4950 unique beta diversity values (100*99 = 9900 if each pair is counted in both directions).
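To make the difference in scale concrete, here is a minimal sketch that counts within-group pairwise distances for the group sizes reported in this thread. It counts each pair once (unordered pairs, n*(n-1)/2); counting each pair in both directions gives exactly twice as many. The function name `within_group_pairs` is just for illustration and is not part of QIIME 2.

```python
from math import comb

# Group sizes as reported earlier in this thread.
group_sizes = {"Plant A": 219, "Plant B": 192, "Plant C": 100}


def within_group_pairs(n: int) -> int:
    """Number of unique unordered pairs among n samples: n*(n-1)/2."""
    return comb(n, 2)


# One alpha diversity value per sample, but one beta diversity value per pair.
pair_counts = {name: within_group_pairs(n) for name, n in group_sizes.items()}

for name, n in group_sizes.items():
    print(f"{name}: {n} alpha diversity values, "
          f"{pair_counts[name]} unique within-group distances")
```

The number of distances grows roughly with the square of the group size, which is why the boxplots contain so many more points than there are samples.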

This very large number of values explains the boxplots:

  • most beta diversity values cluster tightly around the median, so the IQR is small and the box looks 'flat'
  • outliers are usually drawn as the points lying more than 1.5*IQR beyond the box (I believe that is the rule used here), and even though they are a small fraction of all the beta diversity values, that's still dozens of points to show on the graph
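As a rough illustration of the second point, the sketch below draws purely synthetic "distances" (not real data) and counts how many fall outside the 1.5*IQR whiskers. The normal distribution, its mean and spread, and the 1.5*IQR rule itself are all assumptions made for the example; real beta diversity values are bounded and usually skewed, so the exact numbers will differ.

```python
import random
import statistics


def count_boxplot_outliers(values, whisker=1.5):
    """Count values lying more than whisker*IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return sum(1 for v in values if v < low or v > high)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Unique within-group pairs for a group of 100 samples.
n_pairs = 100 * 99 // 2

# Synthetic stand-in for distances: tightly clustered around a central value.
distances = [rng.gauss(0.8, 0.05) for _ in range(n_pairs)]

n_outliers = count_boxplot_outliers(distances)
print(f"{n_outliers} of {n_pairs} values "
      f"({n_outliers / n_pairs:.2%}) are drawn as outliers")
```

Even when under one percent of the values are flagged, several thousand values still leave dozens of outlier points, which stack into a vertical line above and below a narrow box.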

Dear Colin,

Thank you so much for your response and for your feedback on the analyses I am conducting. It was really helpful!

Best,
E.

