How to compare/analyze genotypes with unequal numbers of samples and/or replicates?

Hi everyone,

I am currently working on sequence data from different plant genotypes. I do not have an equal number of samples or replicates across these genotypes, and I am not sure whether I need to drop the extra samples/replicates to make the comparisons fair. In other words, do I need an equal number of replicates for each genotype before performing any analysis (e.g., diversity analyses)?
Any recommendations for proper calculations/analyses?

Thanks!

Hello Eman,

Thanks for being patient while we got to your question.

More data is good, so keep all your samples! While having a balanced study design and equal variance between groups is best, some tests are robust to unbalanced cohorts and unequal variance. For example, the Kruskal-Wallis test used to test alpha diversity in the PD-mouse tutorial does not require an equal number of samples in each group.
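
To make that concrete, here is a minimal sketch of a Kruskal-Wallis test on groups of different sizes. The diversity values are made up for illustration, and the helper names (`average_ranks`, `kruskal_h`, `permutation_p`) are my own, not part of QIIME 2. It uses only the Python standard library and skips the tie correction, so treat it as an illustration; in practice an established implementation such as `scipy.stats.kruskal` is the safer choice.

```python
import random
from itertools import chain


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for groups of any size."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)  # each group uses its own size
        start += len(group)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


def permutation_p(groups, n_perm=2000, seed=0):
    """Permutation p-value: shuffle group labels and recompute H each time."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = list(chain.from_iterable(groups))
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if kruskal_h(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical Shannon diversity values; group sizes 5, 3, and 2 on purpose.
groups = [[3.1, 3.4, 2.9, 3.6, 3.3], [2.1, 2.5, 2.2], [4.0, 4.2]]
h = kruskal_h(groups)
p = permutation_p(groups)
print(f"H = {h:.3f}, permutation p = {p:.4f}")
```

Nothing in the calculation requires the groups to be the same size: each group's rank sum is simply divided by its own sample count.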

How many samples do you have within each genotype? How many replicates?

P.S. Here are some more things to think about, if you are interested :thinking:

Many thanks, Colin, and sorry for my delayed reply!
I have 2 projects with the same issue.
The first includes 18 genotypes: 5 genotypes with 5 samples each, 3 genotypes with 4 samples, 2 genotypes with 3 samples, 5 genotypes with 2 samples, and 3 genotypes with a single sample each. Regarding the replicates, I have 5 replicates, but each contains a different number of samples. For example, some replicates include 10 samples, others 14 or 11.
The sample size in this project ranges from 23 to 14355. Some of the low sample sizes come from genotypes that are represented by a single sample, so I do not believe it is a technical issue.

The second project includes 16 genotypes: one genotype with 4 samples, 3 genotypes with 3 samples, 4 genotypes with 2 samples, and 8 genotypes with 1 sample. Here I also have 5 replicates, each with a different number of samples. The sample size ranges from 4 to 1112. The total sample count is 29, and 23 of those samples have a sample size below 500.

Your help is much appreciated!

Thanks for telling me more.

Can you write up the study design in a table? I'm having a little trouble keeping track of samples per group, and I think a table like this would really help me.

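+----------+---------+------+---------------------------------+
| genotype | samples | reps | total samples for this genotype |
+----------+---------+------+---------------------------------+
| g1       | 5       | 5    | 25                              |
+----------+---------+------+---------------------------------+
| g2       | 5       | 5    | 25                              |
+----------+---------+------+---------------------------------+
| g3       | 5       | 5    | 25                              |
+----------+---------+------+---------------------------------+
| g4       | 5       | 5    | 25                              |
+----------+---------+------+---------------------------------+
| g5       | 5       | 5    | 25                              |
+----------+---------+------+---------------------------------+
| g6       | 4       | 5    | 20                              |
+----------+---------+------+---------------------------------+
| g7       | 4       | 5    | 20                              |
+----------+---------+------+---------------------------------+
| g8       | 4       | 5    | 20                              |
+----------+---------+------+---------------------------------+
| g9       | 3       | 5    | 15                              |
+----------+---------+------+---------------------------------+
...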

(I use this tool to make tables. You could also take a screenshot of your table)


When you say 'samples' do you mean sequencer reads / features?

Thanks for the table generator. Here are the tables for project 1.

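+----------+------------------------------+
| genotype | No. of samples or replicates |
+----------+------------------------------+
| g1       | 2                            |
+----------+------------------------------+
| g2       | 4                            |
+----------+------------------------------+
| g3       | 5                            |
+----------+------------------------------+
| g4       | 5                            |
+----------+------------------------------+
| g5       | 1                            |
+----------+------------------------------+
| g6       | 1                            |
+----------+------------------------------+
| g7       | 5                            |
+----------+------------------------------+
| g8       | 4                            |
+----------+------------------------------+
| g9       | 2                            |
+----------+------------------------------+
| g10      | 3                            |
+----------+------------------------------+
| g11      | 5                            |
+----------+------------------------------+
| g12      | 3                            |
+----------+------------------------------+
| g13      | 5                            |
+----------+------------------------------+
| g14      | 2                            |
+----------+------------------------------+
| g15      | 4                            |
+----------+------------------------------+
| g16      | 1                            |
+----------+------------------------------+
| g17      | 2                            |
+----------+------------------------------+
| g18      | 2                            |
+----------+------------------------------+

+-----------+----------------+
| replicate | No. of samples |
+-----------+----------------+
| rep1      | 10             |
+-----------+----------------+
| rep2      | 14             |
+-----------+----------------+
| rep3      | 12             |
+-----------+----------------+
| rep4      | 10             |
+-----------+----------------+
| rep5      | 10             |
+-----------+----------------+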

Project 2
I have 60 samples, which were reduced to 29 after removing mitochondria and chloroplast reads, since these are plant tissue samples.

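+----------+------------------------------+
| genotype | No. of samples or replicates |
+----------+------------------------------+
| g1       | 2                            |
+----------+------------------------------+
| g2       | 5                            |
+----------+------------------------------+
| g3       | 5                            |
+----------+------------------------------+
| g4       | 5                            |
+----------+------------------------------+
| g5       | 5                            |
+----------+------------------------------+
| g6       | 5                            |
+----------+------------------------------+
| g7       | 5                            |
+----------+------------------------------+
| g8       | 3                            |
+----------+------------------------------+
| g9       | 5                            |
+----------+------------------------------+
| g10      | 2                            |
+----------+------------------------------+
| g11      | 4                            |
+----------+------------------------------+
| g12      | 4                            |
+----------+------------------------------+
| g13      | 4                            |
+----------+------------------------------+
| g14      | 2                            |
+----------+------------------------------+
| g15      | 4                            |
+----------+------------------------------+

+-----------+----------------+
| replicate | No. of samples |
+-----------+----------------+
| rep1      | 7              |
+-----------+----------------+
| rep2      | 6              |
+-----------+----------------+
| rep3      | 7              |
+-----------+----------------+
| rep4      | 5              |
+-----------+----------------+
| rep5      | 4              |
+-----------+----------------+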

A sample here is identified by its sampleID, and sample size means the read count generated from sequencing that sample. I would also like to mention that each genotype is represented by a number of samples/replicates, but these are not consistent (e.g., g1 is represented by 2 samples or replicates: rep1 and rep4). Does this affect how the data should be analyzed?

Much thanks!
Eman

Hey Eman! :wave:

I think I'm getting a better sense of the two projects and their two study designs. I am very close to understanding!

Yes! And I'm not sure I understand this part of the study design... :thinking: :face_with_monocle:

Like this?

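+------------+----------------------+-----------------------+
| genotype   | total number of reps | reps in this genotype |
+------------+----------------------+-----------------------+
| g1         | 2                    | rep 1, rep 4          |
+------------+----------------------+-----------------------+
| g2 example | 3                    | rep 1, rep 8, rep 9   |
+------------+----------------------+-----------------------+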

Do your reps overlap between genotypes, like rep 1 in my example?

If that table is correct, could we also organize this data by the SampleID of amplicon samples, instead of genotype? It might look like this:

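+----------+----------+-------+
| SampleID | genotype | rep   |
+----------+----------+-------+
| s1.g1.r1 | g1       | rep 1 |
+----------+----------+-------+
| s2.g1.r1 | g1       | rep 4 |
+----------+----------+-------+
| s3.g2.r1 | g2       | rep 1 |
+----------+----------+-------+
| s4.g2.r1 | g2       | rep 8 |
+----------+----------+-------+
| s5.g2.r1 | g2       | rep 9 |
+----------+----------+-------+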
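
If the per-sample table lives in a spreadsheet or metadata file, the per-genotype summary can be derived from it rather than written by hand. A small sketch using the example rows above; the function name and the `min_reps` cutoff of 3 are arbitrary assumptions for illustration, not a recommendation:

```python
from collections import defaultdict

# Example rows from the table above: (SampleID, genotype, rep).
samples = [
    ("s1.g1.r1", "g1", "rep 1"),
    ("s2.g1.r1", "g1", "rep 4"),
    ("s3.g2.r1", "g2", "rep 1"),
    ("s4.g2.r1", "g2", "rep 8"),
    ("s5.g2.r1", "g2", "rep 9"),
]


def reps_per_genotype(rows):
    """Map each genotype to the sorted list of reps it appears in."""
    reps = defaultdict(set)
    for _sample_id, genotype, rep in rows:
        reps[genotype].add(rep)
    return {genotype: sorted(found) for genotype, found in reps.items()}


min_reps = 3  # assumed cutoff, chosen only for this example
summary = reps_per_genotype(samples)
too_few = sorted(g for g, reps in summary.items() if len(reps) < min_reps)

for genotype, reps in sorted(summary.items()):
    print(genotype, len(reps), ", ".join(reps))
print("genotypes below the cutoff:", too_few)
```

The same loop also shows which reps are shared between genotypes, which is the overlap I am asking about.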

Thank you for explaining this to me. Talking through a study design is always hard, and I appreciate the time you are investing to explain this to me.

Colin :whale2:

P.S. Looks like that table generator was having a hard day :joy_cat: I've fixed the tables.

Hello Colin,
Thanks so much for your help and patience. I generated the tables as you recommended, which makes it much easier to organize the samples. Yes, reps overlap between genotypes.

Project 1
+----------+-------------------+------------------------------+
| genotype | total no. of reps | reps in this genotype |
+----------+-------------------+------------------------------+
| g1 | 2 | rep2, rep5 |
+----------+-------------------+------------------------------+
| g2 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+-------------------+------------------------------+
| g3 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+-------------------+------------------------------+
| g4 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+-------------------+------------------------------+
| g5 | 1 | rep5 |
+----------+-------------------+------------------------------+
| g6 | 1 | rep1 |
+----------+-------------------+------------------------------+
| g7 | 4 | rep2, rep3, rep4, rep5 |
+----------+-------------------+------------------------------+
| g8 | 4 | rep1, rep2, rep3, rep5 |
+----------+-------------------+------------------------------+
| g9 | 2 | rep2, rep3 |
+----------+-------------------+------------------------------+
| g10 | 3 | rep2, rep3, rep4 |
+----------+-------------------+------------------------------+
| g11 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+-------------------+------------------------------+
| g12 | 3 | rep2, rep3, rep5 |
+----------+-------------------+------------------------------+
| g13 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+-------------------+------------------------------+
| g14 | 2 | rep2, rep5 |
+----------+-------------------+------------------------------+
| g15 | 4 | rep1, rep2, rep3, rep4 |
+----------+-------------------+------------------------------+
| g16 | 1 | rep1 |
+----------+-------------------+------------------------------+
| g17 | 2 | rep3, rep4 |
+----------+-------------------+------------------------------+
| g18 | 2 | rep1, rep2 |
+----------+-------------------+------------------------------+

+----------+------+----------+
| SampleID | REP | Genotype |
+----------+------+----------+
| S1 | Rep5 | g1 |
+----------+------+----------+
| S2 | Rep4 | g2 |
+----------+------+----------+
| S4 | Rep2 | g1 |
+----------+------+----------+
| S5 | Rep4 | g3 |
+----------+------+----------+
| S6 | Rep3 | g3 |
+----------+------+----------+
| S7 | Rep2 | g3 |
+----------+------+----------+
| S8 | Rep1 | g3 |
+----------+------+----------+
| S9 | Rep5 | g4 |
+----------+------+----------+
| S10 | Rep4 | g4 |
+----------+------+----------+
| S11 | Rep3 | g4 |
+----------+------+----------+
| S12 | Rep2 | g4 |
+----------+------+----------+
| S13 | Rep1 | g2 |
+----------+------+----------+
| S14 | Rep2 | g2 |
+----------+------+----------+
| S15 | Rep5 | g5 |
+----------+------+----------+
| S16 | Rep4 | g2 |
+----------+------+----------+
| S17 | Rep1 | g6 |
+----------+------+----------+
| S19 | Rep1 | g8 |
+----------+------+----------+
| S20 | Rep2 | g7 |
+----------+------+----------+
| S21 | Rep3 | g7 |
+----------+------+----------+
| S22 | Rep4 | g7 |
+----------+------+----------+
| S23 | Rep5 | g7 |
+----------+------+----------+
| S24 | Rep2 | g9 |
+----------+------+----------+
| S25 | Rep4 | g10 |
+----------+------+----------+
| S26 | Rep2 | g11 |
+----------+------+----------+
| S27 | Rep2 | g10 |
+----------+------+----------+
| S28 | Rep5 | g11 |
+----------+------+----------+
| S29 | Rep3 | g11 |
+----------+------+----------+
| S31 | Rep2 | g12 |
+----------+------+----------+
| S32 | Rep1 | g13 |
+----------+------+----------+
| S33 | Rep2 | g13 |
+----------+------+----------+
| S34 | Rep3 | g13 |
+----------+------+----------+
| S35 | Rep4 | g13 |
+----------+------+----------+
| S36 | Rep5 | g12 |
+----------+------+----------+
| S37 | Rep5 | g13 |
+----------+------+----------+
| S38 | Rep1 | g4 |
+----------+------+----------+
| S39 | Rep2 | g14 |
+----------+------+----------+
| S40 | Rep3 | g15 |
+----------+------+----------+
| S41 | Rep1 | g16 |
+----------+------+----------+
| S42 | Rep5 | g8 |
+----------+------+----------+
| S43 | Rep1 | g11 |
+----------+------+----------+
| S44 | Rep4 | g15 |
+----------+------+----------+
| S45 | Rep4 | g11 |
+----------+------+----------+
| S46 | Rep3 | g10 |
+----------+------+----------+
| S47 | Rep3 | g12 |
+----------+------+----------+
| S48 | Rep4 | g17 |
+----------+------+----------+
| S49 | Rep3 | g9 |
+----------+------+----------+
| S50 | Rep3 | g2 |
+----------+------+----------+
| S51 | Rep2 | g8 |
+----------+------+----------+
| S53 | Rep2 | g15 |
+----------+------+----------+
| S54 | Rep1 | g15 |
+----------+------+----------+
| S55 | Rep5 | g14 |
+----------+------+----------+
| S56 | Rep3 | g8 |
+----------+------+----------+
| S57 | Rep5 | g3 |
+----------+------+----------+
| S58 | Rep3 | g17 |
+----------+------+----------+
| S59 | Rep1 | g18 |
+----------+------+----------+
| S60 | Rep2 | g18 |
+----------+------+----------+

Thanks again!
Eman

Hello Colin,

These are the tables for project 2
+----------+------+----------+
| sampleID | rep | genotype |
+----------+------+----------+
| ss1 | rep4 | g1 |
+----------+------+----------+
| ss2 | rep1 | g1 |
+----------+------+----------+
| ss3 | rep4 | g2 |
+----------+------+----------+
| ss4 | rep3 | g2 |
+----------+------+----------+
| ss5 | rep2 | g2 |
+----------+------+----------+
| ss6 | rep5 | g2 |
+----------+------+----------+
| ss7 | rep1 | g2 |
+----------+------+----------+
| ss8 | rep1 | g3 |
+----------+------+----------+
| ss9 | rep4 | g3 |
+----------+------+----------+
| ss10 | rep3 | g3 |
+----------+------+----------+
| ss11 | rep2 | g3 |
+----------+------+----------+
| ss12 | rep5 | g3 |
+----------+------+----------+
| ss13 | rep1 | g4 |
+----------+------+----------+
| ss14 | rep3 | g4 |
+----------+------+----------+
| ss15 | rep4 | g4 |
+----------+------+----------+
| ss16 | rep2 | g4 |
+----------+------+----------+
| ss17 | rep3 | g5 |
+----------+------+----------+
| ss18 | rep1 | g5 |
+----------+------+----------+
| ss19 | rep5 | g6 |
+----------+------+----------+
| ss20 | rep3 | g6 |
+----------+------+----------+
| ss21 | rep2 | g6 |
+----------+------+----------+
| ss22 | rep3 | g7 |
+----------+------+----------+
| ss23 | rep2 | g7 |
+----------+------+----------+
| ss24 | rep5 | g7 |
+----------+------+----------+
| ss25 | rep1 | g6 |
+----------+------+----------+
| ss26 | rep4 | g6 |
+----------+------+----------+
| ss27 | rep4 | g5 |
+----------+------+----------+
| ss28 | rep5 | g5 |
+----------+------+----------+
| ss29 | rep2 | g5 |
+----------+------+----------+
| ss30 | rep5 | g4 |
+----------+------+----------+
| ss31 | rep4 | g7 |
+----------+------+----------+
| ss32 | rep3 | g8 |
+----------+------+----------+
| ss33 | rep2 | g8 |
+----------+------+----------+
| ss34 | rep3 | g9 |
+----------+------+----------+
| ss35 | rep2 | g9 |
+----------+------+----------+
| ss36 | rep1 | g10 |
+----------+------+----------+
| ss37 | rep3 | g11 |
+----------+------+----------+
| ss38 | rep5 | g11 |
+----------+------+----------+
| ss39 | rep1 | g12 |
+----------+------+----------+
| ss40 | rep1 | g11 |
+----------+------+----------+
| ss41 | rep4 | g11 |
+----------+------+----------+
| ss42 | rep3 | g10 |
+----------+------+----------+
| ss43 | rep5 | g9 |
+----------+------+----------+
| ss44 | rep4 | g9 |
+----------+------+----------+
| ss45 | rep1 | g9 |
+----------+------+----------+
| ss46 | rep1 | g8 |
+----------+------+----------+
| ss47 | rep1 | g7 |
+----------+------+----------+
| ss48 | rep5 | g12 |
+----------+------+----------+
| ss49 | rep2 | g12 |
+----------+------+----------+
| ss50 | rep4 | g12 |
+----------+------+----------+
| ss51 | rep2 | g13 |
+----------+------+----------+
| ss52 | rep5 | g13 |
+----------+------+----------+
| ss53 | rep3 | g13 |
+----------+------+----------+
| ss54 | rep1 | g13 |
+----------+------+----------+
| ss55 | rep2 | g14 |
+----------+------+----------+
| ss56 | rep3 | g14 |
+----------+------+----------+
| ss57 | rep5 | g15 |
+----------+------+----------+
| ss58 | rep4 | g15 |
+----------+------+----------+
| ss59 | rep1 | g15 |
+----------+------+----------+
| ss60 | rep3 | g15 |
+----------+------+----------+

+----------+--------------------+------------------------------+
| genotype | total no. of reps | reps in genotype |
+----------+--------------------+------------------------------+
| g1 | 2 | rep1, rep4 |
+----------+--------------------+------------------------------+
| g2 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g3 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g4 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g5 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g6 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g7 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g8 | 3 | rep1, rep2, rep3 |
+----------+--------------------+------------------------------+
| g9 | 5 | rep1, rep2, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g10 | 2 | rep1, rep3 |
+----------+--------------------+------------------------------+
| g11 | 4 | rep1, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+
| g12 | 4 | rep1, rep2, rep4, rep5 |
+----------+--------------------+------------------------------+
| g13 | 4 | rep1, rep2, rep3, rep5 |
+----------+--------------------+------------------------------+
| g14 | 2 | rep2, rep3 |
+----------+--------------------+------------------------------+
| g15 | 4 | rep1, rep3, rep4, rep5 |
+----------+--------------------+------------------------------+

However, in this project, only 29 samples remained after filtering out mitochondria and chloroplasts.

Much thanks!
Eman

Usually when I remove sequences / reads from mitochondria and chloroplasts, I still have most of my reads left and I can continue with analysis. What command did you use to filter these samples?

EDIT: Also, when you said 'reps overlap between genotypes,' do you mean you have multiple samples from different genotypes both called 'rep 1', or rep1 is a single amplicon sample with one barcode?
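
One way to see where the samples went is to compare per-sample read counts before and after filtering. A rough sketch with made-up counts and hypothetical sample names; the `min_reads` cutoff is also just an assumption, and in a real run the counts would come from the feature table summaries (e.g. from `qiime feature-table summarize`):

```python
def read_retention(before, after):
    """Fraction of reads kept per sample; samples absent from `after` kept 0 reads."""
    return {
        sample: after.get(sample, 0) / total
        for sample, total in before.items()
        if total > 0
    }


# Hypothetical per-sample read counts before and after organelle filtering.
before = {"ss1": 12000, "ss2": 9500, "ss3": 15000}
after = {"ss1": 1112, "ss2": 4}  # ss3 lost every read, so it dropped out

min_reads = 500  # assumed cutoff for a usable sample
kept = read_retention(before, after)
usable = sorted(s for s in before if after.get(s, 0) >= min_reads)

for sample in sorted(kept):
    print(f"{sample}: {kept[sample]:.1%} of reads kept")
print("samples at or above cutoff:", usable)
```

If most samples keep only a small fraction of their reads, the loss is coming from organelle sequences rather than from the filtering command itself.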

The code that I used:

qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-no-mitochondria-no-chloroplast.qza

When I said 'reps overlap between genotypes,' I meant that rep1 is a single amplicon sample with one barcode, since each rep is a unique sample. Sorry for the confusion.

Eman

OK. I think I'm getting it now!

How many amplicon samples do you have in total for each project?

56 amplicon samples for the first project, and 60 for the other one.

Thanks for your patience
Eman

OK! Let's return to your original question.

I think you need more samples, or fewer groups to compare, or both. And you can prove this with a power analysis. :zap:

56 samples might be an OK number when comparing two groups, because you would have 28 reps from each group in a fully balanced study. Even if the study was not fully balanced, for example 20 samples from one group and 36 samples from the other, you might have enough samples to capture variance and differentiate these two groups.

But with so many groups, and so few reps within each one, any stat tests performed will have very little power. This paper describes how to perform a power analysis, which is your next step.
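
As a rough illustration of why the group sizes matter, here is a back-of-the-envelope power calculation. It uses a normal approximation for a two-sided, two-group comparison, which is not the method from the paper and is optimistic for very small groups; the effect size of 0.8 and the function name are assumptions picked for the example:

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


effect_size = 0.8  # assumed standardized difference between two groups
for n in (28, 5, 3, 2):
    print(f"n = {n:2d} per group: power ~ {two_group_power(n, effect_size):.2f}")
```

With these assumptions, 28 samples per group gives roughly 85% power, while 2 to 5 per group falls far below the conventional 80% target, and that is before any correction for comparing many groups.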

Thanks so much for your valuable advice. I totally agree with you, the statistical analyses in my case could be unreliable. It sounds like there is a problem with the study design. I do not think having more samples is possible now since the plants were grown 2 years ago and the study included wild types that are probably not available now to repeat the trial. However, I will discuss this with the group.
So, if I skipped the diversity testing (alpha and beta) and just focused on the taxonomy, would that be meaningful for a microbiome study? Also, regarding the second project, where 60 samples were filtered down to 29 after removing organelles, do you have any recommendations? I contacted the core facility to ask whether they used organelle blockers, and they confirmed they did not. However, I know this tissue is not rich in microbes at a certain growth/life stage, so it may have a poor microbial community that failed to amplify during sequencing.

Thanks so much for your support!
Eman

Good afternoon,

I think the two of us have reached a consensus. Yes, you could look at taxonomy and avoid any stat tests.

I also like the idea of following up with your sequencing core about samples they filtered out. If these are low biomass samples that you expect to have low richness in microbes, they might be difficult to get a lot of sequences from. :man_shrugging:

Colin

Thanks so much for your time and support! Really appreciated! :grinning:
Eman
