Alpha/beta diversity: filtering datasets, additional metadata post-analysis, and missing metadata values

Hi Qiime community,

I am preparing to run the alpha and beta diversity metrics on my 16S V4 dataset but had a couple of questions about the setup needed to answer the questions of my study. My study consists of sampling bacterial populations in lake water samples across 3 lakes and from 3 size fractions per lake (whole water, >20um and <20um). Each lake was sampled once a week for several months.

First, I would like to look at how the diversity of the bacterial populations varies between lakes and between size fractions, which I was planning to do by running the diversity core-metrics pipeline and examining the categorical columns LakeName and SizeFraction. I would then like to look at differences between size fractions within each lake, and between lakes within each size fraction, which is where I run into my first question: would it be more appropriate to filter the feature table into each lake / size fraction (i.e. have 3 separate lake datasets and 3 separate size-fraction datasets) and then rerun the core-metrics pipeline on each sub-dataset, or to add a third column combining LakeName and SizeFraction in the original core-metrics analysis and then look at the results per group (for example, the size-fraction comparisons for each lake)? From what I have read in previous discussions I believe filtering may be the better option, and if so I wanted to check that when rerunning the core-metrics analysis I should provide a new sampling depth for rarefaction specific to that group of samples. Additionally, do I also have to filter the metadata file to each sub-group of samples?

My second question regards correlating diversity with continuous variables. I have both time-dependent variables which vary with season (e.g. water temperature) and time-independent continuous variables (e.g. plankton abundance). Should the qiime diversity alpha-correlation / qiime diversity bioenv commands be used for all the time-independent variables and the longitudinal analyses for the time-dependent data, or, since all the data are part of a time series, should the longitudinal analyses be used for all continuous variables?

Further, I am still waiting on results for some of the continuous variables. Is it possible to run the core-metrics pipeline on the variables I have data for now and then later use the rarefied table output and an updated metadata file containing the new continuous data with the alpha and beta diversity scripts? Or would I have to restart the core-metrics pipeline with a metadata file containing all the variables? I was concerned about the samples being rarefied differently between diversity calculations.

Finally, my last question is how to work with continuous variables for which not every sample being analyzed has a value. For example, for one lake I have no temperature data, and for another lake I have temperature data for some dates but not all. In the case of the lake with no temperature data, will it just not appear in the comparison? For the lake where temperature is missing on some dates, will those dates just be ignored? Or is it better to filter the table to the samples containing temperature data and re-run the core-metrics pipeline?

I apologize for the long post and any help on all/any of these steps would be much appreciated!

I prefer the additional metadata column: it will be more informative about overall differences within and between lakes and size fractions. beta-group-significance will perform pairwise PERMANOVA tests, so this will still tell you whether individual fractions in individual lakes are different from each other, etc.
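For the combined-column approach, the new column can be built outside QIIME 2 with pandas before the metadata file is used. A minimal sketch (the column names LakeName/SizeFraction come from the post; the sample IDs and values here are made up):

```python
import pandas as pd

# Hypothetical sample metadata mirroring the columns described in the post.
metadata = pd.DataFrame({
    "SampleID": ["s1", "s2", "s3"],
    "LakeName": ["LakeErie", "LakeErie", "LakeAgawam"],
    "SizeFraction": ["Whole", ">20um", "<20um"],
})

# Concatenate the two categorical columns into one combined group column,
# so pairwise tests can compare lake/fraction combinations directly.
metadata["LakeSizeFraction"] = metadata["LakeName"] + "_" + metadata["SizeFraction"]

print(metadata["LakeSizeFraction"].tolist())
# → ['LakeErie_Whole', 'LakeErie_>20um', 'LakeAgawam_<20um']
```

Saving this back out as a TSV gives a metadata file QIIME 2 will treat like any other categorical column.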

filtering and re-analyzing will be more useful if, e.g., you do the total analysis and see differences between some groups but your PCoA plots are a nasty tangled ball… then you could filter and re-run with subsets for ease of visualization.

beta diversity results will change any time the input samples change. Alpha diversity will not. So at the very least run alpha diversity on everything.
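A toy illustration of why that is the case: a per-sample alpha metric like "observed features" is computed from one sample's row alone, so dropping other samples cannot change it, whereas ordinations and group tests on a distance matrix depend on which samples are present. The numbers below are made up:

```python
import numpy as np

# Toy feature table: rows are samples, columns are features (counts).
table = np.array([
    [5, 0, 2, 1],   # sample A
    [0, 3, 0, 4],   # sample B
    [1, 1, 1, 0],   # sample C
])

# "Observed features" alpha diversity: number of nonzero features per sample.
observed = (table > 0).sum(axis=1)
print(observed)  # per-sample values: [3 2 3]

# Dropping sample C leaves A's and B's alpha values unchanged --
# alpha diversity is computed per sample, independent of the others.
subset = table[:2]
observed_subset = (subset > 0).sum(axis=1)
assert (observed_subset == observed[:2]).all()
```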

You could, but I would discourage it. It gets rather messy for reporting purposes and would be misleading in publication if you report different diversity results (especially alpha diversity) computed at different rarefaction depths.

No, metadata can be a superset.

Yes. As far as I know, the metadata file is only used for labeling samples in the emperor PCoA plots that are produced by core-metrics. It is not used in any way during rarefaction or diversity estimation, so your diversity results will always remain the same.

When you update your metadata file you can recreate those emperor plots by using the output pcoa results files from core-metrics (e.g., bray_curtis_pcoa_results.qza) and using emperor plot to build a new PCoA plot with the new metadata file.

That will matter at the statistical testing stage, not at the diversity estimation stage, so it does not matter for running core-metrics.

For the most part, missing values are ignored, but it really depends on which plugin/method you are using. alpha-group-significance, probably the most relevant one for you, will ignore missing values.
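To illustrate that behavior with made-up numbers: the dropping happens per metadata column, so a sample missing a temperature value is excluded only from analyses that use that column, not from the categorical lake comparison.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Hypothetical alpha diversity values with some missing temperature readings.
df = pd.DataFrame({
    "observed_otus": [120, 95, 130, 110, 88],
    "Lake": ["Erie", "Erie", "Agawam", "Agawam", "Agawam"],
    "Temperature": [14.2, np.nan, 18.5, 17.9, np.nan],
})

# Samples without a Temperature value are simply excluded from any
# analysis that uses that column.
complete = df.dropna(subset=["Temperature"])
print(len(df), "samples total,", len(complete), "with temperature data")

# The categorical lake comparison still uses all 5 samples.
groups = [g["observed_otus"].values for _, g in df.groupby("Lake")]
h, p = kruskal(*groups)
```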

I hope that helps!


Thank you this provides a lot of clarification!

I ran core-metrics on my whole dataset and started working through alpha-group-significance, which led me to some further questions about working with the whole dataset vs. filtered datasets. Here is an example of the results I received when testing for significance between lake communities (this includes all size fractions per lake) for the observed OTUs metric

and the Kruskal-Wallis pairwise results:

| Group 1 | Group 2 | H | p-value | q-value |
| --- | --- | --- | --- | --- |
| LakeAgawam (n=120) | LakeCentralPark (n=36) | 0.37201103 | 0.54191012 | 0.54191012 |
| LakeAgawam (n=120) | LakeErie (n=90) | 21.0333604807 | 4.51355588658659e-06 | 1.35406676597598e-05 |
| LakeCentralPark (n=36) | LakeErie (n=90) | 5.5696005602 | 0.0182749292 | 0.0274123938 |
  1. I was not really sure how to interpret the significant p-values for LakeAgawam/LakeErie and LakeCentralPark/LakeErie when, based on the boxplots, there is a lot of overlap between the lakes. I thought this may in part be because all 3 size fractions are included per lake, so the samples are not really lake replicates, which could be throwing off the analysis. Instead, the comparison between lakes within the whole-water fraction may be a more appropriate question, which I could answer with the third Lake/SizeFraction column mentioned above. Is this a correct interpretation of why the significant results do not look significant?

  2. Second, if using the third-metadata-column approach, which would run pairwise tests on all combinations of lakes and size fractions, do I have to worry, from a stats standpoint, about these additional comparisons versus the 3 comparisons when running per lake? I have a fairly limited understanding of statistics, but I know that the more comparisons you make, the greater the chance of error, which is corrected for by the q-value. However, if I do not plan on comparing, for example, the LakeAgawam whole sample to the LakeErie <20um, should I try to avoid calculating this comparison? That was my thinking for doing the filtering.

Also a follow up question about the additional metadata:

  1. So, just to make sure I understand this correctly: once I have my additional metadata I can just pass the updated metadata file to the qiime diversity alpha-correlation command along with the alpha diversity vector file created by the core-metrics pipeline (or to alpha-group-significance if I were adding a categorical column)? That is, core-metrics only calculates diversity per sample, and then any of the statistical alpha or beta diversity commands actually use the metadata file to group samples and calculate statistics?

The boxplots and p-values make sense to me… the mean numbers of observed OTUs for those groups are not the same. Even though there is a lot of overlap, Erie definitely looks lower, and you have a fairly large sample size to power this comparison.

That definitely adds noise, but it seems you still see significant differences.

You can test it both ways. You can see here that alpha diversity is significantly lower in lake Erie than the others, even when you do not account for fraction. The Lake/SizeFraction will probably make the differences more pronounced, but reduce your sample size so you might actually lose significance.

With an N of 6, the multiple test correction should not make much of a difference. If you have borderline significance that is made insignificant after correction, I’d say manually correct those p-values with the actual pairwise comparisons that you would have planned to perform from the start (because I agree, comparing diversity in different fractions in different lakes is probably not something you would ever want to test). But in any case don’t let worry over multiple test correction stop you from running everything together for convenience… running together and correcting later (if needed) is easier than splitting and testing now. (or run everything now and then filter and test later)
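For what it's worth, the q-value column QIIME 2 reports is a Benjamini-Hochberg FDR correction, so manually correcting just the planned subset of p-values is straightforward to do yourself. A small sketch that reproduces the q-values in the table above from its p-values:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR correction (the q-values QIIME 2 reports)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value down.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

# p-values from the pairwise Kruskal-Wallis table above.
p = [0.54191012, 4.51355588658659e-06, 0.0182749292]
q = benjamini_hochberg(p)
print(q)  # matches the reported q-value column
```

To correct only the planned comparisons, just pass that subset of p-values instead of all of them.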


I hope that helps!

Once again this provides a lot of clarification. Thank you for the detailed explanations.

I have one final question as I am working on running alpha diversity significance testing using continuous variables. As with the categorical values, I would like to look at how diversity differs amongst all samples but then also refine by lake and size fraction, which worked great by adding a new metadata column (LakeSizeFraction) concatenating the 2 columns of interest, which partitioned my samples further into the subgroups. However, I don't think I can do this for the continuous variables, since the correlation command is not grouping based on identical values. If I want to examine how diversity correlates with continuous variables in these subgroups of samples (i.e. per lake), would I have to filter my feature table and rerun core-metrics?

Indeed, concatenating continuous values would not work, unless you intend to then use the result as categorical data.

Since you are doing this for the sake of grouping samples for alpha diversity analysis, I think the most sensible thing to do is probably run a multi-way ANOVA or similar test with your alpha diversity values as the dependent variable. You will need to export your data and run this in R or another external program — nothing in QIIME 2 can run that method currently.
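If a full ANOVA feels like overkill, the per-subgroup correlation itself is also easy to reproduce externally once the alpha diversity vector and metadata are exported: alpha-correlation uses Spearman by default, and the same test can be run within each lake. A sketch with made-up values (the column names here are hypothetical):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical exported table: one alpha diversity value and one continuous
# covariate per sample, plus the lake each sample came from.
df = pd.DataFrame({
    "Lake": ["Erie"] * 4 + ["Agawam"] * 4,
    "shannon": [3.1, 3.4, 2.9, 3.6, 4.0, 4.2, 3.8, 4.5],
    "plankton_abundance": [10, 14, 8, 15, 22, 25, 20, 30],
})

# Spearman correlation (alpha-correlation's default test),
# computed separately within each lake subgroup.
for lake, group in df.groupby("Lake"):
    rho, p = spearmanr(group["shannon"], group["plankton_abundance"])
    print(f"{lake}: rho={rho:.2f}, p={p:.3f}")
```

With real sample sizes this avoids rerunning core-metrics at all: the per-sample diversity values are fixed, and only the grouping of the statistical test changes.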

I hope that helps!
