finding "most important" variable


I ran MiSeq sequencing on a set of gut microbiome samples. I now have my sequencing results and have done denoising, filtering, classification, etc.

My metadata currently contains many descriptions of each sample. For example, for each sample, I noted down the date of DNA extraction, date of host sacrifice, which primer plate I used, which primer well I used and many more details. My question is, is there a way I can use QIIME 2 to filter out the "unimportant"/insignificant variables? In other words, I am looking for a way to retain only the "more important" variables, e.g. where the sample was sourced.

I looked at the tutorials and the closest workflow I can find is "Predicting sample metadata values with q2-sample-classifier" (QIIME 2 2021.8.0 documentation). However, as far as I understand it, this workflow uses machine learning to predict which category a given microbiome is likely to belong to. I am instead trying to see which categories are unlikely to affect the microbiome (so I guess the opposite of what that workflow does).

I hope I was clear in expressing my question. Thanks for the help!


Hi @Melissa_Soh,

This can be a complex topic! Personally, I like to first check my covariate sizes and relationships. Things with fewer than 5 per group are going to make for statistical testing hell: a lot of the common tests (e.g. Kruskal-Wallis) can't tolerate that small of a group size. I tend to go a step further and try to find covariates that have reasonably good balance (a 50:1 ratio is going to go in the same category as 5 samples per group). I'd also double check for covarying groups in your metadata. (You can do this for sample metadata, but I'd try it for technical metadata.) In various places I've worked, we've combined extraction plates into sequencing batches, or had a large set of reagents or several plates. If you're not working on a massive project, I would probably look at either extraction plate or sequencing run as a potential technical covariate. (But I'd pick one and let that represent a lot of other things. Keep in mind that this works best if you've got semi-randomized samples; if all your cases are on one plate and all your controls are on the others, you won't be able to distinguish the technical and biological effects as easily.) You might also want to check the literature: look at some previously published studies of technical factors in your animal model, and use them to further narrow your focus there.
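As a rough sketch of the size, balance, and covariation checks described above (plain Python, standard library only). The metadata rows, the column names (`extraction_plate`, `seq_run`, `source`), the helper names, and the imbalance cutoff are all made up for illustration; only the 5-per-group minimum comes from the post.

```python
from collections import Counter

# Hypothetical metadata: one dict per sample (names and values are invented).
samples = (
    [{"extraction_plate": "P1", "seq_run": "R1", "source": "cecum"}] * 6
    + [{"extraction_plate": "P2", "seq_run": "R2", "source": "cecum"}] * 3
    + [{"extraction_plate": "P2", "seq_run": "R2", "source": "feces"}] * 3
)

MIN_GROUP_SIZE = 5   # from the post: fewer than 5 per group is hard to test
MAX_RATIO = 10       # ASSUMED cutoff for largest:smallest group; pick your own


def check_covariate(rows, column, min_size=MIN_GROUP_SIZE, max_ratio=MAX_RATIO):
    """Summarise group sizes for one metadata column and flag problems."""
    sizes = Counter(row[column] for row in rows)
    smallest, largest = min(sizes.values()), max(sizes.values())
    return {
        "sizes": dict(sizes),
        "too_small": smallest < min_size,
        "imbalanced": largest / smallest > max_ratio,
    }


def cross_tab(rows, col_a, col_b):
    """Count samples for every observed (col_a, col_b) combination."""
    return Counter((row[col_a], row[col_b]) for row in rows)


def is_nested(rows, col_a, col_b):
    """True if each level of col_a occurs with exactly one level of col_b."""
    partners = {}
    for row in rows:
        partners.setdefault(row[col_a], set()).add(row[col_b])
    return all(len(levels) == 1 for levels in partners.values())


for column in ("extraction_plate", "seq_run", "source"):
    print(column, check_covariate(samples, column))

# Plate and run carry the same information here, so one can stand in for both.
print(is_nested(samples, "extraction_plate", "seq_run"))
print(cross_tab(samples, "source", "extraction_plate"))
```

If two technical columns turn out to be nested like this, that is the situation where picking one of them to represent the rest makes sense.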

Okay, so with variable limiting out of the way, next is variable selection. I tend to use the adonis function because it gives me an R^2 effect size, which I can use to rank my covariates. You can see it in Figure 1 of a recent paper of mine or in Figure 1 from He et al. In my case, I adjusted for age, sex, and sequencing plate off the bat because my collaborators were convinced of an age and sex effect and sometimes it's easier to throw them into the model than to argue, and then for run because it was a simple variable that encompassed a lot of technical variation. I ended up picking the variables that were 60% of the disease variation across the three distance metrics I tested (unweighted UniFrac, weighted UniFrac, and Bray-Curtis), which gave me 7 variables to fit for my ~1000 samples.

The caveats (because there always are a few :slightly_smiling_face:): first, the order of variables matters for adonis. So, if you're pre-adjusting your data for run like I did, your formula would need to be run + var. If you're working with unweighted metrics (unweighted UniFrac, Jaccard), you may also want to try adjusting for depth or alpha diversity, since those can affect your observed value. Finally, if you're working with coprophagic animals who live together (mice :mouse:, rats :rat:, apparently chickens :chicken:, etc.), you need to keep the fact that they live together in mind.

Hopefully this helps at least somewhat with variable selection, and please come back if you want to bounce more ideas!


Hello @jwdebelius,

Thank you so much for your help! Your suggestions were really clear and I feel more confident regarding variable limiting now.

Regarding the adonis function, I am not sure how to decide on the order of variables. I searched online but cannot find sources with a clear explanation. Do you have any links/ suggestions?

Cheers! :microbe:

Hi @Melissa_Soh,

I think the order-of-variables thing is buried somewhere in the vegan documentation. (The main reason I know is because a statistician friend of mine decided to read the full vegan manual a while ago, and came back with a ":warning:did you know ")

In a classic model, if you want to adjust data, you can add your variables in any order thanks to the commutative property. So, if you model z = :cat: + :rat: and z = :rat: + :cat:, the values you fit for :cat: and :rat: should be the same regardless of order.

However, the way adonis calculates its effect sizes and p-values, it uses the order of variables to determine priority in the fit. (I think it successively fits residuals, but double check the documentation/code yourself for exact methods; the main point is that the order matters.) So, with adonis, dm = :cat: + :rat: and dm = :rat: + :cat: may not give you the same values for :rat: and :cat:, especially if they co-vary.
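To make the order effect concrete, here is a toy sketch in plain Python. Big caveat: this is NOT adonis. It is an ordinary least-squares analogy on a single made-up response, where each term is credited only with the variation left over after the terms before it. It assumes adonis behaves in a similar sequential way on distance matrices, which you should confirm in the vegan documentation. The data and function names are invented for illustration.

```python
def center(values):
    """Subtract the mean so an intercept is implicitly fitted."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualize(target, predictor):
    """Remove the part of `target` explained by a centered `predictor`."""
    slope = dot(target, predictor) / dot(predictor, predictor)
    return [t - slope * p for t, p in zip(target, predictor)]


def sequential_r2(y, first, second):
    """R^2 credited to each term when fitted in the order first + second."""
    y_c, f_c, s_c = center(y), center(first), center(second)
    total_ss = dot(y_c, y_c)
    # The first term is credited with everything it can explain on its own.
    ss_first = dot(y_c, f_c) ** 2 / dot(f_c, f_c)
    # The second term only gets what remains after the first is removed.
    y_res = residualize(y_c, f_c)
    s_res = residualize(s_c, f_c)
    ss_second = dot(y_res, s_res) ** 2 / dot(s_res, s_res)
    return ss_first / total_ss, ss_second / total_ss


# Hypothetical, correlated 0/1 covariates and a made-up response.
cat = [0, 0, 0, 0, 1, 1, 1, 1]
rat = [0, 0, 0, 1, 0, 1, 1, 1]
z = [1.0, 1.2, 0.9, 2.1, 1.8, 3.0, 3.2, 2.9]

r2_cat_first, r2_rat_second = sequential_r2(z, cat, rat)   # z = cat + rat
r2_rat_first, r2_cat_second = sequential_r2(z, rat, cat)   # z = rat + cat

print("cat + rat:", round(r2_cat_first, 3), round(r2_rat_second, 3))
print("rat + cat:", round(r2_rat_first, 3), round(r2_cat_second, 3))
```

The total R^2 of the two fits is the same, but the share credited to each covariate changes with its position, because the two covariates overlap. With uncorrelated covariates the two orders would agree.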

The way I tend to approach this personally (YMMV) is to make sure my variable of interest is always the last variable in the model. So, if I want to adjust my diagnosis (diagnosis) for age (age_group), sex (sex) and sequencing run (run_id), I would code my formula as run_id + age_group + sex + diagnosis.
If I thought smoking (smoker) might be a confounder and wanted to add it to my model, I'd code it as run_id + age_group + sex + smoker + diagnosis.
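If you end up testing many variables, a tiny helper can keep the variable of interest in the last position. This is just a sketch: `build_adonis_formula` is a made-up name, not part of QIIME 2 or vegan, and the column names are the example ones from above.

```python
def build_adonis_formula(adjust_for, variable_of_interest):
    """Join terms with '+' so the variable of interest is always fitted last."""
    # Drop the variable of interest from the adjustment list if it sneaks in.
    terms = [v for v in adjust_for if v != variable_of_interest]
    terms.append(variable_of_interest)
    return " + ".join(terms)


base = build_adonis_formula(["run_id", "age_group", "sex"], "diagnosis")
with_smoker = build_adonis_formula(
    ["run_id", "age_group", "sex", "smoker"], "diagnosis"
)

print(base)         # run_id + age_group + sex + diagnosis
print(with_smoker)  # run_id + age_group + sex + smoker + diagnosis
```

The resulting string is what I would paste in as the formula for adonis (I believe the QIIME 2 action takes it via a formula parameter, but check the help text for your version).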



Hello @jwdebelius,

Thanks so much! That has been really helpful. I am going to dive into the vegan documentation for more info :diving_mask:

