How can i pull individual bacterial sequences from 16s processed data

jwdebelius · February 15, 2019, 10:18am

Feature selection is a big, complex topic. And, Im not sure if this is a question about feature selection, or pulling IDs for features you've already selected.

The second is far easier: your repset maps your feature names back to the sequences, and you should be able to just look up what you want there.

The first question is, alas, far more complicated and depends on a lot of things.

I want to start with the suggestion that if you don't see a difference in at least one metric beta diversity metric from a some what comprehensive list (weighted, unweighted, phylogenetic and taxonomic...), you probably don't have enough signal to go looking for additional features. Ive read a lot of papers that do that. But, essentially, if you don't see a different, you're going on a fishing expedition and it's unlikely to be worth much. You don't have to see the relationship in all metrics (i.e. you may see something in an unweighted metric that won't do you much good in a weighted one), but you should see at least one difference.

Okay, my community-level caveat out of the way, the next thing you want to keep in mind is that unless you've spiked in a compositionality control into all your samples, the data you have is compositional. The Gneiss tutorial and this review as well as a couple forum posts (here's one) do a nice job addressing why this matters. I'd probably start with the review, andf then work my way forward.

Based on that, if you're working with cross sectional data, ANCOM is usually a good place to start. It will give you the closest result to an actual target organism. I will occasionally run a cross-comparison or sensitivity analysis with ANCOM. So, if I know I've got a big difference in beta diversity in category X, and Im interested in category Y, I might test Y over stratas for X, or I might try to identify taxa that are different in both X and Y.
ANCOM does make the assumption less than 10% of your ASVs are significant when it makes the "significance" threshold. I will often set my own threshold (I like to present my statistic as a normalised W (the number significant over the total number) and then I set my level before the test at 0.8, meaning that it's significant with 80% of the ASVs. Its always worth plotting in a volcano plot.

Gneiss is a multivariate model, but might be harder for single feature selection since it focuses on partitions in the data. Isometric log transforms are sexy math, but make interpretation harder!

You could also try supervised learning to go pick features, although Im definitely not a person to speak to that.

A few final caveats. First, you want to make sure you validate your results in a second cohort. Id actually recommend working with a second 16S cohort before moving forward.

Second, most associations we look at are poly microbial, sort of like you've got poly-genetic traits. We don't always find the same overlap in organisms. This is more complex in 16s than genetics because there are more methodological issues here with clustering, naming, etc, but also worth considering. Just because you don't see a consistent 16s signal doesn't mean there isn't a community association (Parkinson's disease is my current favorite example, but there are plenty of others.)

Third, you're missing the genetic content. The HMP manuscript showed large variation in taxa but relatively stable gene content. Many organisms can play the same role in a community.

Best of luck,
Justine