How can i pull individual bacterial sequences from 16s processed data

codea · February 15, 2019, 4:41am

Im looking to pull out individual sequences from post processed 16S data. I want to pull out taxa of interest to create markers from, however i do not know how to search for these individual sequences.

Any help is appreciated

jwdebelius · February 15, 2019, 10:18am

Hi @codea,

Feature selection is a big, complex topic. And, Im not sure if this is a question about feature selection, or pulling IDs for features you've already selected.

The second is far easier: your repset maps your feature names back to the sequences, and you should be able to just look up what you want there.

The first question is, alas, far more complicated and depends on a lot of things.

I want to start with the suggestion that if you don't see a difference in at least one metric beta diversity metric from a some what comprehensive list (weighted, unweighted, phylogenetic and taxonomic...), you probably don't have enough signal to go looking for additional features. Ive read a lot of papers that do that. But, essentially, if you don't see a different, you're going on a fishing expedition and it's unlikely to be worth much. You don't have to see the relationship in all metrics (i.e. you may see something in an unweighted metric that won't do you much good in a weighted one), but you should see at least one difference.

Okay, my community-level caveat out of the way, the next thing you want to keep in mind is that unless you've spiked in a compositionality control into all your samples, the data you have is compositional. The Gneiss tutorial and this review as well as a couple forum posts (here's one) do a nice job addressing why this matters. I'd probably start with the review, andf then work my way forward.

Based on that, if you're working with cross sectional data, ANCOM is usually a good place to start. It will give you the closest result to an actual target organism. I will occasionally run a cross-comparison or sensitivity analysis with ANCOM. So, if I know I've got a big difference in beta diversity in category X, and Im interested in category Y, I might test Y over stratas for X, or I might try to identify taxa that are different in both X and Y.
ANCOM does make the assumption less than 10% of your ASVs are significant when it makes the "significance" threshold. I will often set my own threshold (I like to present my statistic as a normalised W (the number significant over the total number) and then I set my level before the test at 0.8, meaning that it's significant with 80% of the ASVs. Its always worth plotting in a volcano plot.

Gneiss is a multivariate model, but might be harder for single feature selection since it focuses on partitions in the data. Isometric log transforms are sexy math, but make interpretation harder!

You could also try supervised learning to go pick features, although Im definitely not a person to speak to that.

A few final caveats. First, you want to make sure you validate your results in a second cohort. Id actually recommend working with a second 16S cohort before moving forward.

Second, most associations we look at are poly microbial, sort of like you've got poly-genetic traits. We don't always find the same overlap in organisms. This is more complex in 16s than genetics because there are more methodological issues here with clustering, naming, etc, but also worth considering. Just because you don't see a consistent 16s signal doesn't mean there isn't a community association (Parkinson's disease is my current favorite example, but there are plenty of others.)

Third, you're missing the genetic content. The HMP manuscript showed large variation in taxa but relatively stable gene content. Many organisms can play the same role in a community.

Best of luck,
Justine

Nicholas_Bokulich · February 15, 2019, 12:23pm

Hi @codea,
It sounds like you do not want to do feature selection, it sounds like you have a specific set of taxa in mind that you want to select from an existing FeatureData[Sequence] artifact. Is that correct? If so, see this tutorial.

If you do not have that set of taxa in mind and do want to perform feature selection/differential abundance, see @jwdebelius's excellent description.

codea · February 18, 2019, 2:25am

Hi there. Thanks all for your input. Yes i do have a specific set of taxa in mind. I will try the tutorial you have set out for me here.
Will this tutorial give me the taxon sequences i desire from my 16s data?

Thanks in advance

codea · February 18, 2019, 3:03am

Hi Justine,

This section of your answer sounds most fitting, however, how can i 'look up' those sequences from here?
I can download these repset maps as a fasta file, but i dont know how to do a mass blast of all the sequences. can you provide any input as to how?

Thanks for your help!

jwdebelius · February 18, 2019, 10:18am

Hi @codea,

If you're working the way I think you're working, you should have sample IDs already in mind. Then, you should be able to grep or find and replace, or whatever those IDs. Like, ANCOM should (I think) give you a list of corresponding IDs (@Nicholas_Bokulich, please confirm; I tend to run ANCOM in scikit-bio).

Best,
Justine

Nicholas_Bokulich · February 18, 2019, 1:37pm

yes, ANCOM will give you a list of feature IDs that reject the null hypothesis.

codea · February 22, 2019, 1:43am

Is anyone willing to collaborate with me on this to show me how this is done? Im willing to give authorship.

Thanks in advance.

Nicholas_Bokulich · February 22, 2019, 1:43pm

Hi @codea,
You can see an example of using ANCOM here. It sounds like you already have your feature table, so an ANCOM tests should be straightforward unless if you have a really complicated design.