16S rRNA sequencing analysis theoretical question

Angelica · December 2, 2020, 7:22pm

Hello,
I have a theoretical question about 16S rRNA sequencing analysis. If we use primers for multiple regions of the gene, e.g. V2, V3, V8, V9, etc., then our samples may contain multiple reads for each species that is present. How, in the end, can we know exactly how many cells of each species we had in our sample, since we cannot guarantee that they were all sequenced with the same success? And what about overlaps of specific V regions between closely related species?
Thank you in advance

jwdebelius · December 3, 2020, 4:47pm

Hi @Angelica,

You've got a couple of issues here.

You can't; you won't. Even if you used a single region, the number of 16S genes is not related to the number of cells as a general rule. I'd recommend looking at "Quantitative microbiome profiling links gut community variation to microbial load", and I specifically recommend Figure S5. There are caveats. If you don't have a lot of bacterial cells, you will have trouble getting DNA to amplify. There are low biomass protocols that can help with this issue, but my experience has been that it's closer to binary. The quality of your extraction affects the final read count, but less so than you'd think.

You can check the number of 16S genes in your original sample using qPCR (which does not directly correlate to the number of cells, because cells have multiple 16S genes and copy number variation is always a fun discussion; see posts below). Or, you can do flow cytometry like the paper above talked about, if you can get the protocol to work and can find a flow cytometer that will let you run bacterial cells.
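To make the copy number point concrete, here is a minimal sketch of why the same qPCR result can correspond to very different cell counts. The function name and all numbers are hypothetical and chosen only for illustration; real 16S copy numbers vary by taxon and strain and would have to come from a reference you trust.

```python
def cells_from_16s_copies(gene_copies, copies_per_genome):
    """Rough cell estimate: measured 16S gene copies / 16S copies per genome."""
    if copies_per_genome <= 0:
        raise ValueError("copies_per_genome must be positive")
    return gene_copies / copies_per_genome


# Hypothetical qPCR result: the same number of 16S gene copies for two taxa.
measured_copies = 70_000

# Illustrative copy numbers only (assumptions, not measured values).
taxon_a_cells = cells_from_16s_copies(measured_copies, copies_per_genome=7)
taxon_b_cells = cells_from_16s_copies(measured_copies, copies_per_genome=1)

# Identical gene counts, sevenfold difference in estimated cells.
print(taxon_a_cells, taxon_b_cells)
```

Even this correction only helps if the copy number for each organism is actually known, which is often not the case.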

This is kind of a separate challenge right now. Finding appropriate ways to combine multiple reads is really difficult as of 3 December 2020. I'll link the Ion Torrent thread below, where they work through a potential pipeline.

This is one of those generalizability/specificity problems. It's constrained by read length (longer reads -> more specificity), the region/primers you chose, and what you believe you need. There are cases where having that resolution is critical to the biology and cases where we don't know. Picking your hypervariable region depends on a lot of things, including:

Standard for your environment (for example, the vaginal people have primers they really like)
How much you want to compare across environments (the EMP primers are designed to work in lots of places but may give you lower resolution)
Specific primers can focus on specific clades of interest, do you need that specificity?
What does your database cover?
How long is your read length, and will you be able to scaffold? (V1-V3 tends to be longer than a 2x300 Illumina run; V4 with 2x150 tends not to join)
Potential off target effects.
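The read length point in the list above comes down to simple arithmetic: paired reads can only be joined if, after quality trimming, they still overlap in the middle of the amplicon. A rough sketch follows. The amplicon lengths, trimming amounts, and the 12 bp minimum overlap are assumptions for illustration, not values from this thread; check them against your own primers and pipeline.

```python
def expected_overlap(amplicon_len, read_len, trim_fwd=0, trim_rev=0):
    """Bases shared by forward and reverse reads after trimming.

    A negative value means there is a gap between the two reads.
    """
    fwd_len = read_len - trim_fwd
    rev_len = read_len - trim_rev
    return fwd_len + rev_len - amplicon_len


def can_join(amplicon_len, read_len, min_overlap=12, trim_fwd=0, trim_rev=0):
    """True if the trimmed reads overlap by at least min_overlap bases."""
    overlap = expected_overlap(amplicon_len, read_len, trim_fwd, trim_rev)
    return overlap >= min_overlap


# Assumed ~290 bp V4 amplicon (primers included), 2x150 reads, no trimming.
v4_2x150 = can_join(amplicon_len=290, read_len=150)

# Same assumed amplicon with 2x250 reads.
v4_2x250 = can_join(amplicon_len=290, read_len=250)

# Assumed ~520 bp V1-V3 amplicon, 2x300 reads, low-quality tails trimmed off.
v13_2x300 = can_join(amplicon_len=520, read_len=300, trim_fwd=40, trim_rev=80)

print(v4_2x150, v4_2x250, v13_2x300)
```

Under these assumed numbers, only the 2x250 run on V4 leaves enough overlap to join.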

Best,
Justine

https://forum.qiime2.org/t/possible-analysis-pipeline-for-ion-torrent-16s-metagenomics-kit-data-in-qiime2/13476/83

Angelica · December 3, 2020, 9:29pm

Thank you very much for your detailed answers! It is really helpful

How can we then know species abundance for sure? If each cell has multiple genes to amplify and they are not always the same within the same cell, and we cannot assemble reads from different V regions for each species, how can we tell which genera are more abundant in a sample?

Is it then better to use one variable region with a good enough length to be accurate for 16S rRNA studies?
One last question: I have used DNA extracted from standard cultures to evaluate sequencing and analysis methods, and I have seen that even though I detect all the species I should have, I still get some in very low amounts that shouldn't be there. Taking all of the above into account, I believe it is possible that some reads are incorrectly classified into different species due to intracellular diversity of 16S rRNA genes, as well as other details of the classification process such as the selected read quality, the length discarded, the sequencing platform error rate, or other algorithmic errors that have not yet been overcome. Just to be sure, is it possible to always have a few false positives, even when analysing standard culture samples with QIIME2 and doing everything as well as possible in all the steps of the analysis "as of 3 December 2020"?

jwdebelius · December 3, 2020, 10:57pm

Hi @Angelica,

We do the best we can. Like any science, a microbiome sequencing experiment is a model, and all models are wrong; it's just that some are useful. So, with that philosophy, we try to recognize our biases and work with them. I'm not trying to be disheartening, but a lot of days, I feel like an explorer who is trying to help build a map. A lot of people want GPS, and we're doing the best with the tools we've got.

So, again, let's look at some of the questions...

Okay, philosophically, species are kind of hand wavy, especially for bacteria. My understanding from macroscopic ecology is that a species is something that has independent reproduction that produces fertile offspring. (So, like, even though a lion and a tiger can have babies, a liger can't.) In bacteria, sex is way more complicated, and the "species level" approximation is something useful to put things in nice boxes. Personally, for amplicon sequencing I'm a big fan of amplicon sequence variants, which essentially become barcodes for very specific parameters. Like, it can be useful to have a name. You can get an approximate name in most environments (this also depends on your database), but at the end of the day, as Shakespeare said, "A rose by any other name would smell as sweet", and whatever we're calling that barcode this week still looks pretty nice in a vase on my table.

My next question is whether organism abundance always matters, and if ecology is always a question of different organisms. I've worked on lots of projects where the thing that characterized the unhealthy state was instability. This is kind of encompassed in the Anna Karenina hypothesis. You may do your analysis and discover that the community state, rather than the organism, was the signal after all. And for that, molecular barcodes often work better than a species name.

And if you have a case where you do have evidence for specific organisms causing a difference, I think, again, this is complicated and it depends on the question you're asking. If you're looking for something specific with a known infectious dose, then you want a technique that specifically profiles that organism (for example, a gene- or species-specific qPCR or ELISA). So, like, if you've got oral samples and you specifically care about P. gingivalis, go look for that with specific primers. But if you want to understand the relationship between organisms and community states, then the absolute abundance may not help you. You may want to read more about how to work with compositional data. "Microbiome Datasets Are Compositional: And This Is Not Optional" is required reading for the field, IMO. I also think it's worth looking at the Songbird paper to see how we can address some of these challenges.
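As a small illustration of why compositionality matters, the sketch below shows how relative abundances shift for every taxon when only one taxon changes in absolute terms, and applies a centered log-ratio (CLR) transform, one common log-ratio approach for compositional data. The counts and the pseudocount of 1 are assumptions for illustration; this is not the method used by Songbird or by any specific tool mentioned here.

```python
import math


def closure(counts):
    """Convert raw counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean log value.

    The pseudocount (an assumption here) avoids taking log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical counts: only the third taxon differs between the samples.
sample_a = [100, 100, 100]
sample_b = [100, 100, 400]

# Taxa 1 and 2 did not change, yet their relative abundance halves.
print(closure(sample_a))
print(closure(sample_b))

# CLR values are expressed relative to the sample's geometric mean.
print(clr(sample_a))
print(clr(sample_b))
```

The drop in taxa 1 and 2 is purely an artifact of the totals being constrained, which is the kind of effect log-ratio methods are meant to handle.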

I think by March of next year, there should be a computationally tractable implementation that will let you scaffold multiple fragments. So, if it were 12 months ago, I would say a single hypervariable region would be the answer. For an analysis today, I would chance it. The worst that happens if you sequence multiple regions and there isn't a good technique is that you end up using one and have wasted money. If that's a major concern, then maybe that's your answer.

There are two possible reasons. One is misclassification or a lack of resolution. So, like, if you're getting an unclassified or unspecified f__Enterobacteriaceae instead of specifically an E. coli, then it's because of a resolution thing. If you're supposed to get Lactobacillus iners and get Lactobacillus jensenii instead, it's your regional resolution, or simply misclassification.

But, like, there are also some pretty common contaminants that cause problems, specifically in low biomass systems. How much this matters depends to some degree on what environment you're working with. A high biomass environment like adult feces or soil is more robust to reagent contamination than something lower biomass like a biopsy, a placenta, or a clean room. A couple of good starting references there are the Salter paper, which is kind of a classic, and KatharoSeq, dealing with low biomass specifically. In general, you may find a best practices reference for lab work helpful; I probably need to find a new one, but Google Scholar probably knows.

I realize this post is probably not going to put you at ease. Like so many things in life, microbiome research is kind of like going to school with Ms. Frizzle and her Magic School Bus: you have to be prepared to take chances, make mistakes, and be messy. And hopefully, as we make mistakes together, we'll move things forward.

Best,
Justine

Angelica · December 4, 2020, 1:51pm

Thank you so so much for your answers and your time!

That's exactly how I feel about microbiome research!

That's true and I believe it is most helpful
Thank you very much for all the references. It's exactly what I need.