I'm developing a bioinformatics pipeline to analyze raw data from the Ion Torrent 16S Metagenomics Kit. This kit amplifies 7 regions using two primer pools (Pool 1: V2, V4, V8 and Pool 2: V3, V6-7 and V9). I have read that one of the biggest problems with this kit is that the primer sequences aren't provided to users, which makes the analysis difficult. A possible solution has been proposed by @KMaki and colleagues in their recent work: separating sequences by V region before importing into Qiime2.
My question is about how to perform the downstream analysis. I have read that some researchers assess sequence quality in each V region and then select a subset of regions to generate the feature table and carry out further analysis (barplots, alpha/beta diversity, differential abundance...). In this paper, the authors suggest that the V2, V4 and V6-7 regions give more consistent results than the other regions, so that could be one strategy for selecting V regions. I think that we could be losing biological information with this selection approach.
However, running 6 downstream analyses on 6 feature tables (one per V region) will generate a vast amount of results that can be a nightmare to interpret. For example, with clinical data comparing healthy and cancer patients, going through six sets of results would be tedious.
Is there a way to integrate the dada2 preprocessed results (feature table and representative sequences) from each V region and generate a "consensus" feature table?
I have good news! This multi-region kit has been discussed in great detail, and some methods have been proposed within Qiime2 to deal with it.
The bad news is that this is still pretty difficult and there is no standard method for addressing this.
EDIT: But there is a new plugin you should try: q2-sidle
Correct.
No, because the regions themselves don't overlap, so you can't treat it like genomics and 'assemble' them into a full-length 16S gene.
You could place your disparate ASVs into a tree, but that still does not create a single ASV from the 7 regions you sequenced.
EDIT, which is where sidle comes in:
we present Sidle, an implementation of the Short MUltiple Reads Framework algorithm with a novel tree-building approach to reconstruct rRNA genes from individually amplified regions.
Thank you for your suggestions! I have read this super interesting post, the information about q2-sidle, and other related posts about this kit on the forum. It seems that many questions remain open about how to analyze the data from this technology.
I understand. According to this post by @MiriamGorostidi, the software used by Ion Torrent (Ion Reporter) adds up the results of each V region to create the "consensus" feature table. As other users have commented in other posts, this approach would overestimate richness. In addition, I think that if we ran the analysis on all V regions together (using the FASTQ files just as the kit delivers them to users), we'd see an overestimation too.
At this point, I have two questions, one more practical and one out of curiosity:
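To make the overestimation concrete, here is a minimal sketch with made-up numbers. It assumes three organisms and three regions, and that every region produces its own ASVs, so feature IDs never match across regions. The IDs, counts and the `observed_richness` helper are purely illustrative; this is not how Ion Reporter is implemented, only the arithmetic of pooling tables.

```python
# Toy illustration only: hypothetical ASV IDs and counts, not real data.
# Each region sees (almost) the same three organisms, but every region
# yields its own ASV sequences, so feature IDs differ between regions.
region_tables = {
    "V2": {"V2_asv_a": 120, "V2_asv_b": 30, "V2_asv_c": 50},
    "V4": {"V4_asv_a": 110, "V4_asv_b": 40, "V4_asv_c": 45},
    "V8": {"V8_asv_a": 90, "V8_asv_b": 25},  # organism c not amplified here
}


def observed_richness(table):
    """Count the features that have a non-zero count."""
    return sum(1 for count in table.values() if count > 0)


# Naive "consensus": pool every feature from every region into one table.
pooled = {}
for table in region_tables.values():
    for feature_id, count in table.items():
        pooled[feature_id] = pooled.get(feature_id, 0) + count

per_region = {name: observed_richness(t) for name, t in region_tables.items()}
pooled_richness = observed_richness(pooled)

print("richness per region:", per_region)
print("richness after pooling:", pooled_richness)
```

In this toy case no single region reports more than 3 features, but the pooled table reports 8, because the same organism is counted once per region.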
Do you think that the best option to develop a downstream analysis using a feature table is to select a table from only one specific V region? What parameters should I keep in mind in this hypothetical selection (richness in each V region in my own study, previous studies comparing V regions...)?
What advantages does this kit have over others, unless we want to compare how V regions perform against each other? I think that if more regions are sequenced, the chance of detecting more taxa is higher, but if I have no guarantee of obtaining a reliable feature table from all these regions (one of the main data types in 16S analysis), maybe it would be easier to work from the beginning with two primers (for one or more V regions) and keep things simple, as long as a reliable analysis workflow can be established.
I think we are on the same page about the problems with this kit.
Perhaps. Choosing a region with broad-phyla coverage for all taxa, then adding a second region for specific taxa of interest may be worth a try. (So, select regions based on the expected resolution of taxa. Presumably Ion Torrent has literature on this.) Reviewer three will have questions, but that's normal.
I don't like this kit because I'm keenly aware of the problems it causes. But I'm on the numerical ecology side of things, so these are the problems I think about. A traditional microbiologist might like the extra taxa resolution without regard for the statistical conundrum.
In the three years since that post, cross-region analysis has improved and q2-sidle/smurf engages with this problem directly. Awareness in this field of compositionality has also improved, and this kit's problem of presenting a taxonomy table that is multi-region compositional may not be a problem forever.
Perhaps I am at fault for being a grumpy armchair statistician. This kit has the same problem that all of meta-*omics has: each observation can come from multiple sources, and all of this is convolved in the measurements you make. And yet, the field is making reasonable progress!
Thank you so much for your answers @colinbrislawn! I really hope that this problem can be solved in the near future, because I think that putting together feature counts from different V regions in a reasonable way would be valuable. Meanwhile, we'll have to look for alternatives.
Hello, a brief observation: I used q2-sidle on a similar data set, and the results (especially the taxonomic profiling) were consistently better than those obtained using individual regions. Only building the phylogenetic tree was a challenge that I could not resolve.
Sidle and the SMURF algorithm take the counts from the disjoint regions, align them to a reference, and then use expectation maximization to solve for the abundance. The multi-region approach gives you increased specificity, but it is also designed to handle differences in regional coverage/database coverage.
The problem with Sidle for you is that it currently requires primers, which Ion Torrent doesn't provide. It's on my list for Sidle, but probably not until I finish an earlier PR and get the manuscript re-submitted. Happy to share thoughts/code if someone else wanted to take on that particular challenge, though. I have some cool partial tools and thoughts, I just have been busy.
The trick to the tree in Sidle is that if you're using Silva and want a tree, it must have a corresponding insertion tree backbone. The current (March 2023) version that has a matching backbone is 128.
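As a rough illustration of the expectation-maximization idea (a toy sketch, not Sidle's or SMURF's actual code), suppose each regional sequence matches one or more reference sequences, and we want the reference abundances that best explain the regional counts. The function name `em_abundance`, the reference and sequence IDs, and the counts below are all hypothetical, and the model ignores sequencing error and per-region amplification bias.

```python
def em_abundance(seq_counts, seq_to_refs, n_iter=200):
    """Estimate reference abundances from ambiguous regional counts.

    seq_counts:  {regional sequence ID: read count}
    seq_to_refs: {regional sequence ID: references containing that sequence}
    """
    refs = sorted({r for cands in seq_to_refs.values() for r in cands})
    # Start from a uniform guess over all references.
    abundance = {r: 1.0 / len(refs) for r in refs}
    for _ in range(n_iter):
        assigned = {r: 0.0 for r in refs}
        # E-step: split each sequence's reads among its candidate
        # references in proportion to the current abundance estimate.
        for seq_id, count in seq_counts.items():
            cands = seq_to_refs[seq_id]
            weight = sum(abundance[r] for r in cands)
            if weight == 0:
                continue
            for r in cands:
                assigned[r] += count * abundance[r] / weight
        # M-step: renormalise the assigned reads into abundances.
        total = sum(assigned.values())
        abundance = {r: assigned[r] / total for r in refs}
    return abundance


# Hypothetical example: V2 cannot tell ref_A from ref_B,
# and V4 cannot tell ref_B from ref_C.
seq_to_refs = {
    "V2_seq1": ["ref_A", "ref_B"],
    "V2_seq2": ["ref_C"],
    "V4_seq1": ["ref_A"],
    "V4_seq2": ["ref_B", "ref_C"],
}
seq_counts = {"V2_seq1": 80, "V2_seq2": 20, "V4_seq1": 50, "V4_seq2": 50}

estimate = em_abundance(seq_counts, seq_to_refs)
print({ref: round(value, 3) for ref, value in estimate.items()})
```

Neither region alone can resolve all three references here, but together the counts pin the estimate down: it should approach roughly 0.5, 0.3 and 0.2 for ref_A, ref_B and ref_C. That is the sense in which combining regions adds specificity.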
I think that q2-sidle is a very useful tool but, as you mentioned, it needs primer sequences. I checked it last week and realized this, so I had to rule it out. Just one question about q2-sidle, because I only had time to skim the documentation: I read that it generates a reconstructed table and associated taxonomy. Is this reconstructed table a kind of "consensus" table like the one I referred to in the subject of this discussion?
As far as I know, the primer sequences are proprietary. I read in other posts that there are some papers from which we could infer these sequences, but no document states them in an "official" way. I have used a plugin to separate the V regions, published this year by @KMaki, which is partly motivated by the fact that these sequences are still unknown.