Fragment-insertion SEPP for meta-analysis

horsemant · April 30, 2026, 8:03pm

I've read a number of posts on the forum about mixed V region meta-analysis efforts. I am working on a mixed V region 16S meta-analysis. I have about 30 studies comprising 1,800 samples. There are a variety of V regions (V1V2, V3V4, V4) and primer sets (of those stated). I have performed DADA2 separately on each study. Initially, I merged the study feature tables and rep seqs and ended up with ~1.2 million representative sequences. I assume this is because these are independently inferred ASVs from heterogeneous datasets. Regardless, I am at a methodological fork in the road: whether to (i) merge all ASVs/rep seqs across datasets first and perform a single SEPP insertion into a reference tree, or (ii) stratify datasets by V region and perform SEPP independently within each region prior to downstream analysis. Option (ii) would essentially be a two-step meta-analysis: calculate summary stats per region, then combine them for a secondary analysis, if that makes sense. If option (i) is recommended, should I filter rare features in each study individually, e.g. with '--p-min-frequency 50' and '--p-min-samples 5'? Yes, it will alter diversity metrics, but my aim is noise reduction. Ultimately, I know that there is no perfect way to account for primer/V region differences. I hope to get some guidance for moving forward.
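
For concreteness, the per-study rare-feature filter asked about above can be sketched in plain Python. This only illustrates the two thresholds (total frequency of at least 50, observed in at least 5 samples) and is not QIIME 2's implementation; the function name `filter_rare_features` and the toy table are hypothetical.

```python
def filter_rare_features(table, min_frequency=50, min_samples=5):
    """Keep features that meet both thresholds.

    `table` maps feature ID -> {sample ID: count}.
    Hypothetical helper for illustration, not part of QIIME 2.
    """
    kept = {}
    for feature, counts in table.items():
        total = sum(counts.values())                          # total frequency
        n_samples = sum(1 for c in counts.values() if c > 0)  # samples observed in
        if total >= min_frequency and n_samples >= min_samples:
            kept[feature] = counts
    return kept


# Made-up counts for a single study.
toy_study = {
    "asv_common": {f"s{i}": 20 for i in range(6)},  # total 120, 6 samples
    "asv_rare": {"s0": 3, "s1": 2},                 # total 5, 2 samples
    "asv_spiky": {"s0": 500},                       # total 500, 1 sample
}

filtered = filter_rare_features(toy_study)
print(sorted(filtered))  # ['asv_common']
```

Note that an abundant feature seen in a single sample (`asv_spiky`) is dropped as well, which is one way this kind of filter changes diversity estimates.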

Stefan · May 1, 2026, 7:38pm

Hi @horsemant, merging data from different V regions is always a challenge, even if they partially overlap.

I’d be in favor of your approach (i), as I don’t see how you would reconcile the resulting insertion trees after processing V1V2, V3V4 and V4 individually. I took this approach in “The host genotype actively shapes its microbiome across generations in laboratory mice” (Microbiome), section “Joint analysis with Robertson et al. data”.

Be aware that ASVs from different regions should be fully disjoint, so you will be limited to phylogenetic metrics computed from one big SEPP (aka insertion) tree. But it gives you the advantage of performing a true joint analysis rather than only a (multi-step) meta-analysis.
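
A toy sketch of why disjoint ASVs push you towards phylogenetic metrics: with no shared ASV IDs, a feature-based distance such as Jaccard is always 1, while unweighted UniFrac can still credit the internal branches both samples share. The tree, node names and branch lengths below are made up, and the functions are simplified illustrations rather than the UniFrac implementation used by QIIME 2.

```python
# Toy reference tree: node -> (parent, branch length). All values are made up.
TREE = {
    "clade1": ("root", 1.0),
    "clade2": ("root", 1.0),
    "v4_asv_a": ("clade1", 0.1),
    "v1v2_asv_a": ("clade1", 0.1),
    "v4_asv_b": ("clade2", 0.2),
    "v1v2_asv_b": ("clade2", 0.2),
}


def branches(tips):
    """Collect every branch (named by its child node) between the tips and the root."""
    seen = set()
    for node in tips:
        # Stop at the root, or once we reach a branch already collected.
        while node in TREE and node not in seen:
            seen.add(node)
            node = TREE[node][0]
    return seen


def unweighted_unifrac(tips_a, tips_b):
    """Branch length unique to one sample divided by branch length seen in either."""
    a, b = branches(tips_a), branches(tips_b)
    unique = sum(TREE[n][1] for n in a ^ b)
    total = sum(TREE[n][1] for n in a | b)
    return unique / total


def jaccard_distance(tips_a, tips_b):
    """1 minus the fraction of features shared between the two samples."""
    return 1 - len(tips_a & tips_b) / len(tips_a | tips_b)


v4_sample = {"v4_asv_a", "v4_asv_b"}
v1v2_sample = {"v1v2_asv_a", "v1v2_asv_b"}

print(jaccard_distance(v4_sample, v1v2_sample))               # 1.0, no shared ASVs
print(round(unweighted_unifrac(v4_sample, v1v2_sample), 3))   # 0.231
```

Here the two samples cover the same two clades, so only the short tip branches (0.6 of 2.6 total branch length) count as unshared.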

However, I’d be careful with grouping samples across variable regions in downstream analysis. Comparing distances between samples from different variable regions, i.e. one sample from V1V2 and the other from V3V4, should be okay-ish.

horsemant · May 2, 2026, 4:38am

Thanks for the reply, @Stefan. It was helpful and confirmed the direction I was leaning towards.

I will proceed with option (i) using a single SEPP insertion followed by UniFrac-based beta diversity and Faith's PD as the primary phylogenetic analyses.

In parallel, I am also considering a complementary genus-level approach, where I would collapse feature tables to genus level and merge the genus tables across studies. I would then run MaAsLin2 to fit a multivariable mixed-effects model that controls for V region, study, etc., in order to get Shannon and Bray-Curtis metrics. Is this strategy appropriate for transforming the study-specific ASVs into a more 'normalized' framework for comparison? If not, I can live with phylogenetic metrics.
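
To make the collapse-and-merge step concrete, here is a minimal sketch in plain Python. It assumes each study already has its own ASV-to-genus taxonomy assignment; the helper names, ASV IDs, sample IDs and counts are all hypothetical, and the modelling step is not shown.

```python
from collections import defaultdict


def collapse_to_genus(table, taxonomy):
    """Sum ASV counts per genus.

    `table` maps ASV ID -> {sample ID: count}; `taxonomy` maps ASV ID -> genus.
    ASVs without an assignment are pooled under "Unassigned".
    """
    genus_table = defaultdict(lambda: defaultdict(int))
    for asv, counts in table.items():
        genus = taxonomy.get(asv, "Unassigned")
        for sample, count in counts.items():
            genus_table[genus][sample] += count
    return {genus: dict(counts) for genus, counts in genus_table.items()}


def merge_genus_tables(*tables):
    """Combine per-study genus tables; genera are matched by name only."""
    merged = defaultdict(dict)
    for table in tables:
        for genus, counts in table.items():
            for sample, count in counts.items():
                merged[genus][sample] = merged[genus].get(sample, 0) + count
    return dict(merged)


# Two made-up studies from different V regions with disjoint ASV IDs.
study_v4 = {"v4_asv_1": {"A1": 10}, "v4_asv_2": {"A1": 5}}
tax_v4 = {"v4_asv_1": "Bacteroides", "v4_asv_2": "Bacteroides"}

study_v1v2 = {"v1v2_asv_1": {"B1": 7}, "v1v2_asv_2": {"B1": 3}}
tax_v1v2 = {"v1v2_asv_1": "Bacteroides", "v1v2_asv_2": "Prevotella"}

merged = merge_genus_tables(
    collapse_to_genus(study_v4, tax_v4),
    collapse_to_genus(study_v1v2, tax_v1v2),
)
print(merged)
```

After collapsing, the disjoint ASVs share the genus label `Bacteroides`, so the two studies now have at least one feature in common.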

Thanks again for the guidance.

Tim

jwdebelius · May 4, 2026, 1:25pm

Hi @horsemant,

Not the best answer, but my group did a scoping review about this recently which might be relevant:

https://www.biorxiv.org/content/10.1101/2025.02.11.637740v1

We didn't find a right answer, but you can see options.

Best,
Justine

PS. Hopefully it's okay to step in, @Stefan.

Stefan · May 7, 2026, 2:55pm

Hi @jwdebelius, quite the contrary. Thanks for your input and the valuable overview of strategies, even though the conclusion is that it is better to agree on ONE variable region beforehand :-/

jwdebelius · May 8, 2026, 1:25pm

@Stefan, I mean, we've been arguing about regions since before I joined the field. But, yeah, if you can find a way to get yourself down to one region or find some other way to get a consistent scaffold, my suspicion is that it will decrease your variation... But then you get a bias for that region.