Open or closed clustering for meta-analysis

We plan to conduct a microbiome meta-analysis of about 50 studies, which differ in their methods and sometimes in the region of the 16S gene they target. Given these limitations, we are wondering which clustering approach to choose and would welcome input. For open-reference, we thought about clustering each study independently and concatenating the results afterwards (we only care about the prevalence of taxa at the genus level, not their abundance). We would prefer closed-reference, though, so that all the studies can be processed together. Is there a satisfactory reference database that covers the complete 16S gene? I would like to have several options so that I can select the one most suitable for my environment and avoid losing too many sequences that fail to hit the reference collection.
Thank you in advance,

Hello and welcome to the QIIME 2 forum!

I see a larger problem that needs to be addressed first.
AFAIK, there are no quantitative methods to correct for platform or primer bias. Because of these confounders, I doubt that integrating heterogeneous data will lead to meaningful quantitative results.
Data quality should be a primary concern in such a study.


Thank you for your opinion. Our main goal is not to compare studies quantitatively but to look at microbial composition: we don't care about abundance here, we simply want to identify which taxa are or are not present, so that we can build a "universal" composition map of our microbiota of interest. I don't know whether the larger problem you identified still applies given that goal.

Yes, it is still a problem.
The observed composition might vary a lot depending on the region sequenced; take a look at "Comparison of different hypervariable regions of 16S rRNA for taxonomic profiling of vaginal microbiota using next-generation sequencing" (SpringerLink).



Hi @LBillet , hi @crusher083 ,

I agree with @crusher083 , you should be aware of and check for biases related to different methods and primers, etc. These are covariates to keep track of and statistically test to ensure that they do not covary with whatever your target variables are. This is another valuable paper re: methodological biases:

However, with ~50 different studies, I think that it should be possible to still make this comparison, as long as you ensure that your analysis is balanced (e.g., no significant covariation between technical differences and your target variables). So to answer your original questions:

Either open- or closed-reference clustering will work for studies that target the same marker-gene region. However, as you want to include studies with different primers, only closed-reference will work: all inputs need to be mapped to full-length 16S references, whereas the de novo step that is performed second in open-reference clustering would produce unique OTUs for the studies from different regions.

I think you might also have the method names reversed: closed-reference will allow you to analyze the studies separately and then merge them later (as the feature IDs all map to the same reference). Open-reference must be done on studies that are merged pre-clustering, or clustered iteratively (i.e., the output of the first clustering run gets used as the reference for the second, etc.).
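
To make the "cluster separately, merge later" idea concrete, here is a minimal sketch in plain Python of combining per-study closed-reference results into a genus-level presence/absence summary. It is not QIIME 2 code: the study names, genera, and counts are made up, and the `min_count` threshold and both function names are assumptions introduced for illustration. It only works because every study reports features against the same reference, so the genus labels are directly comparable.

```python
from collections import Counter

# Hypothetical per-study tables after closed-reference clustering and
# collapsing to genus level: study ID -> {genus: read count}.
# All names and numbers below are invented for illustration.
study_tables = {
    "study_A_V4": {"Bacteroides": 120, "Prevotella": 15, "Escherichia": 0},
    "study_B_V1V3": {"Bacteroides": 80, "Escherichia": 7},
    "study_C_V4": {"Prevotella": 3, "Lactobacillus": 22},
}


def genus_presence(table, min_count=1):
    """Return the set of genera observed at least min_count times."""
    return {genus for genus, count in table.items() if count >= min_count}


def prevalence_across_studies(tables, min_count=1):
    """Return the fraction of studies in which each genus is present."""
    seen = Counter()
    for table in tables.values():
        # Each study contributes at most one "hit" per genus.
        seen.update(genus_presence(table, min_count))
    n_studies = len(tables)
    return {genus: hits / n_studies for genus, hits in seen.items()}


prevalence = prevalence_across_studies(study_tables)
# Print genera from most to least prevalent, ties broken alphabetically.
for genus, frac in sorted(prevalence.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{genus}\t{frac:.2f}")
```

Raising `min_count` is one simple way to avoid calling a genus "present" on the basis of a single read, though the right threshold depends on your data.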

Yes, SILVA and GTDB are both quite extensive, depending on the sample type that you are targeting.
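
Going back to the balance check mentioned above: one rough way to see whether a technical factor (e.g., the 16S region) covaries with a target variable is to cross-tabulate the two at the study level and compute Cramér's V from the chi-square statistic. The sketch below is plain Python with made-up metadata; the `cramers_v` helper, the region labels, and the "case"/"control" groups are assumptions for illustration. It is a descriptive measure only, not a significance test, and not the only way to do this.

```python
from collections import Counter
from math import sqrt

# Hypothetical study-level metadata: (16S region, target group).
# Labels are invented; substitute your own technical and target variables.
studies = [
    ("V4", "case"), ("V4", "control"), ("V4", "case"), ("V4", "control"),
    ("V1-V3", "case"), ("V1-V3", "control"),
    ("V1-V3", "case"), ("V1-V3", "control"),
]


def cramers_v(pairs):
    """Cramér's V for two categorical variables given as (a, b) pairs.

    0 means no association in the table; 1 means the two variables
    are perfectly confounded.
    """
    n = len(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    observed = Counter(pairs)

    # Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for a, a_total in row_totals.items():
        for b, b_total in col_totals.items():
            expected = a_total * b_total / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


print(f"Cramér's V (region vs. group): {cramers_v(studies):.2f}")
```

A value near 0 suggests the design is balanced with respect to that factor; a value near 1 means the technical difference and the target variable cannot be separated.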

Good luck!
