Context & Motivation
I am currently working on a large-scale meta-analysis involving hundreds of projects and over 10,000 samples. My dataset is a mixture of:
V3-V4 region (e.g., amplicons from 341F/805R)
V4 region (e.g., amplicons from 515F/806R)
The primary goal is to perform a unified analysis across all studies, including alpha/beta diversity and taxonomic comparisons.
The Challenge
As these datasets target different (though overlapping) hypervariable regions, simply merging the feature tables leads to massive batch effects. Previously, the "standard" approach was to use qiime feature-classifier extract-reads to trim V3-V4 sequences down to the V4 region. However, I am concerned about the potential loss of taxonomic resolution and the strictness of primer-based trimming for such a diverse dataset.
Proposed Strategy: Greengenes2
I am planning to use the Greengenes2 (GG2) framework (via q2-greengenes2). My understanding is that the non-v4-v4-asvs pipeline can perform phylogenetic placement of these disparate fragments into a single unified backbone tree.
My logic is: By anchoring both V3-V4 and V4 ASVs to the same reference phylogeny, they should become comparable in a shared coordinate system without manual trimming.
Specific Questions
Is Greengenes2 currently considered the "Gold Standard" for handling mixed-region 16S meta-analyses?
Taxonomic Consistency: In your experience, how well do V3-V4 and V4 reads from the same biological source cluster together once placed on the GG2 tree?
Computational Performance: Given the scale of ~10,000 samples and potentially hundreds of thousands of ASVs, are there specific memory or threading considerations for the non-v4-v4-asvs or fragment-insertion steps?
Thank you for your time and for developing these incredible tools!
I apologize for a delay in reply, for whatever reason I didn’t get notified of this until this week.
Very cool but a recipe for technical bias as you note
This is a reasonable concern but as far as I know no one has demonstrated it is actually a problem. 16S no matter what you do has resolution limitations anyway.
No, this performs closed reference OTU picking.
Absolutely! Phylogenetic coordinates are far richer and more meaningful than taxonomy.
We’ve published on this. I’m unaware of a publication demonstrating it works for other references, and it could matter considerably as placement is almost certainly sensitive to tree construction.
This paper comes to mind but there may be others. I would encourage deferring concern of taxonomy until discussing results, rather than focusing on it for analysis, as taxonomy is low resolution and subject to more bias than phylogeny.
I would not use closed reference for this, but instead use placement. Scaling will be dependent on the number of ASVs not necessarily the number of samples. We regularly place ASVs from ~50k human samples in the AGP pipeline using SEPP on a single machine in reasonable resources so I wouldn’t worry to much about it. If scaling is a problem, you could place sets of ASV independently and then reconcile the placement data after the fact – I’ve done this for work in review and could share the code if necessary.
Here are my thoughts and recommendations. You’ve done a very good job of identifying critical areas. There is a very real possibility though that no matter what you do to the data that you will not be able to reconcile the technical difference directly; even if the variable region is reconciled there are many other sources of bias for meta analysis. QIIME1 allowed for detrending, a capability I don’t believe has been ported to QIIME2, which may be applicable but I would only consider that after all other options have been exhausted. If you have paired samples where the same physical specimen was sequenced for both variable regions there are additional options available, but it isn’t clear if that is the case here.
As to recommended approaches, here is what I would do:
Ensure your metadata contain a variable describing what variable region is associated with each sample
Determine in advance what you anticipate to be the highest effect size variable that would stand out no matter what.
Compute ASVs on all the samples, place using SEPP into Greengenes2, see this post for guidance.
Within each variable region (i.e., NOT a joint analysis), compute unweighted and weighted UniFrac, and confirm your anticipated high effect size variable separates the data in PCoA and confirm this with PERMANOVA. If this does not work then I strongly advise figuring out what is going on within a given variable region. Are some of the samples bad? Are the metadata wrong? Is there a mislabeling? Are there other metadata variables which should be considered? etc
After confirming a high effect size variable works as anticipated within a variable region, perform a joint analysis, compute unweighted and weighted UniFrac, check whether the high effect size variable still separates the data and separates it better than the metadata variable that encodes the variable region. Check this with PERMANOVA.
If this works, then consider proceeding to analyzing and interpreting the data in more detail.
If the above does not work, the next thing I would do is constrain the ASVs to the same variable region, and maximize the length of the resulting ASVs. Realistically, this means trimming the V3 data on 5’ to 515F, and then trimming all data on 3’ so the ASVs are the same length. This might be possible with extract-reads, but I’m not really sure. After doing this, retrying the above with placement and subsequent steps.
If these placement approaches do not work, there are a variety of heuristic tools available beyond detrending to perform batch correction. If going this route, I recommend taking a look at DEBIAS-M.