Using multiple 16S variable regions for analysis

jwdebelius · January 9, 2024, 6:08pm

Thanks for the tag @lizgehret! Yeah, I've spent a lot of time thinking about this in the last year. Attempting to write a paper, TBH, but it's slower going than I expected.

So, @Rakaya, like Liz said, the issue in directly combining the two regions is that you're not going to have an ID overlap if you only use the ASVs. Remember from "ESVs should replace OTUs" that the ASV ID stands for one exact nucleotide sequence. ASVs that differ by a single nucleotide are going to be identified as different features, so ASVs from regions with no sequence overlap are never going to share an ID.
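To make the ID problem concrete, here's a minimal sketch. It assumes feature IDs are MD5 hashes of the ASV sequence (a common default, but check how your own table was built), and the sequences are made-up toy strings, not real 16S reads.

```python
import hashlib


def feature_id(seq: str) -> str:
    """Hash a sequence into an ID (assumption: IDs are MD5 hashes of the sequence)."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


# Toy sequences for illustration only
v4_a = "TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAG"
v4_b = v4_a[:-1] + "C"  # same as v4_a except for the last nucleotide
v1v2 = "AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGC"  # a different region entirely

region_v4_ids = {feature_id(v4_a), feature_id(v4_b)}
region_v1v2_ids = {feature_id(v1v2)}

# A one-nucleotide difference already gives two distinct feature IDs
print(len(region_v4_ids))

# Features from different regions share no IDs, so the tables cannot be joined on ASV ID
shared = region_v4_ids & region_v1v2_ids
print(len(shared))
```

Even if the two regions came from the exact same organism, the tables would have no features in common, which is why you need one of the common-ID strategies.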

So, you need a way to provide a common ID for those features. There are 3 options that most people seem to use.

	Closed Reference OTUs	ASVs with phylogenetic Scaffolding	Genus-level taxa
Seminal Reference	Lozupone et al, 2012	Janssen et al, 2018	Several, see Wang et al
What it does	ASVs are clustered against a common reference database that is shared across multiple regions; sequences that don't match the database are discarded	ASVs get inserted into a reference backbone that spans multiple regions. ASVs that don't fit in the tree are discarded (although these are often low quality ASVs)	The taxonomic assignments for the ASVs are used to collapse to genus level or higher
Can it be used for UniFrac distance?	Yes (tree from OTU database)	Yes (insertion tree)	No
Can features be compared across regions w/o phylogeny?	Yes (reference ID)	No (ASV IDs are region specific)	Yes (taxonomic names should be common across regions)
Strengths	Feature-level resolution possible for everything; has been used frequently; computationally efficient	Lets you keep ASVs in high quality placements; easy to combine with collapsed data	Anecdotally best at minimizing region-to-region differences
Limitations	Reads that don't match the database are discarded, so you need a good database; lower resolution than ASVs; can sometimes lead to big regional effects	Really only useful for phylogeny-based analyses; must be combined with something else	Loss of resolution may limit biologically meaningful conclusions
Key qiime2 plugins	q2-vsearch	q2-fragment-insertion	q2-taxa
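
As a rough illustration of the genus-level option, here's a plain-Python sketch of collapsing two region-specific ASV tables onto shared genus labels. This is not the q2-taxa implementation; the function name, sample IDs, ASV IDs, and counts are all hypothetical, and dropping ASVs that aren't classified down to genus is a simplification I chose for the sketch.

```python
from collections import defaultdict

GENUS_LEVEL = 6  # kingdom; phylum; class; order; family; genus


def collapse_to_genus(table, taxonomy, level=GENUS_LEVEL):
    """Sum ASV counts per sample by the first `level` taxonomic ranks.

    table:    {sample_id: {asv_id: count}}
    taxonomy: {asv_id: "k__...; p__...; ..."} (semicolon-delimited ranks)
    ASVs with fewer than `level` ranks are dropped (a choice made for this sketch).
    """
    collapsed = {}
    for sample, counts in table.items():
        genus_counts = defaultdict(int)
        for asv_id, count in counts.items():
            ranks = [r.strip() for r in taxonomy[asv_id].split(";")]
            if len(ranks) < level:
                continue
            genus_counts["; ".join(ranks[:level])] += count
        collapsed[sample] = dict(genus_counts)
    return collapsed


strep = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
         "f__Streptococcaceae; g__Streptococcus")

# Hypothetical tables from two regions; the ASV IDs do not overlap
v4_table = {"s1": {"v4_asv1": 10, "v4_asv2": 5, "v4_asv3": 2}}
v1v2_table = {"s2": {"v1v2_asv1": 7}}
taxonomy = {
    "v4_asv1": strep,
    "v4_asv2": strep,
    "v4_asv3": "k__Bacteria; p__Firmicutes; c__Bacilli",  # not resolved to genus
    "v1v2_asv1": strep,
}

v4_genus = collapse_to_genus(v4_table, taxonomy)
v1v2_genus = collapse_to_genus(v1v2_table, taxonomy)

# After collapsing, both regions describe the same feature by name
shared = set(v4_genus["s1"]) & set(v1v2_genus["s2"])
merged = {**v4_genus, **v1v2_genus}
print(merged)
```

The two Streptococcus ASVs from the first region are summed into one genus feature, and that feature name now matches the one from the second region, so the merged table can go into non-phylogenetic analyses.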

So, I think the answer to your specific question about combining regions is, as always, that it depends on what you want. For your beta diversity/core microbiome work, I think your best bet is either to work on collapsed data or to move to OTU clustering. With both, you may need to consider whether there's a regional or study adjustment you need to make (database effects, reagent contamination, etc.) and how to model that.

Best,
Justine