Arbitrary ASV fragment placement in Greengenes2

callaban · September 6, 2024, 10:43pm

Made in conjunction with @wasade and Siavash Mir Arabbaygi

Greengenes2 provides a rich collection of V4 fragments already placed using DEPP, based on EMP 515F/806R. However, the set of placed fragments is currently fixed, meaning that users who have fragments from a different primer set are unable to utilize the higher-precision phylogenetic information that placement provides. DEPP, though powerful, currently requires a computationally expensive model trained per region used for placement, although a general region-agnostic variation is in the works.

Prior research has demonstrated the value of fragment placement for amplicon sequence variants. To address this important limitation of Greengenes2, we re-generated a SEPP reference from the Greengenes2 2022.10 backbone. We then used q2-fragment-insertion to assess the efficacy of these new placements against the existing DEPP placements. For the examination, we used a 16S V4 dataset from murine urine specimens (Forster Lab - Pitt / Wolfe Lab - Loyola).

To compare the methods, we computed a single rarefaction per feature table, then computed weighted UniFrac distances and principal coordinates. A Mantel test (Spearman) yielded a reasonable correlation of rho=0.93 p=0.001, suggesting that overall the sample-to-sample relationships observed in the datasets are preserved.

We then compared the ordinations using Procrustes analysis, and similarly noted a disparity of M^2=0.086 p=0.001. Note that with disparity, a smaller value indicates a better fit whereas a larger value indicates a worse fit (datasets are often considered to have a good fit for M^2 values less than around 0.25). These results further support the Mantel data, suggesting the overall sample-to-sample relationships appear consistent regardless of using the SEPP or DEPP placements.
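As a rough illustration of what the disparity M^2 measures, the sketch below centers both ordinations, scales each to unit norm, and then finds the best rotation/reflection and scaling of one onto the other; M^2 is the residual sum of squares that remains. It assumes NumPy is available, and the function name `procrustes_disparity`, the permutation scheme, and the toy coordinates are illustrative assumptions rather than the exact procedure used above.

```python
import numpy as np


def procrustes_disparity(coords_a, coords_b, permutations=999, seed=0):
    """Return (M^2, permutation p-value) for two ordinations.

    coords_a and coords_b are (samples x axes) arrays with the same
    shape, and row i must describe the same sample in both.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("ordinations must have the same shape")

    # Remove translation and overall scale from each configuration.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("ordinations must contain distinct points")
    a /= norm_a
    b /= norm_b

    def m2(x, y):
        # After optimal rotation and scaling, the residual is
        # 1 - (sum of singular values of x^T y)^2.
        s = np.linalg.svd(x.T @ y, compute_uv=False)
        return 1.0 - s.sum() ** 2

    observed = m2(a, b)

    # Null distribution: shuffle which sample each row of b belongs to.
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        if m2(a, b[rng.permutation(len(b))]) <= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy usage: the second ordination is a rotated, scaled, shifted copy
# of the first, so the disparity should be close to zero.
ord_1 = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.0, 3.0], [2.0, 1.0]])
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
ord_2 = 2.5 * ord_1 @ rotation + np.array([4.0, -1.0])
print(procrustes_disparity(ord_1, ord_2, permutations=99))
```

With both configurations normalized this way, M^2 lies between 0 (identical shape) and 1 (no shared structure), which is why small values such as 0.086 indicate a close match.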

It is important to note that, though this is a “best case scenario” where we’re using the same samples, we do not expect perfect correlation or fit because each distance matrix is the result of an independent rarefaction. It is further plausible that some of the fragments available to SEPP were not included in what was placed in Greengenes2, as that set of placed fragments is now dated.
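To make the rarefaction point concrete, here is a toy sketch of subsampling one sample's feature counts without replacement. Two independent draws from the same counts generally give slightly different tables, so distances computed from them differ even before any placement method is involved. The feature IDs, counts, depth, and the `rarefy` helper are all made up for illustration and are not QIIME 2's implementation.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed):
    """Subsample `depth` observations without replacement.

    counts maps feature ID -> observed count for a single sample.
    """
    # Expand counts into one entry per observed read.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the total count for this sample")
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


# Hypothetical sample with three ASVs.
sample = {"asv_a": 50, "asv_b": 30, "asv_c": 20}

# Two independent rarefactions to the same depth.
first = rarefy(sample, depth=40, seed=1)
second = rarefy(sample, depth=40, seed=2)
print(first)
print(second)
```

Both draws sum to the requested depth, but the per-feature counts can differ, which is one source of the imperfect agreement described above.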

Given these observations, it now seems viable to use SEPP with Greengenes2 to obtain a phylogeny from arbitrary amplicon sequence variants. A Greengenes2 2022.10 reference compatible with q2-fragment-insertion can be obtained from the Greengenes2 FTP.

For V4 fragments that are already in the tree using DEPP, phylogenetic taxonomy outperforms Naive Bayes. We show here that the SEPP placements are highly correlated with the DEPP placements. We have not yet evaluated whether this is also true for fragments other than V4, and therefore we advise use of Naive Bayes taxonomy assignment for these other fragments at present.

What we observe is a strong correlation between the results using both methods - Procrustes M^2=0.086 p=0.001, and a Mantel rho=0.93 p=0.001.

Similarly, we also looked at V3V4 urine samples. We compared closed-reference Greengenes2 to SEPP insertion, using weighted UniFrac distances and PCoA for the Mantel and Procrustes tests.

We observe strong correlation in the results using the two methods - Procrustes M^2=0.104 p=0.001, and a Mantel rho=0.94 p=0.001.