I have a 16S rRNA and an 18S rRNA dataset for the same human faecal samples. The main hypothesis of our study was to identify differences in microbiome composition (bacterial and eukaryotic) between patients and healthy controls. However, it looks like the two groups are not different. Something else that we are interested in is whether the eukaryotic organisms identified in the 18S data are associated with any bacterial taxa in the 16S data.
There are few enough organisms in the 18S data that I was able to include their relative abundance as columns in the metadata file, and I coloured my 16S rRNA beta diversity PCoA plot by the abundance of the eukaryotes detected in the 18S data (continuous gradient colouring). Viewing each one, I identified one organism (Blastocystis) that exhibited a pattern - samples that contained no or low abundance of this eukaryote were on the left side along PC1, and higher abundance samples were on the right side.
So the bacterial composition appears to be different with differences in Blastocystis abundance, and I’d now like to know which particular bacterial taxa vary. I started with an ANCOM analysis to identify differentially abundant taxa between samples with low or no Blastocystis abundance and those with high Blastocystis abundance (there was a big jump in the abundance from low to high so this felt like an appropriate split). There were two bacterial families that were significant, and I’ve made some nice boxplots of their abundance in the high/low Blastocystis groups.
I am considering also trying a correlation analysis of some sort, to identify bacterial taxa whose abundance correlates with Blastocystis, which might reveal more than splitting the samples up into groups. I have two questions:
Has this become data-dredging? I have looked through several variables (the 18S-measured organisms) and identified one where I can visually observe a pattern on the PCoA. I am then statistically testing this variable. However, I’m not sure how else to tease out what this exploratory observation actually is (i.e. I’ve observed an association between Blastocystis and the bacterial composition, but which bacteria are actually different?)
I’ve looked at the plugins for mantel and bioenv, and my understanding is that these methods will tell me if there is a correlation between a continuous variable and the bacterial composition, but they don’t answer my question of what organisms are different as Blastocystis abundance goes from low to high. I am somewhat hesitant to do further statistical tests on something that was not an a priori hypothesis (see point 1) but I am not sure how else to tease out exactly which bacteria are associated with the abundance of Blastocystis - which is part of my exploratory observation, rather than something that I want to statistically test. How might I do this - what can be done to understand what’s behind a pattern observed in the PCoA?
Hi @rachaellappan,
I would recommend checking out q2-SCNIC or another tool (like spieceasi, not yet implemented in QIIME 2) for doing correlation/network analysis. These are designed for this type of comparison, relieving some of your concerns about data dredging and the appropriateness of the tests you use.
Since you are specifically interested in Blastocystis, you could also make a PCoA biplot as described here:
But SCNIC or speaceasi would allow you to identify more connections between the 16S and 18S data.
I generally would not recommend using sparCC as a correlation method when correlating between data types. I don’t have any experience with 18S sequencing but if the biases and sequencing techniques tend to be the same as 16S you could try concatenating your 16S and 18S tables. Then use that concatenated table as input to SCNIC’s within mode with sparCC as the correlation metric. I’m not quite sure how having an entirely nonoverlapping set of features would affect the sparCC metric but my hunch is that it should still work. You would have to try to read more into the sparCC paper or contact the authors to dig into that in more detail.
Alternately when I correlate between different data types (usually 16S and LC/MS or NMR metabolomics) I will build correlation networks within each data type using an appropriate metric (e.g. sparCC or 16S and spearman for LC/MS metabolomics) and then correlate between the data types with either spearman or pearson correlation. Then these three networks can be merged using something like cytoscape or networkx, if you like working in python.
I’m hesitant to concatenate the 16S and 18S datasets - my 18S dataset has substantially fewer sequence reads and fewer features than the 16S, so I’m not sure if this is an appropriate approach when SparCC estimates the fraction of each feature. I will ask Yonatan Friedman what he thinks.
My only concern with using Spearman to correlate between them was that both datasets are sparse and compositional so I didn’t think this would be accounted for - but perhaps the compositional aspect is less of an issue when correlating between datasets?
I emailed Jonathan and just wanted to share his thoughts here for anyone who is interested. Correlating compositional data is quite a problem!
I generally wouldn’t recommend just concatenating the datasets.
This violates the assumption of multinomial sampling used in SparCC (and similar assumptions made by every other tool for compositional analysis that I’m familiar with).
However, this may not be so bad if your samples are diverse.
In fact, if your samples are diverse enough, standard correlation measures will perform well. See here for some benchmarks.
Unfortunately, I don’t think there are good benchmarks for correlations between datasets.
An alternative is to first transform both datasets to a euclidean space, and then use standard technique to correlate between the transformed variables.
The PhILR transform seems very promising for that purpose. I think QIIME2 also has something similar implemented through gneiss.
This approach has the big advantage that the new variables are no longer comportional, and are orthogonal, so you can safely use all the standard statistical analysis techniques.
The downside is that interpretation of the results is more changeling, as they involved the transformed variable. In PhILR, this issue is somewhat mitigated as the new variables (called balances) are ratios of taxonic groups, and thus still have pretty clear biological meaning.
TL;DR
Concatenating the data sets, or correlating between them using spearman may be fine, but it’s hard to tell exactly how well it would work. You could try it and compare will the results obtained for appropriately shuffled data.
Transforming your data will be more statistically sound, but interpretation of results is not as straightforward.