I have a 16S rRNA and an 18S rRNA dataset for the same human faecal samples. The main hypothesis of our study was to identify differences in microbiome composition (bacterial and eukaryotic) between patients and healthy controls. However, it looks like the two groups are not different. Something else that we are interested in is whether the eukaryotic organisms identified in the 18S data are associated with any bacterial taxa in the 16S data.
There are few enough organisms in the 18S data that I was able to include their relative abundance as columns in the metadata file, and I coloured my 16S rRNA beta diversity PCoA plot by the abundance of the eukaryotes detected in the 18S data (continuous gradient colouring). Viewing each one, I identified one organism (Blastocystis) that exhibited a pattern - samples that contained no or low abundance of this eukaryote were on the left side along PC1, and higher abundance samples were on the right side.
So the bacterial composition appears to be different with differences in Blastocystis abundance, and I’d now like to know which particular bacterial taxa vary. I started with an ANCOM analysis to identify differentially abundant taxa between samples with low or no Blastocystis abundance and those with high Blastocystis abundance (there was a big jump in the abundance from low to high so this felt like an appropriate split). There were two bacterial families that were significant, and I’ve made some nice boxplots of their abundance in the high/low Blastocystis groups.
I am considering also trying a correlation analysis of some sort, to identify bacterial taxa whose abundance correlates with Blastocystis, which might reveal more than splitting the samples up into groups. I have two questions:
Has this become data-dredging? I have looked through several variables (the 18S-measured organisms) and identified one where I can visually observe a pattern on the PCoA. I am then statistically testing this variable. However, I’m not sure how else to tease out what this exploratory observation actually is (i.e. I’ve observed an association between Blastocystis and the bacterial composition, but which bacteria are actually different?)
I’ve looked at the plugins for
bioenv, and my understanding is that these methods will tell me if there is a correlation between a continuous variable and the bacterial composition, but they don’t answer my question of what organisms are different as Blastocystis abundance goes from low to high. I am somewhat hesitant to do further statistical tests on something that was not an a priori hypothesis (see point 1) but I am not sure how else to tease out exactly which bacteria are associated with the abundance of Blastocystis - which is part of my exploratory observation, rather than something that I want to statistically test. How might I do this - what can be done to understand what’s behind a pattern observed in the PCoA?
Would be grateful for any advice you can give me.