I’ve got a pile of amplified arthropod COI data from about 600 bat guano samples. My typical workflow includes testing a few different distance metrics, like Dice-Sorensen and Bray-Curtis, to see how read abundances influence the results, as well as weighted and unweighted UniFrac to see how adding phylogenetic information shapes the resulting ordinations (and PERMANOVAs). The weighted UniFrac ordination tends to have more variance explained in the first two PCs, but the samples aren’t spread out quite as much as with the other distance measures.
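As a concrete illustration of why the abundance-weighted vs. presence/absence comparison matters: Bray-Curtis on raw counts and Bray-Curtis on binarized counts (the latter is equivalent to Dice-Sorensen dissimilarity) can give very different answers for the same pair of samples. A toy sketch with made-up counts, not my actual data:

```python
import numpy as np
from scipy.spatial.distance import braycurtis

# Two hypothetical samples sharing two ASVs, but with their dominance reversed
x = np.array([100, 1, 1, 0])
y = np.array([1, 100, 0, 1])

# Abundance-weighted: dominated by the two high-count ASVs swapping places
bc_abund = braycurtis(x, y)  # ~0.98

# Presence/absence: Bray-Curtis on binarized counts = Dice-Sorensen dissimilarity
bc_binary = braycurtis((x > 0).astype(int), (y > 0).astype(int))  # ~0.33
```

Same pair of samples, wildly different dissimilarity, purely because of how the read abundances are weighted.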
Samples come from one of 9 different locations, collected weekly from April through October. We collected about 10 samples per week, but the sequencing results were less than inspiring. The dataset used to create the plot below was filtered such that:
- All samples have at least 1,000 non-rarefied reads (notably, these are 1,000 arthropod COI reads; contaminants such as host (bat) DNA were removed before those counts were assessed)
- Any sampling window at a given site has to have at least 2 samples (no singleton samples per site/date)
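In case it helps to see those two filters concretely, here’s a pandas sketch of the same logic on a made-up table (sample IDs and column names are hypothetical, not from my metadata):

```python
import pandas as pd

# Toy per-sample table: site, weekly sampling window, and arthropod COI read totals
df = pd.DataFrame({
    "sample":    ["s1", "s2", "s3", "s4", "s5"],
    "site":      ["HOL", "HOL", "FOX", "FOX", "FOX"],
    "week":      [1, 1, 1, 1, 2],
    "coi_reads": [1500, 2000, 900, 1200, 3000],
})

# Filter 1: keep samples with >= 1000 non-rarefied arthropod COI reads
kept = df[df["coi_reads"] >= 1000]

# Filter 2: drop any site/week window left with fewer than 2 samples
kept = kept.groupby(["site", "week"]).filter(lambda g: len(g) >= 2)
# Only the two HOL week-1 samples survive both filters here
```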
I took a crack at my first biplot to see if there were any interesting features that might project well in the weighted unifrac dataset, so I applied @jbisanz’s code in this thread to make this biplot:
Each sample location is a text/point on the main plot, colored by the time of season it was collected. You’ll notice the colors are kind of all over the place, but there is one outlier cluster in the top right of the main plot with a bunch of points labeled ‘HOL’ (it’s the Squam Science Center in Holderness, NH - home of the most adorable river otters!). The inset plot shows the species projections that match the main plot, but these are colored by arthropod Order. You might notice that the two dark green ones (ASV-15 and ASV-17) project towards that HOL cluster, which makes sense: these are aquatic bugs, and that location sits right next to a big lake, whereas the other sites don’t have nearly the same extent of aquatic habitat.
I was totally happy with this plot and ready to start writing up my paper, but I decided to give the recent DEICODE plugin by @cmartino a spin and see what might come of it. I was curious if using non-rarefied reads might change the interpretations of what’s going on. And the answer is… sort of?
In this case, my take-home message would be a bit different: sites tend to spread out based more on the earlier parts of the sampling period, whereas once you get into summer there’s a lot of compositional similarity; as you move into the very edges of the sampling period, some sites become distinct again. A new outlier cluster emerges in this ordination - FOX (Fox State Forest, Hillsborough, NH - visit if you like foliage and bugs!) - while HOL is less of an outlier here than it was in the earlier plot.
What jumped out at me were the axis labels - the PC % variance explained is off the charts in the DEICODE ordination; it’s larger than any value I’ve ever come across. That’s where I’m getting a bit suspicious, and worried that there are assumptions I’m not meeting that could be skewing these calculations.
Unlike microbiome data, these bat guano samples don’t have nearly as many features per sample - typically between 10 and 30 ASVs per sample (though these samples are particularly low in richness compared to other guano datasets I’ve dealt with). The code executed to generate the biplot .qza file was as follows:

```
qiime deicode rpca \
  --i-table $DEICODETABLE \
  --p-min-feature-count 2 \
  --p-min-sample-count 1000 \
  --o-biplot deicode_biplot.qza \
  --o-distance-matrix deicode_distmat.qza
```
Initially I wondered if having such a low `--p-min-feature-count` was an issue, so I ran the same command but raised that parameter from 2 to 10 (which probably doesn’t sound like much for most microbiome datasets, but for guano data that’s a bigger jump than it seems).
The resulting plot shifts yet again, this time actually spreading out both of the earlier outliers: FOX and HOL now each pull away in their own respective directions. So maybe the minimum-count filtering matters?
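For what it’s worth, if I understand the flag correctly, `--p-min-feature-count` drops ASVs whose total count summed across samples falls below the cutoff, before the decomposition ever runs. A toy numpy sketch (made-up counts) of why the jump from 2 to 10 can remove a big share of the features in a sparse, low-richness table like mine:

```python
import numpy as np

# Toy samples-by-ASVs count table (sparse and low-richness, like guano data)
counts = np.array([
    [30, 5, 2, 2],
    [15, 4, 1, 0],
    [ 5, 3, 1, 0],
])
totals = counts.sum(axis=0)  # per-ASV totals across samples: [50, 12, 4, 2]

kept_at_2 = counts[:, totals >= 2]    # all 4 ASVs survive
kept_at_10 = counts[:, totals >= 10]  # only 2 ASVs survive
```

Half the features gone from one small threshold change, so it’s not surprising the ordination moves.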
I’m curious to hear about other people’s experiences: which parameters have you played around with, and which seem more or less sensitive? This thread suggested that I might need to be concerned that my data is probably spanning a few gradients (space and time):
However, I’m getting such high % variance explained with the defaults that I’m starting to wonder if I need to adjust the `--p-rank` parameter, or if I’m already overfitting with the three axes being used in the decomposition of the data.
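One guess I’ve been toying with (I haven’t checked DEICODE’s internals, so this is purely a hypothesis): if proportion explained is computed relative to just the rank-3 factorization rather than the full singular-value spectrum, large percentages fall out naturally, since the three retained axes must sum to 100%. A quick numpy illustration of the difference between the two normalizations - not DEICODE code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 25))  # stand-in for a transformed sample-by-ASV matrix
X -= X.mean(axis=0)            # center the columns

s = np.linalg.svd(X, compute_uv=False)  # singular values, descending

# Variance explained by the top 3 PCs, relative to the FULL spectrum
vs_full = s[:3] ** 2 / (s ** 2).sum()

# Same 3 PCs, normalized against only the 3 retained components:
# these always sum to 100%, so each axis looks inflated
vs_rank3 = s[:3] ** 2 / (s[:3] ** 2).sum()
```

If something like the second normalization is in play, the big axis labels would be a reporting convention rather than overfitting - but I’d love to hear which it is.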
Any insights would be greatly appreciated. Thanks for the great tool!