when things look good, I wonder where I've gone wrong...

devonorourke · July 15, 2019, 2:15pm

I've got a pile of amplified arthropod COI data from about 600 bat guano samples. My typical workflow includes testing out a few different distance metrics like Dice-Sorensen and Bray-Curtis to see how read abundances influence the results, as well as weighted and unweighted Unifrac to see how adding phylogenetic information skews the resulting ordinations (and permanova). The weighted unifrac data tends to have more variance explained in the first two PCs, but the data isn't necessarily quite as spread out as other distance estimates.

Samples come from one of 9 different locations, collected weekly from April through October. We collected about 10 samples per week, but the sequencing results were less than inspiring. The dataset used to create the plot below was filtered such that:

All samples have at least 1000 non-rarefied reads (notably, these are 1000 arthropod COI reads as contaminants like the host (bat) DNA has been removed prior to assessing those counts)
Any sampling window at a given site has to have at least 2 samples (no singlet samples per site/date)

I took a crack at my first biplot to see if there were any interesting features that might project well in the weighted unifrac dataset, so I applied @jbisanz's code in this thread to make this biplot:

Each sample location is a text/point on the main plot, colored by the time of the season it was collected. You'll notice that the colors are kind of all over the place, but there is this one outlier in the top right of the main plot with a bunch of points labeled 'HOL' (it's the Squam Science Center in Holderness, NH - home of the most adorable river otters!). The inset plot shows the species projections that match the main plot, but these are colored by the arthropod Orders. You might notice the two dark green ones (ASV-15 and ASV-17) are projecting towards that HOL cluster, which is sensible because these are aquatic bugs, and that location happens to be right next to a big lake, whereas other sites don't have the same scope of aquatic landscapes.

I was totally happy with this plot and ready to start writing up my paper, but I decided to give the recent DEICODE plugin by @cmartino a spin and see what might come of it. I was curious if using non-rarefied reads might change the interpretations of what's going on. And the answer is... sort of?

In this case, my take home message would be a bit different: sites tend to be spread out based more on the earlier parts of the sampling period, whereas when you get into summer time, there's a lot of compositional similarity; as you move into the very edges of the sampling period, some sites become different again. A new outlier cluster emerges in this ordination - FOX (Fox State Forest, Hillsborough, NH - visit if you like foliage and bugs!), while the earlier outlier at HOL is less of an outlier in this instance as it was in the earlier plot.

What jumped out to me were the axes labels - the PC % variance explained is off the charts in the DEICODE ordination; it's larger than any other value I've ever come across. And that's where I'm getting a bit suspicious, and worried about some assumptions that I'm not meeting that could be skewing these calculations.

Unlike microbiome data, there aren't nearly as many features per sample in these bat guano samples - typically between 10-30 ASVs per sample (though these samples are particularly low in richness compared to other guano datasets I've dealt with). The code executed to generate the biplot .qza file was as follows:

qiime deicode rpca \
    --i-table $DEICODETABLE \
    --p-min-feature-count 2 \
    --p-min-sample-count 1000 \
    --o-biplot deicode_biplot.qza \
    --o-distance-matrix deicode_distmat.qza

Initially I wondered if having such a low --min-feature-count parameter was an issue, so I tried running the same code but upped that parameter from 2 to 10 (probably doesn't sound like much to most microbiome datasets, but for guano data that actually is a bigger jump than it seems).

The resulting plot shifts yet again, this time actually spreading out those two plots worth of outliers: FOX and HOL are now both outliers in their own respective directions. So maybe the minimum filtering thing matters?

I'm curious to hear from other people's experiences what parameters they've played around with, and what seems to be more or less sensitive. This thread suggested that I might need to be concerned with the fact that my data is probably working across a few gradients (space and time):

However, I'm getting such high % variance explained with the defaults that I'm starting to wonder if I need to adjust the --p-rank parameter, or if I'm already overfitting with the three axes being used in the decomposition of the data.

Any insights would be greatly appreciated. Thanks for the great tool!

cmartino · July 15, 2019, 3:18pm

Hi @devonorourke,

You are correct in lowering the minimum count sum perameters. In this case lowering them to 1 or zero may even be warranted.

The DIECODE percent variance explained is a relative percentage across each axis (in this case only three). For example, if the rank is two there will only be two axis and they will sum to 100% of the variance. This can make the % variance increase dramatically over other methods with many PCs. So I would not compare it between methods (like PCoA) and only compare it between other axis in the biplot.

As far as the gradients go, a clear sign you are dealing with a gradient is if you see a horseshoe affect in other methods. (See this paper for a more thorough explication of that). From what I see on your UniFrac plot that is not a problem.

Also be sure to update to the most recent version of DEICODE (v0.2.3). There was an issue with the loadings centering in v0.2.2.

I hope this helps,

Cameron

devonorourke · July 15, 2019, 3:39pm

Thanks for the quick reply @cmartino,

I think I'm misunderstanding what the appropriate tweak here would be. I thought I needed to increase this value to account for weird samples that have very few features. Are you suggesting I remove that filter altogether?

Would the only fair comparison between the PCoA result and this PCA result be to examine the relative proportions between the same two PC's being ordinated? For example, if the PCoA shows PC1 == 38% and PC2 == 7%, that's about a 5:1 difference, whereas with the first PCA plot (using the exact same data as the PCoA) I get 83% for PC1 and 14% for PC2, which is about a 6:1 raio.

Not sure what I can say about comparing these two methods; perhaps understanding how much variation is explained in the first two PCs for PCA isn't possible with the DEICODE method?

I agree that there isn't a particularly troubling horseshoe pattern in the plot I presented, but I bet you could find one among the other three distance methods I tested (all on the same dataset)...

I installed the program with conda install -c conda-forge deicode this morning and can confirm with conda list that I'm running v-0.2.3, thanks! I wonder if future versions of the program could implement a deicode --version call, or display the version when you enter deicode --help?

cmartino · July 15, 2019, 5:04pm

Good questions @devonorourke.

The total sum sequence cut offs have two main purposes. The sample sum cut offs ensure very undersequenced samples (something like less than 100 sequences total) are removed. The feature sum cutoffs ensure taxa that appear very sparsely (just a few total counts) across all samples are removed, because they are often contamination in very diverse samples. In your case I think the low diversity would motivate keeping all the features/samples.

I think the ratio comparison makes more sense but I would still proceed cautiously. I would motivate the use of DEICODE more on it's ability to cluster out interesting biology. Furthermore, you can use the ordination from DEICODE to generate log-ratio plots with Qurro thanks to @fedarko. This should allow you to directly relate log-ratios from arrows in the biplot to metadata. Which is a nice addition to the ordinations.

For the gradient issues. You can also use compositionally coherent methods like ALDEx2 or songbird which give similar feature rankings. You can use time and location as variables in models using those tools.

Good point. I will be sure to add a version command. See the now open issue.

(Sorry for not linking questions between responses, responding via my phone)

devonorourke · July 15, 2019, 5:57pm

Got it. I've done a bit of testing using a few of my datasets that had mock communities and I can attest to unexpected ASVs popping up in low abundances in these control samples. It'll be fun to explore how adjusting the minimum sequences total threshold would change my outcomes, particularly because there are a fair number of samples I could recover that had something like 750 sequences instead of the 1000. I'm terrified of going below 1000 for no sensible reason other than trying to avoid a reviewer's wrath.

Thanks for the recommended new tools, but I'm holding off on those until I graduate in (fingers crossed) 28 days. This is very likely not the last time you'll hear from me

Thanks for the cell phone reply!

system · August 15, 2019, 11:57pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.