When to collapse (the data, not yourself)

devonorourke · April 5, 2019, 4:15pm

I've been applying the classify-samples scripts (thanks @Nicholas_Bokulich!) to some bat guano samples and was wondering about the best strategy for interpreting these data.

My data is structured such that there are 2 factors (Site and Date). I've looked at modeling effects for the factors individually and in combination, and by including the --p-optimize-feature-selection parameter, I've been trying to figure out which ASV features are most relevant for the model to discriminate among the factor(s).

In trying to figure out which ASVs are most important, I applied a cumulative sum function to the importance field (from the feature_importance.qza output) and found that there are usually between 30 and 80 ASVs that provide something like 50% of the overall predictive power. As one example, just 39 ASVs account for 50% of the "importance" when discriminating by Month. If I investigate those ASVs in particular, and examine how many samples they are detected in for a particular month, I notice that there are some interesting dynamics going on depending on the Order-rank the ASV is associated with. The following figure colors each ASV according to its Order-rank, with each dot/line representing a particular ASV. All "Dipteran" ASVs are labeled on the left, all non-Dipteran Orders are labeled on the right:

Lots of things moving around! But then I started looking at just what those ASVs represented. Here I'll plot the exact same data, but apply the Genus-rank information to each label instead of the ASV alias used previously:

Notice on the left-hand side how, among all those Dipterans, there's a fair bit of redundancy? Those pesky Culex and Aedes mosquitoes, for instance? Or on the right-hand side, with the Gyponana leafhoppers? ... This got me wondering about collapsing these data. Given that several of these ASVs share a common Genus (they nearly always share the same species name too), I collapsed the data to that shared Genus; the plot is much neater:

If you're still following along (thanks!), this got me wondering whether this kind of post-hoc collapsing is the appropriate way to analyze trends, or whether I should have collapsed my data on the front end before feeding it to the machine learning model. On one hand, collapsing to Genus-level ahead of time can (and likely will) lump together distinct insect species that share a Genus, discarding any species-level differences; this would further obscure the data.
On the other hand, keeping things at the ASV-level can lead to instances in which two ASVs differ between Sites, or Dates, and yet those ASVs are essentially the same sequence with perhaps a single nucleotide variant across the ~180 bp amplicon of interest. Collapsing post hoc provides the flexibility for those 'just so' cases, but then, well, it's doing the kind of personal-selection thing I try to stay away from.
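
For anyone wanting to try the same two steps, here is a minimal sketch in plain Python of the cumulative-importance cutoff and the genus-level collapse described above. It is not the exact code behind the plots: the function names, the toy importance scores, the counts, and the genus assignments are all hypothetical, and it assumes the importance scores and feature table have already been exported from their .qza artifacts into ordinary dictionaries.

```python
from collections import defaultdict
from itertools import accumulate


def top_features(importance, fraction=0.5):
    """Smallest set of top-ranked features whose importances reach `fraction` of the total."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    running = accumulate(score for _, score in ranked)  # cumulative sum
    selected = []
    for (feature, _), cumulative in zip(ranked, running):
        selected.append(feature)
        if cumulative >= fraction * total:
            break
    return selected


def collapse_by_genus(counts, genus_of):
    """Sum per-sample ASV counts within each genus label."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for asv, per_sample in counts.items():
        genus = genus_of.get(asv, "Unassigned")
        for sample, n in per_sample.items():
            collapsed[genus][sample] += n
    return {genus: dict(samples) for genus, samples in collapsed.items()}


# Hypothetical toy data, for illustration only.
importance = {"ASV1": 0.30, "ASV2": 0.25, "ASV3": 0.20, "ASV4": 0.15, "ASV5": 0.10}
genus_of = {"ASV1": "Culex", "ASV2": "Culex", "ASV3": "Aedes",
            "ASV4": "Gyponana", "ASV5": "Gyponana"}
counts = {
    "ASV1": {"s1": 10, "s2": 0},
    "ASV2": {"s1": 5, "s2": 2},
    "ASV3": {"s1": 0, "s2": 7},
    "ASV4": {"s1": 1, "s2": 1},
    "ASV5": {"s1": 0, "s2": 3},
}

top = top_features(importance, fraction=0.5)
by_genus = collapse_by_genus(counts, genus_of)
print(top)
print(by_genus)
```

Running `collapse_by_genus` on the full table before training would be the "front end" option, while applying it only to the ASVs returned by `top_features` would be the post-hoc option.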

Thanks for any strategies you might offer!

Nicholas_Bokulich · April 8, 2019, 12:26pm

Sweet plots @devonorourke!

I would personally avoid clustering on the front or back end. If you do it at either end, do it on the front end and accept the risk of losing important information... but let the model tell you that! It is certainly interesting to test whether you get similar (or more?) predictive power with taxonomy-collapsed features compared to ASVs.

But that variation could be important. I'd say it's troublesome if you suspect these are the exact same species and exact same individual insects but copy-number variation is making these slight variants covary almost perfectly (e.g., those Culex ASVs), but that is not really troublesome at all from the standpoint of these models... just for interpreting whether you really have 3 or 4 distinct Culex species/subpopulations or if these are all one and the same.
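
One rough way to check for that kind of near-perfect covariation is to compute pairwise correlations among the ASVs assigned to the same genus. The sketch below uses Pearson correlation on raw per-sample counts, which is only one possible choice (an assumption here, not something q2-sample-classifier does for you). The ASV names, counts, and the 0.95 threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def covarying_pairs(abundance, threshold=0.95):
    """Pairs of features whose per-sample abundances correlate at or above `threshold`."""
    pairs = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= threshold:  # nan compares False, so constant features are skipped
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical counts for three ASVs across five samples.
culex = {
    "Culex_ASV1": [10, 20, 0, 40, 5],
    "Culex_ASV2": [5, 10, 0, 20, 2],   # tracks ASV1 at roughly half the count
    "Culex_ASV3": [0, 3, 30, 1, 12],   # behaves differently
}

flagged = covarying_pairs(culex)
print(flagged)
```

Pairs that get flagged are candidates for being variants from the same individuals; pairs that do not may reflect genuinely distinct species or subpopulations.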