I’d like to get a sense from community members of whether the following two choices are warranted, or extremely dumb. This extends a previously asked post about how to utilize technical replicates.
In my situation, however, the samples aren’t known to be replicates, though they could be. Further, some of the advice in that post describes leveraging relative abundances in distance methods, but I don’t suspect that will necessarily help in my case (and it may actually be problematic).
So, the problem:
I deal with COI amplicons, sequenced from arthropod DNA in bat guano. Guano pellets were collected and sequenced individually - one piece of poop == one set of PE fastq files. Crucially, multiple samples were collected from the same location on the same date. Also important: there were multiple sites at which samples were collected on the same dates. You can imagine the dataset being organized something like this:
| Sample | Site  | Date    |
|--------|-------|---------|
| 01     | Texas | April-1 |
| 02     | Texas | April-1 |
| 03     | Texas | April-1 |
| 04     | Texas | May-20  |
| 05     | Texas | May-20  |
| 06     | Maine | April-1 |
| 07     | Maine | April-1 |
My problem is that my guano collection happened passively - we went to a site with a piece of plastic, bats randomly pooped on it over the course of a week, and we placed an individual guano pellet from that pile in a single tube (and did that 10 times at a site on a given week). These pellets could come from 10 unique bats, or one prodigious pooper (in case you’re curious, no, I have no plans to use other molecular techniques to assess individuality, because there are about 3000 samples in total I’d have to resequence).
Because I can’t attribute bat individuality per sample, it strikes me that the best way to treat samples is to collapse all data to the unique Site+Week. In the data table above, I’d group Samples 01-03 together, for example; Samples 04-05 form another distinct Site+Week group, and Samples 06-07 are yet another Site+Week group to combine.
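For what it’s worth, here’s roughly how I’m picturing the collapse in code. This is just a sketch on a toy table - the sample IDs, ASV names, and metadata columns below are all made up, and I’m assuming a presence/absence OTU table with samples as rows:

```python
import pandas as pd

# Toy presence/absence OTU table: rows = samples, columns = ASVs (made-up names).
otu = pd.DataFrame(
    {"ASV_a": [1, 0, 0, 1, 0], "ASV_b": [0, 1, 0, 0, 0], "ASV_c": [0, 0, 1, 1, 1]},
    index=["01", "02", "03", "04", "05"],
)

# Matching metadata, one row per sample (column names are assumptions).
meta = pd.DataFrame(
    {
        "Site": ["Texas"] * 5,
        "Week": ["April-1", "April-1", "April-1", "May-20", "May-20"],
    },
    index=otu.index,
)

# Collapse to unique Site+Week: an ASV counts as present in a group
# if it was detected in ANY member sample (max over 0/1 == logical OR).
grouped = otu.groupby([meta["Site"], meta["Week"]]).max()
```

Here `grouped` ends up with one row per Site+Week combination, which is exactly the grouping of Samples 01-03 vs. 04-05 described above.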
Does that seem like a really dumb idea to others?
The motivation behind combining datasets is to address the questions of:
- (alpha diversity) Does richness change between Site and/or Week (i.e., do bats eat more or fewer things at different locations or dates)?
- (beta diversity) Is community composition associated with Site and/or Week (i.e., do bats eat different things depending on where and/or when they forage)?
The median number of sequence variants detected in a single guano sample is about 20 (though it varies anywhere from 1 to about 150). This dataset is therefore likely similar to a low-biomass microbial community. Unlike microbial communities, however, I don’t have the luxury of including relative abundance measures when calculating distances. Because I’m limited to presence/absence metrics to address these questions, because a given sample generally has few ASVs, and because I have over 2,500 samples in total, I basically have an OTU table filled with 0’s, sprinkled with a few 1’s just to keep me guessing.
Combining samples sequenced at the same Site+Week will hopefully reduce the sparseness of the OTU matrix substantially and help me home in on a signal.
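Concretely, with presence/absence data I’m thinking of something like Jaccard distances on the collapsed table. A minimal sketch - the matrix values and group labels here are invented purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical collapsed presence/absence matrix:
# rows = Site+Week groups, columns = ASVs.
X = np.array(
    [
        [1, 1, 1, 0],  # Texas / April-1
        [1, 0, 1, 0],  # Texas / May-20
        [0, 1, 0, 1],  # Maine / April-1
    ],
    dtype=bool,
)

# Jaccard dissimilarity: 1 - |shared ASVs| / |ASVs present in either group|
D = squareform(pdist(X, metric="jaccard"))
```

The first two rows share 2 of 3 total ASVs, so their distance is 1 - 2/3 ≈ 0.33, while the Maine group shares nothing with Texas/May-20 and sits at the maximum of 1.0.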
Another option is to collapse taxonomic information by shared groups - say, merge any ASVs that share the same taxonomic assignment down to the Genus level. This would further collapse my OTU table, just in the other dimension. That’s perhaps a totally different topic, but I’ve seen it done before within the arthropod diet community, so it’s not without precedent. I understand it comes with its own caveats and warnings, but I was curious whether other microbiome users have done this, and what advice/critiques they might offer.
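I picture that taxonomic merge as the transpose of the Site+Week collapse: group ASV columns by their Genus assignment and OR them together. Again just a sketch - the ASV names and genus labels are made up:

```python
import pandas as pd

# Toy presence/absence table (rows = samples) and a made-up ASV -> Genus map.
otu = pd.DataFrame(
    {"ASV_1": [1, 0], "ASV_2": [0, 1], "ASV_3": [1, 1]},
    index=["01", "02"],
)
genus = pd.Series({"ASV_1": "Aedes", "ASV_2": "Aedes", "ASV_3": "Culex"})

# Merge ASVs sharing a Genus: a genus counts as present in a sample
# if any of its member ASVs is present.
by_genus = otu.T.groupby(genus).max().T
```

Both ASV_1 and ASV_2 collapse into a single Aedes column that is present in both samples, so the table shrinks along the taxonomy dimension rather than the sample dimension.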
Looking forward to your responses!