I have a few follow-ups and am working on some visualizations to share this morning. Hopefully we can continue this conversation as the day unfolds if people are interested.
In my earlier question about read depth filtering, I posted a per-library plot suggesting that, in several libraries, sample read depth followed a (fairly) bimodal distribution. I was curious what that looked like after DADA2 filtering, so the following plot shows a comparison. Apologies if comparing the two figures is awkward: the earlier post didn't contain every library, while this one contains all 12 runs:
My takeaway is that DADA2 performed as expected. About 100 of the initial 4,600 samples were dropped entirely, as these contained low numbers of total sequences per sample. They were likely dropped either because (a) they had too few reads to even be considered by DADA2 and were discarded on the front end, or (b) they had sufficient read depth, but the reads were low quality or chimeric and were tossed on the back end of the pipeline.
What's clear is that the bimodal distributions of some libraries remain, though perhaps they're not as pronounced now.
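For anyone wanting to run the same before/after check, here is a minimal sketch of the comparison, not the actual code behind the plot. It assumes you've already exported per-sample read totals before and after denoising into two plain dicts; the sample IDs, counts, and function names are all made up for illustration.

```python
# Hypothetical per-sample read depths (sample ID -> total reads).
# These values are invented; they are not taken from the real dataset.
depth_before = {"s1": 12000, "s2": 35, "s3": 8000, "s4": 1500, "s5": 4}
depth_after = {"s1": 10100, "s3": 6900, "s4": 0}


def dropped_samples(before, after):
    """Return samples that are missing, or have zero reads, after filtering."""
    return sorted(s for s in before if after.get(s, 0) == 0)


def retained_fraction(before, after):
    """Return the fraction of reads kept for each sample that survived."""
    return {
        s: after[s] / before[s]
        for s in before
        if before[s] > 0 and after.get(s, 0) > 0
    }


dropped = dropped_samples(depth_before, depth_after)
retained = retained_fraction(depth_before, depth_after)

print("dropped:", dropped)
for sample, frac in sorted(retained.items()):
    print(f"{sample}: kept {frac:.1%} of reads")
```

Note that this only tells you which samples vanished, not whether they were lost on the front end (too few reads) or the back end (low quality or chimeric); separating those two cases would need the per-step read counts from the denoising stats.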
So what gives!? @Nicholas_Bokulich asked earlier about whether my samples were split in terms of biomass, and the short answer is no, but the long answer is I have no idea. Why? Because I'm dealing with bat poop, so I have no really good way of determining the arthropod biomass in a given bat turd. It probably varies, but to what extent, I haven't the slightest clue. There are decades of previous work in which people used tweezers and microscopes to dissect out the arthropod parts, and folks have reported sample biomass (or volume) for those samples, but I'm not sure how variable it is. Nor do I think the reported biomass has much bearing on what gets amplified - that's why we're doing the molecular stuff, right?
The biomass question led me to wonder what the read depth distributions look like at the sequence variant level, rather than the per-sample level.
The following plot represents the entire dataset aggregated into one pool (though remember, all samples were sequenced across 12 different runs). It's summarizing the read depth per ASV rather than per sample. And boy, do I have a lot of infrequently observed ASVs.
So the x-axis represents the number of reads attributed to a particular ASV, summed across all samples in the entire dataset. This is to say the vast majority of my ASVs show up fewer than 1,000 times. As there are about 4,500 samples that passed DADA2 filtering, that's pretty amazing to me.
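To make the per-ASV summary concrete, here's a small sketch of the calculation: sum each ASV's reads across every sample, then ask what fraction of ASVs fall under a threshold. The toy feature table, the ASV names, and the helper functions are hypothetical, and this is only an illustration of the idea rather than the code I used for the plot.

```python
from collections import Counter

# Hypothetical toy feature table: sample ID -> {ASV ID -> read count}.
# The numbers are made up for illustration only.
feature_table = {
    "sample_A": {"asv_1": 5200, "asv_2": 40, "asv_3": 3},
    "sample_B": {"asv_1": 4100, "asv_2": 0, "asv_4": 12},
    "sample_C": {"asv_1": 900, "asv_5": 950, "asv_4": 7},
}


def reads_per_asv(table):
    """Sum read counts for each ASV across all samples."""
    totals = Counter()
    for counts in table.values():
        for asv, n in counts.items():
            totals[asv] += n
    return totals


def fraction_below(totals, threshold):
    """Return the fraction of ASVs whose total read count is below threshold."""
    if not totals:
        return 0.0
    rare = sum(1 for n in totals.values() if n < threshold)
    return rare / len(totals)


asv_totals = reads_per_asv(feature_table)
rare_fraction = fraction_below(asv_totals, 1000)

print(dict(asv_totals))
print(f"{rare_fraction:.0%} of ASVs have fewer than 1,000 reads in total")
```

The 1,000-read threshold here just mirrors the number mentioned above; it isn't a recommended filtering cutoff.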
I wonder what folks see in their microbial datasets? How rare is rare?
Okay, now last plot for this post. Same as the above per-ASV plot, but subsetted (faceted?) by sequencing run.
I like this, in that it shows me that rare ASVs aren't creeping into my dataset as a function of some wonky MiSeq run or two. Oh, and the run for library 8.1 sucked.
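The per-run version is the same sum, just grouped by sequencing run first. Here's a rough sketch, assuming you have a sample-to-run mapping (from sample metadata, say); the run labels, table contents, and function names are placeholders I made up, not my real libraries.

```python
from collections import defaultdict

# Hypothetical toy inputs; none of these values come from the real dataset.
feature_table = {
    "sample_A": {"asv_1": 5200, "asv_2": 40, "asv_3": 3},
    "sample_B": {"asv_1": 4100, "asv_4": 12},
    "sample_C": {"asv_1": 900, "asv_5": 950, "asv_4": 7},
}
sample_to_run = {"sample_A": "run_1", "sample_B": "run_1", "sample_C": "run_2"}


def reads_per_asv_by_run(table, run_of):
    """Sum read counts per ASV within each sequencing run."""
    per_run = defaultdict(lambda: defaultdict(int))
    for sample, counts in table.items():
        run = run_of[sample]
        for asv, n in counts.items():
            if n > 0:  # skip ASVs that were not observed in this sample
                per_run[run][asv] += n
    return per_run


def rare_fraction_by_run(per_run, threshold):
    """For each run, return the fraction of observed ASVs below threshold."""
    return {
        run: sum(1 for n in totals.values() if n < threshold) / len(totals)
        for run, totals in per_run.items()
    }


per_run_totals = reads_per_asv_by_run(feature_table, sample_to_run)
rare_by_run = rare_fraction_by_run(per_run_totals, 1000)

for run, frac in sorted(rare_by_run.items()):
    print(f"{run}: {frac:.0%} of ASVs below 1,000 reads")
```

If one run showed a much higher rare fraction than the rest, that would point to a run-specific problem rather than a property of the whole dataset.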
So, at this point I haven't addressed anything that @jwdebelius mentioned/suggested regarding the checkerboardness. Nor have I addressed anything that @Mehrbod_Estaki wrote about the original point of this thread: private ASVs.
Nevertheless, I think these plots begin to get at my overall motivation, which is trying to figure out what to keep in a dataset. The point of looking at private ASVs is also to inform filtering parameters.
More to come. Excited to hear any comments you might have.