I've been looking into how the final count tables differ depending on which samples are processed together with DADA2. From my analysis, the final ASV and genus count tables change depending on the set of samples processed together.
Within each analysis, all samples come from the same batch (they were processed and sequenced together), since DADA2 should be run independently on each batch.
The workflow consisted of the following steps:
- Randomly generate processing groups consisting of 2-25 samples.
- Process these groups separately with the following steps:
  - DADA2 -> closed-reference OTU clustering against SILVA 132 -> taxonomy classification with a naive Bayes classifier
- TSS normalize each sample
- For each sample, calculate the Bray-Curtis distance between all pairwise combinations of processing groups it was in.
- Plot Bray-Curtis distances between each processing group the sample was in.
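To make the comparison concrete, here is a minimal Python sketch of the group generation, TSS normalization, and within-sample Bray-Curtis steps. It is not the code I ran; the function names (`make_groups`, `tss`, `bray_curtis`, `within_sample_distances`) and the toy counts are hypothetical, and it assumes each run of a sample is available as a feature-to-count mapping. For the ASV table, features should be keyed by the ASV sequence itself so that the same ASV matches across runs.

```python
import random
from itertools import combinations


def make_groups(sample_ids, n_groups, min_size=2, max_size=25, seed=0):
    """Draw random processing groups of min_size..max_size samples each."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        size = rng.randint(min_size, min(max_size, len(sample_ids)))
        groups.append(sorted(rng.sample(sample_ids, size)))
    return groups


def tss(counts):
    """Total-sum scaling: divide each feature count by the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: value / total for feature, value in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles (dicts)."""
    features = set(p) | set(q)
    num = sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)
    den = sum(p.get(f, 0.0) + q.get(f, 0.0) for f in features)
    return num / den if den else 0.0


def within_sample_distances(counts_by_group):
    """Pairwise distances for ONE sample across the groups it was run in.

    counts_by_group maps a processing-group id to that sample's
    {feature: count} profile from that run.
    """
    normed = {g: tss(c) for g, c in counts_by_group.items()}
    return {
        (a, b): bray_curtis(normed[a], normed[b])
        for a, b in combinations(sorted(normed), 2)
    }


# Toy example: the same sample denoised in two different groups.
runs = {
    "g1": {"ASV1": 90, "ASV2": 10},
    "g2": {"ASV1": 80, "ASV2": 10, "ASV3": 10},
}
distances = within_sample_distances(runs)
```

Each sample then contributes one set of pairwise distances, which is what goes into one box of the boxplot.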
I used this workflow to look at two different batches of samples independently (samples from different batches were not mixed). In both batches, each sample was processed around 10 times, each time with a different collection of other samples.
The following boxplots show, for each sample, the Bray-Curtis distances between the versions of that sample produced by different processing groups. I looked at the distances for both the ASV table and the genus table.
For batch 1, it is clear that the final counts for some samples were quite different depending on the collection of samples they were processed with. For batch 2, the effect was less drastic, with only 2-3 samples displaying differences between processing groups.
Furthermore, the differences are a lot more prominent at the ASV level than at the genus level. This makes sense to me, since at higher taxonomic ranks a lot of those finer details get collapsed into the same groups.
My current explanation for this effect is that DADA2 learns the error rates only from the first few samples, stopping once it has read about 1e8 bases. Each processing group likely uses different samples to learn the error rates, which changes the error model and therefore the denoised ASVs that DADA2 produces.
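As a rough illustration of that hypothesis, the sketch below shows how the set of samples that feed the error model can change with group membership. It assumes that samples are consumed in the order given until the running base total reaches the threshold. The function name `samples_used_for_error_learning` and the per-sample base counts are made up for illustration and are not taken from my data or from DADA2's code.

```python
def samples_used_for_error_learning(bases_per_sample, order, nbases=1e8):
    """Return the leading samples read until the base total reaches nbases.

    Assumption: samples are read in the given order and reading stops as
    soon as the cumulative number of bases reaches the threshold.
    """
    used, total = [], 0
    for sample in order:
        used.append(sample)
        total += bases_per_sample[sample]
        if total >= nbases:
            break
    return used


# Hypothetical base counts per sample (not real data).
bases = {"A": 6e7, "B": 5e7, "C": 4e7, "D": 3e7}

# Sample "A" appears in both groups, but the error model is learned
# from a different set of samples in each one.
group_1 = samples_used_for_error_learning(bases, ["A", "B", "C"])
group_2 = samples_used_for_error_learning(bases, ["C", "D", "A"])
```

Under these assumptions, group 1 learns its error rates from A and B only, while group 2 uses C, D, and A, so sample A would be denoised against two different error models.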
Any insights or suggestions regarding these results are greatly appreciated!