I was asked to help analyse an existing microbiome project; however, the sequencing was done in an unusual way. Rather than running all samples on a single MiSeq run, they were split across multiple runs. The split was not by sample (different samples on different runs); instead, every sample was sequenced on each of three spike-in runs. So for each sample, I have reads from three different runs.
I was planning to analyse the data using DADA2 as usual, but with three runs per sample I'm not so sure anymore. I know that separate runs should be processed separately in DADA2, because the error model is learned per run. I know how to merge runs when each sample belongs to only one run, but I have no idea how to merge samples that are themselves split across runs.
Is there a way to save this project? Any suggestions on how to analyse the data?
Yeah, you can still analyse it. As I understand it, this is the same library sequenced 3 times. You can process all three runs with DADA2 and then either:
a. Assign taxonomy and choose the run with the best results based on spike-in ratios, or
b. Merge the output files, summing the reads per sample into one feature table (all DADA2 parameters should be identical in that case). That way each sample will contain the features from all three replicate runs.
PS. Did they add DNA from some bacteria as a spike-in?
Sorry for the late reply; I got distracted with a different project.
Yes, that's correct. One library consisting of 24 samples, which was sequenced 3 times as a "spike in". I think there was a small misunderstanding, though. "Spike in" is the term the sequencing service used: they do a normal high-diversity MiSeq run and spike a small amount of the low-diversity 16S library into it. Unfortunately, a single run like that does not give us the number of reads we wanted per sample, hence the three runs.
I don't think option a will work for us, since we prefer having more reads per sample.
Option b sounds interesting though. So I would run DADA2 separately on each sequencing run and then merge the output files? Would that sum the read counts for each ASV within each sample? I've merged separate runs before, but only ones where each sample was unique to one of the runs. I've never actually merged split samples.
No DNA was added to the libraries. As I mentioned above, "spike in" referred to spiking the final library into another MiSeq run. I think that is OK, but I'm not sure why they split the same library across three runs, and this was not communicated properly to us beforehand.
Thank you for the clarification! That makes sense now.
That's right. You can denoise all three datasets separately with the same parameters and then merge the outputs.
At the feature table merging step, the default behaviour is to raise an error when the same sample appears in more than one table. Providing 'sum' for --p-overlap-method overrides this: all features from the duplicated samples are pooled into one sample, and the counts of identical features within that sample are summed. So in the end each sample will have more reads than it had in any single run.
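To make the 'sum' behaviour concrete, here is a minimal sketch in plain Python of what summing overlapping samples amounts to. It is only an illustration, not how QIIME 2 implements the merge: the function name `merge_tables_sum`, the nested-dict table layout, and the toy sample/ASV names and counts are all made up for this example. It also assumes that identical ASVs carry the same feature ID in every run, which is why the denoising parameters need to match.

```python
from collections import defaultdict


def merge_tables_sum(tables):
    """Merge per-run tables, summing counts for matching (sample, feature) pairs.

    Each table is a dict: {sample_id: {feature_id: read_count}}.
    Hypothetical helper for illustration only.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample_id, features in table.items():
            for feature_id, count in features.items():
                # Same sample + same feature seen in another run: add the counts.
                merged[sample_id][feature_id] += count
    # Convert back to plain dicts for easier inspection.
    return {sample: dict(features) for sample, features in merged.items()}


# Toy data: the same two samples sequenced on three runs (made-up counts).
run1 = {"sample1": {"ASV_A": 10, "ASV_B": 5}, "sample2": {"ASV_A": 3}}
run2 = {"sample1": {"ASV_A": 7}, "sample2": {"ASV_A": 2, "ASV_C": 4}}
run3 = {"sample1": {"ASV_B": 1, "ASV_C": 2}, "sample2": {"ASV_C": 6}}

merged = merge_tables_sum([run1, run2, run3])

for sample_id in sorted(merged):
    total = sum(merged[sample_id].values())
    print(sample_id, merged[sample_id], "total reads:", total)
```

Each merged sample ends up with the union of the features seen in any run, and its total read count is the sum of its per-run totals.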