How to Analyze Multi-Run Sequencing Data in QIIME2 with Minimal Batch Effects

Hi QIIME2 experts,

I have datasets from three sequencing runs:

  • One from an Illumina MiSeq machine
  • One from an Illumina NextSeq machine
  • One from an older version of the Illumina MiSeq machine

All datasets consist of paired-end 150 bp reads.

I have three questions about analyzing these data in QIIME2:

  1. Can these datasets be used together even though they come from different sequencing machines?
  2. Can I analyze these datasets using DADA2 or Deblur in QIIME2?
  3. How can I minimize batch effects when analyzing these three sequencing runs together?

I'd appreciate any insights or best practices for handling potential batch effects during preprocessing and downstream analyses.

Thank you!

Best regards,
Brandon

Hello!

Short answer:
Yes!

Long answer:

Yes, sometimes we need to analyze such data!

I hope that you are working with the V4 region and not a longer one; otherwise, merging the reads may be difficult, regardless of the machine or technology used.

Yes, they can be analyzed with DADA2 or Deblur!

You can use either one, whichever you prefer.

You need to denoise each sequencing run separately, so that DADA2 learns an error model for each run, but use the same settings for all runs so that different settings do not introduce bias.

My approach would be:

  1. Import each run separately.
  2. Remove primers (the same primers, right?).
  3. Run DADA2 on each run with the same settings (try to find parameters that work well for all three).
  4. Merge the representative sequences and feature tables.
  5. Keep the run info in the metadata file so you can trace the batch effect or account for it in statistical analyses.
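
Steps 3 and 4 above could be sketched roughly as follows. This is only an illustration that builds and prints the QIIME2 commands without running them. The run names, the `.qza` file names, and the truncation values are assumptions made up for the example (0 means "no truncation" and is a placeholder, not a recommendation), so choose your own values from the quality plots.

```python
from typing import Dict, List

# Hypothetical run names; each run is assumed to have a primer-trimmed,
# demultiplexed artifact named "<run>_trimmed.qza".
RUNS = ["miseq", "nextseq", "miseq_old"]

# Placeholder settings shared by ALL runs (0 = no truncation).
# These values are assumptions for illustration only.
DADA2_SETTINGS = {"--p-trunc-len-f": "0", "--p-trunc-len-r": "0"}


def denoise_command(run: str, settings: Dict[str, str]) -> List[str]:
    """Build one 'qiime dada2 denoise-paired' command for a single run."""
    cmd = [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", f"{run}_trimmed.qza",
    ]
    # The same settings dictionary is applied to every run.
    for flag, value in settings.items():
        cmd += [flag, value]
    cmd += [
        "--o-table", f"{run}_table.qza",
        "--o-representative-sequences", f"{run}_rep_seqs.qza",
        "--o-denoising-stats", f"{run}_stats.qza",
    ]
    return cmd


def merge_commands(runs: List[str]) -> List[List[str]]:
    """Build the commands that merge per-run tables and sequences."""
    merge_tables = ["qiime", "feature-table", "merge"]
    merge_seqs = ["qiime", "feature-table", "merge-seqs"]
    for run in runs:
        merge_tables += ["--i-tables", f"{run}_table.qza"]
        merge_seqs += ["--i-data", f"{run}_rep_seqs.qza"]
    merge_tables += ["--o-merged-table", "merged_table.qza"]
    merge_seqs += ["--o-merged-data", "merged_rep_seqs.qza"]
    return [merge_tables, merge_seqs]


def build_pipeline(runs: List[str], settings: Dict[str, str]) -> List[List[str]]:
    """Per-run denoising commands first, then the two merge commands."""
    return [denoise_command(run, settings) for run in runs] + merge_commands(runs)


if __name__ == "__main__":
    for command in build_pipeline(RUNS, DADA2_SETTINGS):
        print(" ".join(command))
```

Keeping the settings in one shared dictionary makes it hard to accidentally denoise one run with different parameters than the others.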

Hope that helps.

Best,

Thank you, @timanix for the quick and incredibly helpful response!

Yes, I'm working with the V4 region, not the longer ones. I really appreciate your clarification on minimizing technical batch effects, especially when dealing with different sequencing runs and platforms. Your suggestions are extremely valuable.

One more question to better understand your recommendations: Since technical batch effects are common with different sequencing techniques, and your approach helps to minimize them, do you know of any publications using the methods you mentioned that merge datasets from different sequencing machine models? I would like to reference them in the Methods section of my future publication.

Again, thank you so much for your assistance!
Best,
Brandon

Unfortunately, I don't have a paper in mind that I can point you to.

I am not aware of any paper that tested or used this approach.

I have used this approach myself in two projects, but both are still at the writing stage.

All the recommendations I listed come from what I read earlier on this forum, as recommended by other mods and QIIME2 / DADA2 developers.

@q2-mods, please share any papers you have on this.

Best,
Timur

Thank you @timanix, so helpful!!! I am looking forward to the references from the @q2-mods.
Best,
Brandon

Hi @Brandon
Following on the great answer from @timanix.
I would say the main point is to find out whether there is a batch effect in your data due to the different runs (well... we can assume there will be) by tracking which run each sample came from in the sample metadata. You can then handle it with statistics that answer your biological question while taking into account that it may be masked by the batch effect.
As for papers, I was curious what is out there and found this nice review:

It includes all the analysis steps and a full paragraph on batch effects and the statistical methods to manage them.
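
To illustrate the run-tracking idea, here is a minimal sketch that counts samples per run and biological group from a metadata table and flags runs that contain only one group, since in those runs the batch cannot be separated from the biology. The `run` and `group` column names and all sample values are made up for the example; only the `sample-id` header follows the usual QIIME2 metadata convention.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical sample metadata (tab-separated). All values are invented.
METADATA_TSV = (
    "sample-id\trun\tgroup\n"
    "S1\tmiseq\tcontrol\n"
    "S2\tmiseq\ttreated\n"
    "S3\tnextseq\tcontrol\n"
    "S4\tnextseq\ttreated\n"
    "S5\tmiseq_old\tcontrol\n"
    "S6\tmiseq_old\tcontrol\n"
)


def run_by_group_counts(tsv_text: str, run_col: str = "run",
                        group_col: str = "group") -> Counter:
    """Count samples for each (run, group) combination."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter((row[run_col], row[group_col]) for row in reader)


def single_group_runs(counts: Dict[Tuple[str, str], int]) -> List[str]:
    """Return runs whose samples all belong to one biological group."""
    groups_per_run: Dict[str, Set[str]] = {}
    for (run, group) in counts:
        groups_per_run.setdefault(run, set()).add(group)
    return sorted(run for run, groups in groups_per_run.items() if len(groups) == 1)


if __name__ == "__main__":
    counts = run_by_group_counts(METADATA_TSV)
    for (run, group), n in sorted(counts.items()):
        print(run, group, n)
    print("Runs with a single group:", single_group_runs(counts))
```

If a run shows up in that last list, including the run as a factor in the statistics will not fully untangle it from the biological variable, so it is worth knowing before choosing a method.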
Cheers
Luca

Thank you, @llenzi. So helpful!!!
Best,
Brandon