Selecting and merging .gz files from multiple sequencing runs


My question, in a nutshell, is: when merging data from two or more sequencing runs, or analyzing only a subset of samples from a run, is it a problem to manually select the fastq files of interest, put them in one folder, and then process them with QIIME2?

I am working on some 16S rRNA V4 region amplicon datasets sequenced on an Illumina MiSeq, using QIIME2 in a conda environment.

I have raw fastq files from two separate MiSeq runs with overlapping barcodes, and I want to combine these samples into one dataset for analysis.

Several days of searching led me to the FMT tutorial, which features an example of processing two sequencing runs.
It processes the data from each run separately through the denoising step and merges them afterwards.

I was wondering, will it be a problem if I just put the raw sequencing files (in .gz format) of my samples of interest in one folder and process them in QIIME2 as if they were from a single sequencing run?

Also, if I am interested in only a few samples from a sequencing run, will it be necessary to follow the filter-samples method as stated here, instead of manually picking .gz files?

This is my first time dealing with metagenome data and my question might be too basic,
but any insight or comments would be of great help.
Thank you.

Good morning EJ,

Welcome to the forums! :qiime2:

Good question. Samples are split per run because DADA2 is being used for denoising, and DADA2 builds an error profile based on a single Illumina run.

You could combine these runs before denoising, but then how would you detect per-run batch effects? If you process each run separately, you could measure (and maybe even control!) these batch effects in downstream analysis.

Manually picking the .gz files for a few samples works, but you would need to denoise each cohort of samples separately. If you denoise each run first and then combine, you can easily pull multiple cohorts of samples out of that one merged table. Both methods should work, but if you have 2+ cohorts, then the recommended method (denoise per run, merge, then filter-samples) should be easier.
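If you do go the manual-selection route, one way to keep the runs apart is to write a separate manifest per run instead of copying files into one folder. Here is a minimal Python sketch of that idea. It is not part of QIIME 2: `build_manifest` is a hypothetical helper, and it assumes your files follow the usual Illumina naming (`<sample>_..._R1_001.fastq.gz` and `..._R2_001.fastq.gz`). The header matches QIIME 2's `PairedEndFastqManifestPhred33V2` format.

```python
from pathlib import Path

# Column names of QIIME 2's PairedEndFastqManifestPhred33V2 format.
HEADER = ("sample-id", "forward-absolute-filepath", "reverse-absolute-filepath")


def build_manifest(run_dir, sample_ids, out_path):
    """Write a paired-end manifest for the chosen samples of ONE run.

    Hypothetical helper: assumes Illumina-style file names such as
    <sample>_S1_L001_R1_001.fastq.gz inside run_dir.
    """
    run_dir = Path(run_dir).resolve()
    rows = []
    for sample in sample_ids:
        fwd = sorted(run_dir.glob(f"{sample}_*_R1_001.fastq.gz"))
        rev = sorted(run_dir.glob(f"{sample}_*_R2_001.fastq.gz"))
        # Stop early if a sample is missing or matches more than one file.
        if len(fwd) != 1 or len(rev) != 1:
            raise FileNotFoundError(
                f"expected one R1 and one R2 file for {sample!r} in {run_dir}"
            )
        rows.append((sample, str(fwd[0]), str(rev[0])))

    # Tab-separated file: header line, then one line per sample.
    with open(out_path, "w") as handle:
        handle.write("\t".join(HEADER) + "\n")
        for row in rows:
            handle.write("\t".join(row) + "\n")
    return rows
```

You would call this once per run (for example `build_manifest("run1/", ["A", "B"], "run1-manifest.tsv")`, with your own folder and sample names), import each manifest on its own, and denoise each import separately.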

Let us know if you have more questions about how to do this.
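In case it helps, here is a toy Python sketch of the "denoise each run, then combine" logic. It only illustrates the bookkeeping and is not how QIIME 2 implements it: in QIIME 2 the merge itself is done with `qiime feature-table merge` (plus `merge-seqs` for the representative sequences) and the subsetting with `qiime feature-table filter-samples`. The function names and the nested-dict table layout below are made up for the example.

```python
def merge_run_tables(tables_by_run):
    """Merge per-run count tables and remember which run each sample came from.

    tables_by_run maps a run name to {sample_id: {feature_id: count}}.
    Returns (merged_table, run_of_sample).
    """
    merged = {}
    run_of_sample = {}
    for run, table in tables_by_run.items():
        for sample, counts in table.items():
            if sample in merged:
                # Same sample ID in two runs: refuse rather than guess.
                raise ValueError(
                    f"sample {sample!r} appears in both "
                    f"{run_of_sample[sample]!r} and {run!r}"
                )
            merged[sample] = dict(counts)
            run_of_sample[sample] = run
    return merged, run_of_sample


def select_cohort(merged, sample_ids):
    """Pull one cohort of samples out of the merged table."""
    return {sample: merged[sample] for sample in sample_ids}
```

The `run_of_sample` mapping is the useful part: it corresponds to a "run" column in your sample metadata, which is what lets you test for a batch effect after merging. Since your two runs share barcodes, it is worth checking that the sample IDs themselves are unique across runs before merging.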

Thank you, @colinbrislawn !
This was exactly what I was curious about.

So processing each run separately would allow me to check whether batch effects exist in my data.
I found some old sequencing data I needed, but it was analyzed with pre-combined runs, and I was not sure whether these data were interpreted adequately.
I guess my next step should be to analyze the raw data from each run separately, see whether there is a batch effect, and, if needed, find a way to remove it.

I'll try checking, testing, and finding a proper way to merge my data.
Again, thanks a lot for your great support!
