How to deal with multiple runs for one sample?

Ruitao_Liu · February 13, 2025, 6:36pm

Hi,
I'm encountering an issue when importing FASTQ files downloaded from NCBI into QIIME2. For some samples, there are multiple runs available. When I try to import these files using a manifest file, I receive an error about duplicated sample names.

I'm wondering what the best approach is for handling these cases. Should I:

Concatenate the multiple runs for each sample into a single FASTQ file,
Select only the largest FASTQ file among the runs, or
Use another method to deal with this issue?

Thanks!
Ruitao

gregcaporaso · February 17, 2025, 5:42pm

Hi @Ruitao_Liu,
If you want to combine the sequence data from the two runs, the path that probably makes the most sense here is to import the runs separately, get them through DADA2 to feature tables, and then merge the feature tables using qiime feature-table merge --p-overlap-method sum. That will ensure that DADA2 works as expected (it expects to receive data from only a single run at a time), and the per-sample vectors in the resulting feature tables will be added together during merge to create your final feature table. This workflow is roughly outlined here, though in that case the two runs didn't have overlapping samples so the merge command did not include the --p-overlap-method sum parameter.
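
A rough sketch of that workflow as shell commands, assuming paired-end reads and two runs. The run labels, file names, truncation lengths, and the DRY_RUN helper are placeholders for illustration and do not come from this thread. By default the script only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: import and denoise each sequencing run on its own, then merge once.
# Assumed (hypothetical) inputs: manifest-run1.tsv and manifest-run2.tsv.
set -e

# Placeholder truncation lengths; choose real values from your quality plots.
# They are kept identical across runs here so features stay comparable.
TRUNC_F=240
TRUNC_R=200

# Print each command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

process_runs() {
  merge_args=""
  for r in "$@"; do
    run qiime tools import \
      --type 'SampleData[PairedEndSequencesWithQuality]' \
      --input-format PairedEndFastqManifestPhred33V2 \
      --input-path "manifest-$r.tsv" \
      --output-path "demux-$r.qza"
    run qiime dada2 denoise-paired \
      --i-demultiplexed-seqs "demux-$r.qza" \
      --p-trunc-len-f "$TRUNC_F" \
      --p-trunc-len-r "$TRUNC_R" \
      --o-table "table-$r.qza" \
      --o-representative-sequences "rep-seqs-$r.qza" \
      --o-denoising-stats "stats-$r.qza"
    merge_args="$merge_args --i-tables table-$r.qza"
  done
  # $merge_args is left unquoted on purpose so it splits into separate options.
  run qiime feature-table merge $merge_args \
    --p-overlap-method sum \
    --o-merged-table table-merged.qza
}

process_runs run1 run2
```

Setting DRY_RUN=0 would execute the commands instead of printing them. The representative sequences from each run can be combined in a similar way with qiime feature-table merge-seqs.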

It sounds like combining the sequence data is your goal, but you also might consider whether using the data from only one of the two runs makes more sense - for example, if the sample was resequenced because it had a low read count in one of the runs, it might make sense to only use the data from the second run for that sample. In that case, you'll still want to process the runs independently, but you could remove lines from the manifest file for the samples that you don't want to include.

Ruitao_Liu · February 17, 2025, 8:46pm

Hi @gregcaporaso,
Thank you for your response. I plan to try this method. When importing multiple runs for a single sample and creating the manifest file, how should I specify the sample names for each run that belongs to the same sample?

gregcaporaso · February 17, 2025, 9:04pm

Hi @Ruitao_Liu, You'll create one manifest file per sequencing run, and then import each into its own artifact. So, if you have n sequencing runs, you'll have n manifest files, and you'll run import and dada2 denoise-* n times each.
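
One way to build those per-run manifests is to split a single run table by run. The sketch below assumes a hypothetical tab-separated file, runs.tsv, with columns for sample ID, run, forward read path, and reverse read path (paired-end reads). The file name, column layout, and example paths are made up for illustration. The output header follows the paired-end V2 manifest format.

```shell
#!/bin/sh
# Sketch: split one run table into one QIIME 2 manifest per sequencing run.
set -e

# Work in a scratch directory so the demo files do not clutter anything.
cd "$(mktemp -d)"

# Hypothetical run table: the same sample-id appears once per run.
{
  printf 'sample-id\trun\tforward\treverse\n'
  printf 'sample1\trun1\t/data/run1/sample1_R1.fastq.gz\t/data/run1/sample1_R2.fastq.gz\n'
  printf 'sample2\trun1\t/data/run1/sample2_R1.fastq.gz\t/data/run1/sample2_R2.fastq.gz\n'
  printf 'sample1\trun2\t/data/run2/sample1_R1.fastq.gz\t/data/run2/sample1_R2.fastq.gz\n'
} > runs.tsv

split_manifests() {
  # $1 = run table (TSV with a header line); writes manifest-<run>.tsv files.
  awk -F '\t' 'NR > 1 {
    out = "manifest-" $2 ".tsv"
    if (!(out in seen)) {
      # First row for this run: start the file with the manifest header.
      print "sample-id\tforward-absolute-filepath\treverse-absolute-filepath" > out
      seen[out] = 1
    }
    print $1 "\t" $3 "\t" $4 > out
  }' "$1"
}

split_manifests runs.tsv
ls manifest-*.tsv
```

Because each manifest holds only one run, a sample ID such as sample1 appears at most once per file, which avoids the duplicated sample name error on import.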

Ruitao_Liu · February 17, 2025, 9:56pm

Thanks very much @gregcaporaso for your patience, I will try to create one manifest file per sequencing run.

colinbrislawn · February 18, 2025, 2:59am

Greg has already shared how to merge samples across runs, so I'll share how to not merge samples and why keeping technical replicates may be appealing.

Each sequencing run presents a new statistical sampling of the microbial composition of a sample, and it's great to get more data!

These new runs can also introduce batch effects from the labs processing the samples, or carry with them biological batch effects like changes over time. If you keep the sample names separate, like this:
sample1-run1
sample1-run2
sample1-run3
you can also add a new metadata category called run_number and use that in your statistical testing and modeling.
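
A sketch of how those run-specific sample names and the run_number column could be generated together. It assumes a hypothetical tab-separated table with columns sample-id, run, and location; the file names and values are invented for illustration. The run labels are kept as text (run1, run2) so that the column is read as categorical rather than numeric.

```shell
#!/bin/sh
# Sketch: build sample metadata with one row per sample-run combination.
set -e

# Work in a scratch directory so the demo files do not clutter anything.
cd "$(mktemp -d)"

# Hypothetical input table: sample1 was sequenced in two runs.
{
  printf 'sample-id\trun\tlocation\n'
  printf 'sample1\trun1\tgut\n'
  printf 'sample1\trun2\tgut\n'
  printf 'sample2\trun1\tskin\n'
} > samples.tsv

make_run_metadata() {
  # $1 = table with columns sample-id, run, location (TSV with a header line).
  # Writes metadata rows to stdout, with IDs of the form <sample>-<run>.
  awk -F '\t' '
    BEGIN  { print "sample-id\trun_number\tlocation" }
    NR > 1 { print $1 "-" $2 "\t" $2 "\t" $3 }
  ' "$1"
}

make_run_metadata samples.tsv > metadata.tsv
cat metadata.tsv
```

The same run-specific IDs would also need to be used in the manifest so that the feature table and the metadata agree.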

qiime diversity adonis --p-formula run_number+location ...

This formula would first partition out the variance attributable to run_number, then test for the variance attributable to location.

By controlling for sequencing run in this way, you can justify to reviewer 3 that your results are robust across sequencing runs!

I think this is just as much work (or more!) as merging / summing up samples. It can also help detect the presence and magnitude of batch effects which could otherwise cover up biological signals in your data set.

Thank you for bringing this question to the forums!

Ruitao_Liu · February 18, 2025, 2:22pm

Thanks @colinbrislawn, for showing another method to deal with multiple runs. Just to clarify, are you suggesting that I can import all sequencing files at once if I rename them to include a run identifier (for example, sampleX-runY), and then, in downstream analyses, add a new variable to capture the run number in my sample data?

colinbrislawn · February 18, 2025, 2:45pm

Yes, this is how it would work if you denoise your reads with Deblur.

It's possible to do this with DADA2, but it's not recommended!
When using DADA2, each sequencing run should be processed separately, so the workflow is the same as the one Greg described: import each run, denoise each run, and only then merge the feature tables.

In Greg's merge method, the matching sample names will be combined in the single feature table.
In my distinct method, the unique sample names will all be present in the single feature table.

Once you get to the downstream analysis, you would "add a new variable to capture the run number in my sample data".

Exactly!

Ruitao_Liu · February 18, 2025, 2:57pm

Thanks both of you very much! @colinbrislawn @gregcaporaso