Merge or discard sequences from duplicated samples across multiple runs?

Hi friends,

I have a 16S V4-V5 dataset (primers are 515F, 926R) that includes 4 different MiSeq runs with 368 total samples. Most of these samples appear in two of the four runs, but a couple samples appear in three or even all four runs. My sequencing center told me that they resequenced many of my samples to get more reads per sample.

After de-noising each of the four runs separately using DADA2, I can see that many of these samples are inadequate for downstream analysis, as expected. For example, 11 of the samples in Run1 contain no reads at all, and 50 of the samples contain fewer than 1,000 reads.

My question is what should I do with these duplicated samples moving forward?

The two possibilities I see now are:

  1. Choose the duplicate sample with the highest number of reads and discard the duplicate samples from the other runs. If I were to do this, is there a generally accepted threshold for the minimum number of sequences that a sample should have?

  2. Merge the reads across the duplicate samples (probably by summing them, not by averaging them), effectively treating them as composite samples.
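To make the two options concrete, here is a minimal sketch in plain Python. The run names, ASV IDs, and counts are made up for illustration, and the helper functions `keep_deepest` and `sum_replicates` are hypothetical names, not part of DADA2 or QIIME 2.

```python
from collections import Counter

# Hypothetical per-run feature counts for ONE sample (ASV ID -> reads).
replicates = {
    "Run1": Counter({"ASV_a": 300, "ASV_b": 150}),
    "Run2": Counter({"ASV_a": 2800, "ASV_b": 1900, "ASV_c": 400}),
}


def keep_deepest(reps):
    """Option 1: return (run, counts) for the replicate with the most reads."""
    run = max(reps, key=lambda r: sum(reps[r].values()))
    return run, reps[run]


def sum_replicates(reps):
    """Option 2: add counts feature by feature across all replicates."""
    total = Counter()
    for counts in reps.values():
        total.update(counts)
    return total


best_run, best_counts = keep_deepest(replicates)
merged = sum_replicates(replicates)

print("Option 1 keeps", best_run, "with", sum(best_counts.values()), "reads")
print("Option 2 composite has", sum(merged.values()), "reads")
```

Option 1 keeps a single run's profile untouched, while option 2 produces a composite that mixes reads from every run the sample appeared in.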

Hi @groot, :deciduous_tree:
Welcome to the forum!

I would drop the replicate(s) with the fewest sequences. Go through the separate runs manually and use filter-samples to remove them.
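If going through the runs by hand gets tedious, the selection can be scripted. This is only a sketch: it assumes a replicate carries the same sample ID in every run, the depths are invented, and `ids_to_drop` is a hypothetical helper. Ties are resolved in favor of the run seen first.

```python
# Hypothetical read depths: run -> {sample ID -> total reads after denoising}.
depths = {
    "Run1": {"S1": 0, "S2": 800, "S3": 6200},
    "Run2": {"S1": 7400, "S2": 5100},
    "Run3": {"S2": 9000},
}


def ids_to_drop(depths_by_run):
    """For each run, list the sample IDs that have a deeper replicate elsewhere."""
    best = {}  # sample ID -> (depth, run) of the deepest replicate seen so far
    for run, samples in depths_by_run.items():
        for sample, depth in samples.items():
            if sample not in best or depth > best[sample][0]:
                best[sample] = (depth, run)
    return {
        run: sorted(s for s in samples if best[s][1] != run)
        for run, samples in depths_by_run.items()
    }


drop = ids_to_drop(depths)
for run, ids in drop.items():
    # Print a one-column ID list per run, with a "sample-id" header line.
    print(f"# {run}")
    print("\n".join(["sample-id"] + ids))
```

Each per-run list could then be saved as a small metadata file and passed to `qiime feature-table filter-samples` through `--m-metadata-file` together with `--p-exclude-ids`, so the listed samples are removed from that run's table. Check the plugin help for your QIIME 2 version before relying on those options.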

As for a generally accepted minimum number of sequences: not really. It depends on your sample type and what kind of biological question you are trying to answer. See this thread for a slightly more in-depth answer.

I wouldn’t merge the two samples; the risk of combining batch effects from two separate runs is too great. Merging would likely give these samples an artificially unique signature that the unmerged samples don’t have.

Hey Bod,

Thanks for the help and resources. I agree with you about the batch effect concern.

My hesitation is coming from the fact that ~8% of samples could be saved from the inevitable rarefaction guillotine if I were to merge the duplicates by summing the sequences together…


However, I think I’ll be able to get more reads per sample by re-doing the de-noising step a little differently. Hopefully at that point I will no longer be tempted to sum sequences across duplicates. To be continued!

Thanks again

Ouch, :crossed_swords: :face_with_head_bandage:. That is unfortunate. How many reads are you currently getting? I ask because if it is an OK amount for these samples, then there’s actually quite a bit of analysis you can do that doesn’t require rarefaction. Though of course, you still want a decent number of reads to be representative of the overall community. I still think the risk is too great to merge some samples but not others, especially if they come from different runs, but I’d be curious what others think about this. @jwdebelius, @Nicholas_Bokulich?

I would definitely invest some effort into re-doing the de-noising step; that would be your best option. Feel free to start a new thread with questions specific to that if you want; there’s loads of discussion about it on the forum too.

Good luck!

Hi @Mehrbod_Estaki and @groot,

I’m generally against merging sequencing replicates, especially ones that you’ve resequenced. So, I’d pick the one with more sequences and move on! There may be a reason certain samples failed to amplify (bad extraction is a common one), so I’d just be careful. I would also reconsider your rarefaction threshold.

For high biomass samples, I like something around 5000 seqs/sample, which I know many people think is low, but :woman_shrugging:. The patterns hold at that depth, Bray-Curtis/Shannon are pretty stable, and I keep a fair number of samples for my rarefied analyses. In lower biomass environments, I might work at 2500 or 1250… depending on what I need. A rarefaction depth of 10,000 seqs/sample isn’t necessary, but if you’re below 1000, it’s a bad sample that should be thrown out whether you’re doing rarefaction-less analysis or not.
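A quick way to see what each of those depths would cost is to count the samples retained at each candidate threshold. The depths below are invented and `retained_at` is a hypothetical helper; only the thresholds (1000, 1250, 2500, 5000, 10,000) come from the paragraph above.

```python
# Hypothetical per-sample read depths after denoising.
depths = {"S1": 420, "S2": 1300, "S3": 2600, "S4": 5200, "S5": 12000, "S6": 0}

MIN_USABLE = 1000  # below this, the sample is treated as bad regardless of method


def retained_at(sample_depths, threshold):
    """Return the sample IDs with at least `threshold` reads."""
    return sorted(s for s, d in sample_depths.items() if d >= threshold)


# Samples to throw out whether or not the analysis uses rarefaction.
discard = sorted(s for s, d in depths.items() if d < MIN_USABLE)

# Number of samples that survive each candidate rarefaction depth.
summary = {t: len(retained_at(depths, t)) for t in (1250, 2500, 5000, 10000)}

print("discard:", discard)
for threshold, kept in summary.items():
    print(f"depth {threshold}: {kept} of {len(depths)} samples retained")
```

Running the same tally on the real per-sample depths shows how many samples each choice of rarefaction depth would sacrifice.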

It sounds like at 8% you’re losing a lot of samples, but it also depends on sample type. If they’re low or ultra-low biomass, I might expect a 30-50% failure rate. (Inevitably, there is an inverse relationship between the ease of collecting samples and their sequencing success.)

I’m with Bod on re-doing the de-noising, too! If that can give you a better solution and help you save reads, I think I’d look there.



Hi @Mehrbod_Estaki and @jwdebelius,
Thank you to both of you for taking the time to provide these useful guidelines moving forward.

It turns out that I had a big problem with failing to merge paired-end reads in the denoising step, causing me to lose a ton of sequences. I started over using just the forward reads and it fixed the problem. The filter-samples function also worked great.
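For anyone who hits the same merging problem: one possible cause (not confirmed for this dataset) is that the truncated forward and reverse reads no longer overlap enough. A rough sanity check is sketched below. The amplicon length and truncation lengths are hypothetical placeholders, the helper names are made up, and the only library fact used is that DADA2's `mergePairs` defaults to `minOverlap = 12`.

```python
DADA2_DEFAULT_MIN_OVERLAP = 12  # default minOverlap of DADA2's mergePairs


def expected_overlap(trunc_f, trunc_r, amplicon_len):
    """Bases shared by the forward and reverse reads after truncation."""
    return trunc_f + trunc_r - amplicon_len


def can_merge(trunc_f, trunc_r, amplicon_len,
              min_overlap=DADA2_DEFAULT_MIN_OVERLAP):
    """True if the truncated reads still overlap by at least min_overlap bases."""
    return expected_overlap(trunc_f, trunc_r, amplicon_len) >= min_overlap


# Hypothetical numbers only: substitute the real amplicon length for your
# primer pair and the truncation lengths actually used in the denoising step.
AMPLICON_LEN = 400

for trunc_f, trunc_r in [(240, 160), (250, 200)]:
    overlap = expected_overlap(trunc_f, trunc_r, AMPLICON_LEN)
    verdict = "ok" if can_merge(trunc_f, trunc_r, AMPLICON_LEN) else "too short"
    print(f"trunc {trunc_f}/{trunc_r}: overlap {overlap} -> {verdict}")
```

If the expected overlap comes out below the minimum, either truncate less aggressively or, as was done here, fall back to the forward reads only.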

Thank you!