I have a 16S V4-V5 dataset (515F/926R primers) spread across four MiSeq runs, with 368 samples in total. Most samples appear in two of the four runs, but a few appear in three or even all four; my sequencing center told me they resequenced many of my samples to increase per-sample read depth.
After denoising each of the four runs separately with DADA2, I can see that, as expected, many of these samples have too few reads for downstream analysis. For example, 11 of the samples in Run1 contain no reads at all, and 50 contain fewer than 1,000 reads.
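For context, this is roughly how I'm tallying per-sample read depth (a sketch, not my actual pipeline — the table and sample names here are made up, assuming each run's ASV table is loaded as a pandas DataFrame with samples as rows and ASVs as columns):

```python
import pandas as pd

# Hypothetical ASV count table for one run (rows = samples, columns = ASVs),
# e.g. a DADA2 sequence table exported to CSV and read back in.
run1 = pd.DataFrame(
    {"ASV1": [1500, 0, 20], "ASV2": [800, 0, 100]},
    index=["S1", "S2", "S3"],
)

# Total reads per sample after denoising.
depth = run1.sum(axis=1)

empty = depth[depth == 0].index      # samples with no reads at all
shallow = depth[depth < 1000].index  # samples below 1,000 reads
```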
My question is: what should I do with these duplicated samples moving forward?
The two possibilities I see are:

1. Keep only the duplicate with the highest read count for each sample and discard that sample's duplicates from the other runs. If I were to do this, is there a generally accepted threshold for the minimum number of reads a sample should have?
2. Merge the reads across the duplicate samples (probably by summing the counts rather than averaging them), effectively treating them as composite samples.
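To make the two options concrete, here is a minimal sketch of what I have in mind, assuming each run's ASV table is a pandas DataFrame (rows = samples, columns = ASVs) — the run names, sample IDs, and counts below are all invented for illustration:

```python
import pandas as pd

# Hypothetical per-run ASV tables; the same samples appear in both runs.
run1 = pd.DataFrame({"ASV1": [500, 0], "ASV2": [300, 10]}, index=["S1", "S2"])
run2 = pd.DataFrame({"ASV1": [1200, 40], "ASV3": [100, 5]}, index=["S1", "S2"])

# Stack the runs into one table with a (run, sample) MultiIndex; ASVs absent
# from a run get a count of 0.
long = pd.concat({"Run1": run1, "Run2": run2}, names=["run", "sample"])
long = long.fillna(0).astype(int)

# Option 1: for each sample, keep only the run with the highest read depth.
depth = long.sum(axis=1)
best_idx = depth.groupby(level="sample").idxmax()       # (run, sample) tuples
best = long.loc[best_idx.tolist()].droplevel("run")

# Option 2: sum counts across duplicates, treating them as composite samples.
summed = long.groupby(level="sample").sum()
```

For option 1, `best` keeps the S1 row from Run2 (1,300 reads there vs. 800 in Run1); for option 2, `summed` adds the two runs' counts ASV by ASV.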