Thank you for the great forum and the valuable discussions! I’ve searched extensively but couldn’t find definitive answers to some questions about my dataset. I’d greatly appreciate your guidance.
Background
I have 16S sequencing data from an experimental setup with summer and winter samples (from different sites). Each season's samples were sequenced twice, as the initial runs for both seasons had a high failure rate. Some samples succeeded in the first run but failed in the second and vice versa.
Questions
Pooling for Increased Depth:
Can I merge reads for the exact same samples sequenced in different runs to increase sequencing depth or coverage? If so, what is the best approach to do this while minimizing bias?
Accepting Samples from One Run Only:
Is it acceptable to use reads from only one run for samples that failed in the other? That would mean running dada2 with samples from different runs!
Minimum Read Count Threshold:
For gut microbiome samples, is there a recommended minimum read count threshold for inclusion in downstream analysis? Some of my samples have as few as 3,000–10,000 reads. Should I exclude these low-depth samples?
Avoiding Run-Related Bias:
Given that I have four sequencing runs (same machine, region, and read length), what additional steps should I take to minimize biases when:
Analyzing samples from different runs (if that is acceptable), or
Pooling reads for the same samples across runs?
Your insights would be invaluable as I start my analysis.
Thank you for providing a lot of details and sharing your considerations!
First, I will go through your questions:
Yes, you can.
To minimize the bias, it is better to run dada2 with the same (important) settings for all of your runs, and only then merge the feature tables. When merging, there is an option to sum the reads of samples whose IDs overlap between runs.
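For example, a minimal QIIME 2 CLI sketch (the per-run artifact names are placeholders for your own files):

```bash
# Merge per-run feature tables; counts for identical sample IDs are summed
qiime feature-table merge \
  --i-tables table-run1.qza table-run2.qza \
  --p-overlap-method sum \
  --o-merged-table merged-table.qza

# Merge the matching representative sequences from each run
qiime feature-table merge-seqs \
  --i-data rep-seqs-run1.qza rep-seqs-run2.qza \
  --o-merged-data merged-rep-seqs.qza
```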
Yes, it is. If you run dada2 first and only then merge the feature tables, it is not an issue.
You can plot alpha-rarefaction curves or just manually inspect the merged feature table visualization to choose a sequencing depth. For example, 3,000 reads may be enough if you would otherwise lose too many samples. If there are only a couple of samples with low depth and the rest have at least 5,000 reads (just an example), I would go for 5,000.
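Both checks can be done from the merged table; a sketch, assuming a metadata file named metadata.tsv (the max depth is just a placeholder):

```bash
# Interactive summary with per-sample read counts
qiime feature-table summarize \
  --i-table merged-table.qza \
  --m-sample-metadata-file metadata.tsv \
  --o-visualization merged-table.qzv

# Rarefaction curves to see where diversity plateaus
qiime diversity alpha-rarefaction \
  --i-table merged-table.qza \
  --p-max-depth 10000 \
  --m-metadata-file metadata.tsv \
  --o-visualization alpha-rarefaction.qzv
```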
There will still be some bias, so I would include the run info in the metadata (run1, run2, run1_run2 for merged samples). You can then check whether its effect is significant, and you can also include it in your formulas (in Adonis or ANOVA).
To summarize (a command-line sketch follows the list):
Import each run separately.
Remove primers with cutadapt, using the same settings for each run.
Denoise each run with dada2, using the same settings.
Add run info to the metadata file.
Merge the feature tables and representative-sequences files from the different runs, using the "sum" option for overlapping samples.
Proceed further as with samples from one run, but check the batch effect based on the metadata.
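Here is a per-run sketch of the import, primer-removal, and denoising steps, assuming paired-end data (primer sequences, truncation lengths, and file names are placeholders; repeat the block for each run with identical settings):

```bash
# Import one run from a manifest file
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest-run1.tsv \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path demux-run1.qza

# Remove primers (515F/806R shown only as an example; use your own)
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-run1.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --o-trimmed-sequences trimmed-run1.qza

# Denoise; keep truncation settings identical across all runs
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-run1.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --o-table table-run1.qza \
  --o-representative-sequences rep-seqs-run1.qza \
  --o-denoising-stats stats-run1.qza
```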
Hope I didn't miss anything. Please feel free (everyone) to add more instructions if there is something else that should be mentioned.
Thank you so much! I followed your suggestions and everything works fine. Just a follow-up question, please: where should I start when testing for batch effects? My metadata file now has a "run" column with the run info, so some samples come from a single run (A, B, or C) and others are merged from different runs (A_B_D and so on).
I would start with beta diversity metrics and check the Emperor visualizations. Ideally, you should see a stronger effect from other metadata categories (diet, sample type, group, etc.) than from the sequencing run, but that is not always the case. You can also include the sequencing-run column in the Adonis formula when checking the metadata factors.
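As a sketch, assuming a rooted phylogenetic tree is available and the metadata column is named run (the sampling depth and the season column are placeholder assumptions):

```bash
# Beta diversity + Emperor plots at a chosen rarefaction depth
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table merged-table.qza \
  --p-sampling-depth 5000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results

# PERMANOVA with the run column included alongside another factor
qiime diversity adonis \
  --i-distance-matrix core-metrics-results/bray_curtis_distance_matrix.qza \
  --m-metadata-file metadata.tsv \
  --p-formula "run+season" \
  --o-visualization adonis-run-season.qzv
```

Open the Emperor .qzv files and color the points by run, then compare with coloring by your biological factors; the Adonis output then quantifies how much variance the run term explains.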