Merging sample sets with drastically different frequencies

Hi Everyone,

I don't have a strong background in bioinformatics and could really use some input. I have samples from two different Illumina runs (MiSeq 2x250 and MiSeq 2x250 Nano) that I wish to compare. My main concern is the difference in feature frequency between the two runs (250,000 vs 5,000) after processing the samples with DADA2. I can see that many features in the larger set are low abundance and could be removed, but I could also just use a sampling depth appropriate for the low-frequency samples. I have not been able to find a best practice for this situation and am unsure which approach would give the more legitimate result.

Hello Nick,

Welcome to the forums!

This has been a big topic for the last few years.

If you have not found these already, here are two good starting points:
Why subsampling is (always!) bad: "Waste not, want not: why rarefying microbiome data is inadmissible" (PMC)
Why subsampling is (often!) fine: "Normalization and microbial differential abundance strategies depend upon data characteristics" (PMC)

There are more related resources, such as the inherently compositional paper and Pat Schloss's Let's Play videos, where he shows that none of the normalization methods works as well as rarefaction.
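
To make the two options from the question concrete, here is a rough sketch in plain Python of what each one does to a single sample: subsampling to a common depth (rarefying) versus dropping low-abundance features. The feature IDs, counts, and the `min_frequency` threshold are all made up for illustration, and the function names `rarefy` and `filter_low_abundance` are hypothetical helpers, not QIIME 2 functions.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample one sample's feature counts to `depth` reads, without replacement.

    Returns None when the sample has fewer than `depth` reads, which mirrors
    the usual practice of dropping samples that are too shallow.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the counts to one entry per read, then draw `depth` of them.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


def filter_low_abundance(counts, min_frequency):
    """Keep only the features observed at least `min_frequency` times."""
    return {feature: n for feature, n in counts.items() if n >= min_frequency}


# Hypothetical samples: one from the deep run, one from the Nano run.
deep = {"ASV_a": 150000, "ASV_b": 99000, "ASV_c": 990, "ASV_d": 10}
shallow = {"ASV_a": 3000, "ASV_b": 1990, "ASV_c": 10}

# Option 1: bring both samples to the depth of the shallow one.
depth = sum(shallow.values())
deep_rarefied = rarefy(deep, depth)
shallow_rarefied = rarefy(shallow, depth)

# Option 2: drop rare features from the deep sample (threshold is arbitrary).
deep_filtered = filter_low_abundance(deep, min_frequency=100)

print(sum(deep_rarefied.values()), sum(shallow_rarefied.values()))
print(sorted(deep_filtered))
```

Note that filtering rare features alone leaves the deep sample with far more reads than the shallow one, so it does not address the depth difference by itself. Within QIIME 2, the corresponding actions should be `qiime feature-table rarefy` (with `--p-sampling-depth`) and `qiime feature-table filter-features` (with `--p-min-frequency`); check the plugin help for the exact options in your version.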
