Merging denoised outputs

Nicholas_Bokulich · December 4, 2019, 4:42pm

Welcome back @Fabs!

Yes, indeed, this looks like a typical batch effect issue.

Note that batch effects will be most pronounced with Jaccard distance, since it measures the proportion of features that are NOT shared by each pair of samples. If you have even the most subtle differences between runs (say, the sequences are otherwise identical but 1 nt longer in one run), then no features will be shared between runs and the between-run Jaccard distances will all be 1.0.
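To make that concrete, here is a minimal sketch in plain Python. The sequences are made up for illustration, and `jaccard_distance` is a hypothetical helper, not a QIIME 2 function. It treats each sample as a set of ASV sequences and shows what happens when one run carries a single extra leading nucleotide.

```python
def jaccard_distance(features_a, features_b):
    """Proportion of features NOT shared between two samples (presence/absence)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # two empty samples: define the distance as 0
    return 1.0 - len(a & b) / len(union)


# Made-up ASVs observed in a sample from run 1.
run1_sample = {"ACGTACGTAC", "TTGACCGTAA", "GGCATCGATC"}

# The same organisms in run 2, but every sequence is 1 nt longer at the 5' end.
run2_sample = {"G" + seq for seq in run1_sample}

print(jaccard_distance(run1_sample, run1_sample))  # 0.0
print(jaccard_distance(run1_sample, run2_sample))  # 1.0
```

Because ASVs are compared as exact strings, a 1 nt offset means nothing matches, even though the underlying biology is identical.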

The solution? The best you can do is attempt to standardize the processing between the two runs as much as possible. It looks like you processed your runs in different ways, e.g., with different trim/trunc lengths in dada2. The trunc lengths might not matter in theory, since your paired-end reads should be overlapping, but who knows! The extra suspicious among us (a.k.a. those trying to merge multiple runs) may want to use a standard trunc length (the shorter of the two) to keep things absolutely the same. The trim lengths, however, should definitely be the same for both runs; see my note about 1 nt differences above to understand why.
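A toy model of why trim matters more than trunc, under some loud assumptions: this is not how dada2 works internally, `merged_asv` is a hypothetical function, the amplicon is made up, and the 12 nt minimum overlap is just an illustrative threshold. The idea is only that trimming removes bases from the start of each read (so it changes the ends of the merged sequence), while truncation only changes how much the two reads overlap in the middle.

```python
def merged_asv(amplicon, trim_left_f, trim_left_r, trunc_len_f, trunc_len_r,
               min_overlap=12):
    """Toy model: return the merged sequence for one amplicon, or None if the
    truncated reads no longer overlap enough to be joined."""
    n = len(amplicon)
    fwd_end = min(trunc_len_f, n)        # last amplicon position covered by the forward read
    rev_start = max(n - trunc_len_r, 0)  # first amplicon position covered by the reverse read
    if fwd_end - rev_start < min_overlap:
        return None
    # Trimming clips the outer ends of the merged sequence.
    return amplicon[trim_left_f:n - trim_left_r]


amplicon = "ACGT" * 10  # made-up 40 nt amplicon

run_a = merged_asv(amplicon, trim_left_f=0, trim_left_r=0, trunc_len_f=30, trunc_len_r=30)
run_b = merged_asv(amplicon, trim_left_f=0, trim_left_r=0, trunc_len_f=28, trunc_len_r=28)
run_c = merged_asv(amplicon, trim_left_f=1, trim_left_r=0, trunc_len_f=30, trunc_len_r=30)

print(run_a == run_b)  # different trunc lengths, same merged sequence
print(run_a == run_c)  # different trim length, different merged sequence
```

In this model the different trunc lengths give the same merged sequence as long as the reads still overlap, whereas a 1 nt difference in trim length yields a different string and therefore a different ASV.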

The other possibility is to collapse your features by taxonomy before running alpha and beta diversity tests. Hopefully, similar sequences will have similar taxonomic assignments. You will lose a lot of information, since ASV-level differences can be important for differentiating samples, but it is the best you can do if standardizing the processing steps (e.g., dada2) still leads to pronounced batch effects. Collapsing may also fail if the batch effects are due to per-run contaminants (e.g., if your sites are still not clustering together afterwards), but right now this is clearly caused by different processing parameters, not by per-run contaminants.
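As a rough sketch of what collapsing does, here is a plain-Python version with made-up sequences, counts, and genus labels. `collapse_by_taxonomy` is a hypothetical helper written for this illustration; on a real feature table you would use the `qiime taxa collapse` action instead.

```python
from collections import defaultdict


def collapse_by_taxonomy(sample_counts, taxonomy):
    """Sum ASV counts that share the same taxonomic label."""
    collapsed = defaultdict(int)
    for asv, count in sample_counts.items():
        collapsed[taxonomy.get(asv, "Unassigned")] += count
    return dict(collapsed)


# Made-up assignments: the 1 nt longer variants receive the same genus label.
taxonomy = {
    "ACGTACGTAC": "g__Bacteroides",
    "GACGTACGTAC": "g__Bacteroides",
    "TTGACCGTAA": "g__Prevotella",
    "GTTGACCGTAA": "g__Prevotella",
}

run1_sample = {"ACGTACGTAC": 120, "TTGACCGTAA": 30}
run2_sample = {"GACGTACGTAC": 95, "GTTGACCGTAA": 41}

run1_collapsed = collapse_by_taxonomy(run1_sample, taxonomy)
run2_collapsed = collapse_by_taxonomy(run2_sample, taxonomy)

# No ASVs are shared between the runs, but the collapsed feature sets match.
print(set(run1_sample) & set(run2_sample))
print(set(run1_collapsed) == set(run2_collapsed))
```

The price is visible here too: any distinction between ASVs within the same genus is gone after collapsing.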