Running dada2 with different batch data

luqrei · April 17, 2020, 2:10pm

Hi everyone!

I'm having trouble deciding which results are the best.

I have microbiome data that was sequenced on a MiSeq in five batches.

For the subsequent analysis, I only need 110 samples, which are scattered across the second and third batches.

My problem is deciding which pipeline is best to remove the batch effect.

For the first pipeline, I merged all batches, ran dada2 on the pooled data, and then selected the 110 samples. I know this is not the correct way to run dada2, but I tried it anyway. The data are microbiomes from HIV-positive subjects and controls. The samples were selected so that the pairs (control and positive) have the same distribution of age, sex, and location.

Then I ran a principal coordinates analysis (with Bray-Curtis distance). The result is like this:

It seems nicely distributed between case (positive) and control.

Then I compared this with the right way to run dada2: I ran dada2 separately on each batch (in this case all 5 batches), merged the results, filtered the feature table and representative sequences down to the 110 samples, then classified the taxonomy and built a rooted phylogeny from those 110 samples alone.

Here, the principal coordinates analysis looks like this:

Now the samples seem to separate by batch (the 110 samples all come from batches 2 and 3).

Any ideas on this phenomenon?

Any help is appreciated.

Thank you.

Nicholas_Bokulich · April 20, 2020, 4:01pm

Hi @luqrei,

You need to ensure that all batches are processed with the exact same trimming and truncation settings, so that the reads are exactly the same length. Even a 1-nt difference could cause the issues that you are seeing with Bray-Curtis distance.
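To illustrate why a 1-nt difference matters, here is a minimal sketch in plain Python. It is not dada2 or QIIME 2 code. The reads, the truncation lengths, and the helper names `asv_table` and `bray_curtis` are all made up for the example. It treats every distinct truncated sequence as its own feature, which is roughly how exact sequence variants behave, and shows that the same reads truncated at two different lengths share no features at all.

```python
from collections import Counter

def asv_table(reads, trunc_len):
    """Count exact sequence variants after truncating each read to trunc_len.
    Hypothetical stand-in for a denoising step; reads shorter than trunc_len are dropped."""
    return Counter(r[:trunc_len] for r in reads if len(r) >= trunc_len)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two feature-count dictionaries."""
    features = set(a) | set(b)
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in features)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total

# Toy "community": identical reads in both batches (made-up sequences).
reads = ["ACGTACGTAC"] * 6 + ["TTGCATGCAA"] * 4

# Same truncation length in both batches: features match exactly.
same_settings = bray_curtis(asv_table(reads, 9), asv_table(reads, 9))

# Truncation differs by 1 nt: every feature is a different string.
off_by_one = bray_curtis(asv_table(reads, 9), asv_table(reads, 8))

print(same_settings, off_by_one)
```

In this toy case the biological content is identical, yet the off-by-one batches look maximally dissimilar because Bray-Curtis only compares features by identity. That is the kind of artificial batch separation that mismatched trimming or truncation settings can produce.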

Another possibility is to use phylogenetic methods like UniFrac instead of Bray-Curtis.
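As a rough illustration of why a phylogenetic metric is less sensitive here, the sketch below computes unweighted UniFrac by hand on a tiny tree. Everything in it is an assumption for the example: the tip names, the branch lengths, and the idea that the two length variants of the same organism end up as sister tips on very short branches. It is not the QIIME 2 implementation.

```python
# Hypothetical toy tree, stored as (branch length, set of tips below that branch).
branches = [
    (0.01, {"asv_A_len9"}),                # variant seen only in batch 2
    (0.01, {"asv_A_len8"}),                # 1-nt shorter variant seen only in batch 3
    (0.50, {"asv_A_len9", "asv_A_len8"}),  # shared ancestral branch of both variants
    (0.60, {"asv_B"}),                     # unrelated taxon present in both batches
]

def unweighted_unifrac(sample1, sample2, branches):
    """Fraction of observed branch length that leads to only one of the two samples."""
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in1 = bool(tips & sample1)
        in2 = bool(tips & sample2)
        if in1 or in2:
            observed += length
        if in1 != in2:
            unique += length
    return unique / observed

batch2_sample = {"asv_A_len9", "asv_B"}
batch3_sample = {"asv_A_len8", "asv_B"}

distance = unweighted_unifrac(batch2_sample, batch3_sample, branches)
print(round(distance, 4))
```

Under these assumptions the two samples come out nearly identical, because almost all of the branch length is shared, whereas a metric that compares features by identity would treat the two variants as unrelated. How close such variants actually sit in a real tree depends on the alignment and tree-building steps.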

Good luck!

luqrei · April 28, 2020, 5:37am

Thank you so much for your helpful answer.