I'm having trouble deciding which results are the best.
I have microbiome data sequenced on a MiSeq across five batches.
For the downstream analysis, I only need 110 samples, which are spread across the second and third batches.
My problem is deciding which pipeline is better for removing the batch effect.
For the first pipeline, I merged all batches, ran dada2 on the pooled data, and then selected the 110 samples. I know this is not the correct way to run dada2, but I tried it anyway. The data are microbiome profiles of HIV-positive and control subjects. The samples were selected so that the two groups (control and positive) have the same distribution of age, sex, and location.
Then I ran a principal coordinates analysis (PCoA) on Bray-Curtis distances. The result looks like this:
The case (positive) and control samples seem nicely distributed.
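For reference, here is a minimal sketch of the Bray-Curtis dissimilarity that the ordination is built on. It is plain Python rather than what my pipeline actually runs, and the sample names and counts are made-up toy values. Bray-Curtis is sensitive to sequencing depth, so this assumes the counts have already been rarefied or otherwise normalized.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("count vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(counts):
    """Pairwise Bray-Curtis matrix for a dict of sample -> count vector."""
    names = sorted(counts)
    matrix = [[bray_curtis(counts[a], counts[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical toy counts (three ASVs per sample), for illustration only.
toy_counts = {
    "case_1": [10, 0, 5],
    "ctrl_1": [8, 2, 5],
    "ctrl_2": [0, 12, 3],
}
names, dm = distance_matrix(toy_counts)
```

Two samples that share no features at all get a dissimilarity of exactly 1, which is worth keeping in mind when comparing samples from different runs.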
Then I compared this with the correct way to run dada2: I ran dada2 separately on each batch (all 5 batches in this case), merged the results, filtered the feature table and representative sequences down to the 110 samples, then classified the taxonomy and built a rooted phylogeny from those 110 samples alone.
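To make the merge-then-filter step concrete, here is a rough sketch in plain Python. It is not the actual tool I used; the function names `merge_tables` and `filter_samples` and the toy sequences are hypothetical. It assumes every batch was trimmed and truncated with the same parameters, because features are matched by exact sequence and the same organism would otherwise show up as a different feature in each batch.

```python
from collections import defaultdict


def merge_tables(batch_tables):
    """Merge per-batch tables (sample -> {asv_sequence: count}) into one table."""
    merged = {}
    for table in batch_tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample ID across batches: {sample}")
            merged[sample] = dict(counts)
    return merged


def filter_samples(table, keep):
    """Keep the selected samples and drop ASVs that are absent from all of them."""
    subset = {s: table[s] for s in keep}
    totals = defaultdict(int)
    for counts in subset.values():
        for asv, n in counts.items():
            totals[asv] += n
    kept_asvs = sorted(a for a, n in totals.items() if n > 0)
    dense = {s: [subset[s].get(a, 0) for a in kept_asvs] for s in keep}
    return dense, kept_asvs


# Hypothetical toy batches; real ASV sequences are far longer.
batch2 = {"s1": {"ACGT": 5, "TTGA": 1}}
batch3 = {"s2": {"ACGT": 3, "GGCC": 7}, "s3": {"CCCC": 4}}
merged = merge_tables([batch2, batch3])
dense, asvs = filter_samples(merged, ["s1", "s2"])
```

Here `s1` and `s2` share the feature `ACGT` only because the sequence is identical in both batches.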
Here, the principal coordinates analysis (PCoA) looks like this:
The samples seem to separate by batch (the 110 samples all come from batches 2 and 3).
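One way to put a number on "separated by batch" instead of judging the plot by eye is a PERMANOVA-style pseudo-F test on the distance matrix, run once with batch labels and once with case/control labels. Below is a minimal pure-Python sketch of that idea. It is not something I have run on my data, the function names are hypothetical, and the toy distances are invented so that the "batch" grouping is obvious.

```python
import random


def pseudo_f(dm, labels):
    """PERMANOVA pseudo-F for a square distance matrix and one label per sample."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dm[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(
            dm[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]
        ) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dm, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value) by shuffling the labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dm, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dm, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: six samples on a line, forming two tight clusters.
positions = [0, 1, 2, 10, 11, 12]
toy_dm = [[abs(a - b) for b in positions] for a in positions]
batch = [2, 2, 2, 3, 3, 3]
status = ["case", "ctrl", "case", "ctrl", "case", "ctrl"]

f_batch, p_batch = permanova(toy_dm, batch)
f_status, p_status = permanova(toy_dm, status)
```

In this toy example the pseudo-F for batch is much larger than for case/control status, which is the pattern I would expect if batch dominates the ordination.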
Any ideas on what could explain this difference?
Any help is appreciated.