I have two sequence datasets from different sequencing runs.
When I ran DADA2 on the two datasets separately, each with its own parameters, I got significantly different numbers of ASVs between the two runs.
My question is: how do I know whether the difference is caused by DADA2 (different parameters), a batch effect (different runs), or a real effect?
Thank you!
Hi @11112,
Which region are you using? Assuming you are using bacterial 16S with almost fully overlapping reads (as in the V4 region, for example), different trimming lengths (set with the ‘--trim’ parameters, which affect the initial part of the read) will give you different sets of ASVs.
That is because DADA2 does not cluster the sequences; you can think of ASVs as sequences dereplicated at 100% similarity. Hence:
5’ATGCGTGCGT3’ (obtained with shorter trimming length)
5’   CGTGCGT3’ (obtained with longer trimming length)
are seen as different ASVs!
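To make the point concrete, here is a minimal Python sketch of the toy example above. It is not DADA2 itself: `trim_left` is a hypothetical helper that only mimics removing bases from the start of a read, and the read is the toy sequence from this post, not real data. It shows that exact-match comparison treats the two differently trimmed reads as distinct, and that trimming both to the same start position makes them identical again.

```python
def trim_left(read: str, n: int) -> str:
    """Hypothetical helper: drop the first n bases of a read."""
    return read[n:]


# Toy read from the example above (not real data).
read = "ATGCGTGCGT"

run_a = trim_left(read, 0)  # shorter trimming length
run_b = trim_left(read, 3)  # longer trimming length

# ASVs are compared as exact sequences, so any difference in
# length or content makes them count as different ASVs.
same_asv = run_a == run_b

# Trimming both runs to the same start position removes the
# artificial difference.
same_after_common_trim = trim_left(run_a, 3) == trim_left(run_b, 0)

print(run_a, run_b, same_asv, same_after_common_trim)
```

This is the reason to rerun both datasets with identical trimming before comparing ASV counts.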
So my suggestion is to run both datasets with the same parameters (which is the common recommendation, at least on this forum) to rule out this factor.
Are the samples in the two runs similar in taxonomic composition (the same type of samples), or could they differ a lot?
Do you have any positive controls or known samples in both runs? If so, it may be worth looking at them to see whether they differ from what you expect.
Were the samples in both runs processed at the same time and with the same reagent kits, or with different kits, lab conditions, or storage conditions? Any of these may affect the bacteria in the samples, introduce bias, and result in batch effects.
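For the positive-control check mentioned above, one simple way to quantify agreement is to compare the sets of ASV sequences recovered from the same control in each run. The sketch below is only an illustration: the sequences are made-up placeholders, and `jaccard` is a hypothetical helper, not part of DADA2 or QIIME 2. It assumes both runs were already denoised with the same trimming parameters, so that identical organisms give identical ASV sequences.

```python
def jaccard(a: set, b: set) -> float:
    """Hypothetical helper: shared fraction of two ASV sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up placeholder ASV sequences for the same control sample
# sequenced in two runs (not real data).
control_run1 = {"AAAC", "AAGT", "CCGT"}
control_run2 = {"AAAC", "AAGT", "GGTA"}

shared = control_run1 & control_run2
only_run1 = control_run1 - control_run2
only_run2 = control_run2 - control_run1
similarity = jaccard(control_run1, control_run2)

print(sorted(shared), sorted(only_run1), sorted(only_run2), similarity)
```

If a control that should be identical in both runs shares few ASVs, that points to a processing or batch effect rather than a real biological difference.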
Another way to test this is to denoise both runs together with Deblur, which was designed to process samples of different origin, and see whether the batch effect is still present.
Hope it helps
Luca