I have a quick question about the bioinformatics of working with multiple libraries. I just received the data for two libraries and have performed the denoising step. I saw in this tutorial that I have to merge the outputs from the denoised data sets first and then proceed with the analysis.
I merged the `FeatureTable[Frequency]` and `FeatureData[Sequence]` artifacts from both libraries and then ran the code below to get the taxonomy output. Everything seems to be working well, but when I looked at jaccard_emperor.qzv, I can see that the data clusters, but it clusters into four sections (burned, unburned, burned, unburned).
*(I colored the plot by site: Sites 1&ndash;6 in red (burned), Sites 7&ndash;9 in green (unburned). When I go to the Visibility tab and un-click burned/unburned, the corresponding colors disappear.)*
I am assuming I did something wrong, since it seems to be caused by the two different libraries, but I am not sure whether that is actually the case. Could you please let me know if what I did makes sense? I can also send you other code and/or files if you need to see them.
Thanks in advance
Code for merging data (input/output filenames are placeholders except the merged rep-seqs, which I use below):

```
qiime feature-table merge \
  --i-tables table-run1.qza table-run2.qza \
  --o-merged-table Fungal-TableMerged.qza

qiime feature-table merge-seqs \
  --i-data rep-seqs-run1.qza rep-seqs-run2.qza \
  --o-merged-data DADA2/Fungal-RepSeqsMerged.qza
```
Code for taxonomy (the comments after a line continuation in my original post broke the command, so I have moved them below; the classifier filename is a placeholder for my pre-trained classifier artifact):

```
qiime feature-classifier classify-sklearn \
  --i-classifier its-classifier.qza \
  --i-reads DADA2/Fungal-RepSeqsMerged.qza \
  --p-read-orientation reverse-complement \
  --o-classification Fungal-taxonomy-paired-RC.qza
# --i-reads: merged rep-seqs
# --p-read-orientation reverse-complement: I am working with the ITS2 primers
# --o-classification: output
```
Welcome back @Fabs!
Yes indeed, this looks like a typical batch-effect issue.
Note that batch effects will be most pronounced with Jaccard distance, since it measures the proportion of features that are NOT shared by each pair of samples. If there is even the subtlest difference between runs (say, sequences that are identical except 1 nt longer in one run), then the between-run Jaccard distances will all equal 1.0.
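To make that concrete, here is a small sketch (with made-up ASV sequences) of why a 1-nt trimming difference drives between-run Jaccard distances to 1.0. Jaccard distance is computed on the *sets* of observed feature IDs, and in DADA2 the feature ID is the exact sequence, so a 1-nt difference makes every feature "new":

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets of observed features:
    1 - |intersection| / |union|."""
    return 1.0 - len(a & b) / len(a | b)

# Same underlying biology in both runs (hypothetical ASVs)...
run1_asvs = {"ACGTACGTAA", "TTGCATGCAT", "GGCCTTAAGG"}

# ...but run 2 was trimmed 1 nt shorter, so every sequence (and thus
# every feature ID) differs, and zero features are shared between runs.
run2_asvs = {s[:-1] for s in run1_asvs}

print(jaccard_distance(run1_asvs, run2_asvs))  # 1.0: no shared features
print(jaccard_distance(run1_asvs, run1_asvs))  # 0.0: identical runs
```

Abundances never enter the calculation, which is why Jaccard is so unforgiving here compared to quantitative metrics.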
The solution? The best you can do is standardize the processing of the two runs as much as possible. It looks like you processed your runs in different ways, e.g., with different trim/trunc lengths in DADA2. The trunc lengths might not matter in theory, since your paired-end reads should be overlapping; but who knows, so the extra-suspicious among us (a.k.a. those trying to merge multiple runs) may want to use a standard trunc length (use the shorter of the two) to keep things absolutely the same. The trim lengths, however, should definitely be the same for both runs: see my note about 1-nt differences above to understand why.
The other possibility is to collapse your features by taxonomy before running alpha and beta diversity tests; hopefully the similar sequences will receive similar taxonomic assignments. You will lose a lot of information, since ASV-level differences can be important for differentiating samples, but it is the best you can do if standardizing processing steps (e.g., DADA2) still leads to pronounced batch effects. (It may still fail if the batch effects are due to per-run contaminants, e.g., if your sites still do not cluster together, but right now this is clearly caused by different processing parameters, not by per-run contaminants.)
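For intuition, here is a toy sketch (made-up counts and IDs, using pandas rather than QIIME 2) of roughly what `qiime taxa collapse` does: two ASVs that differ only because of run-specific trimming receive the same genus label, so their counts merge and the run-level difference disappears at that taxonomic level:

```python
import pandas as pd

# samples x ASVs count table (all IDs and counts are hypothetical)
table = pd.DataFrame(
    {"asv1": [10, 0], "asv2": [0, 12], "asv3": [5, 7]},
    index=["sample_run1", "sample_run2"],
)

# ASV -> taxonomy: asv1 and asv2 are run-specific variants of one genus
taxonomy = {
    "asv1": "g__Fusarium",
    "asv2": "g__Fusarium",
    "asv3": "g__Mortierella",
}

# Group feature columns by their taxonomic label and sum the counts
collapsed = table.T.groupby(taxonomy).sum().T
print(collapsed)
# Both samples now show g__Fusarium, even though they shared no ASVs.
```

In QIIME 2 itself you would use `qiime taxa collapse` with your feature table, your taxonomy artifact, and the taxonomic level to collapse to, then recompute the diversity metrics on the collapsed table.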
Thank you so much for your help. I knew there was something I did wrong, but was not sure where the error was. So thank you for pointing me in the right direction.
I did as you suggested here and ran DADA2 with the same parameters for both libraries, and that fixed the problem. I no longer have a batch effect.
Again, thank you for explaining the problem to me and offering me a solution. I really appreciate it.