Merging sequencing runs and batch effects

Sunil · October 23, 2020, 12:38pm

Hello

I am using qiime2-2019.10 in a conda environment installed on the university HPC.
I am analyzing fecal microbiome data from 4 plates sequenced on an Illumina MiSeq in two sequencing runs (each run included 2 plates of samples, i.e. 96 × 2 = 192). Initially I ran DADA2 denoising with the following commands, separately for each run:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs run1.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 12 \
  --p-trunc-len-r 151 \
  --p-trunc-len-f 150 \
  --o-denoising-stats dada-denoise-stats_run1.qza \
  --o-table dada_table_run1.qza \
  --o-representative-sequences dada-rep-seqs_run1.qza

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs run2.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-r 150 \
  --p-trunc-len-f 150 \
  --o-denoising-stats dada-denoise-stats_run2.qza \
  --o-table dada_table_run2.qza \
  --o-representative-sequences dada-rep-seqs_run2.qza

I then merged the outputs (the run information was included in the metadata file) and performed the beta diversity analysis, which showed two strong clusters corresponding to the sequencing runs, so I assumed that the two runs had a very strong batch effect.

Later I reanalyzed the data using exactly the same trimming and truncation parameters in DADA2 for both sequencing runs, then merged the outputs with 'qiime feature-table merge', and the results were drastically different: in the beta diversity analysis, the clustering by sequencing run was completely lost. So I would like to know whether a 2-3 base difference in trimming can cause such a huge difference in the outcome?
bray_curtis_20000_Seq.run_anosim(1).qzv (1.0 MB) bray_curtis_unifrac_Seq.run_20000_ADONIS(1).qzv (310.6 KB) bray_curtis_20000_Seq.run_permanova(1).qzv (1.0 MB)

To find out whether the clustering in the PCoA is significant, I ran beta diversity significance tests on the later analysis using the 'beta-group-significance' command, with both PERMANOVA (pseudo-F = 1.02031, p-value = 0.382) and ANOSIM (R = 0.000258745, p-value = 0.369). If I understand correctly, these results say there is no significant difference between the sequencing runs. However, in adonis the R2 value (1.000000e+00) and P = 1.0 are very high, which (if I am reading it right) would indicate a strong influence of sequencing run on the data distribution. Or am I misinterpreting the results completely?
I hope my questions are clear.
Thank you.

ChrisKeefe · October 23, 2020, 11:35pm

Welcome to the forum, @sunil!

I haven't experimented with this myself and can't confirm definitively, but I suspect that this artificial batch effect is why the best practice is to use identical trim/trunc parameters whenever denoising runs that are to be merged after DADA2.
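To make the suspected mechanism concrete, here is a minimal Python sketch (plain standard library, not QIIME 2 or DADA2 code). It assumes that the merged sequence loses `trim_left_f` bases at its 5' end and `trim_left_r` bases at its 3' end, and that feature IDs are MD5 hashes of the sequence, which as far as I know is the DADA2 plugin's default. The amplicon string, the counts, and the helper names are made up for illustration.

```python
import hashlib


def merged_amplicon(amplicon: str, trim_left_f: int, trim_left_r: int) -> str:
    """Simplified model of a merged read pair.

    The forward trim removes bases from the 5' end of the amplicon; the
    reverse trim removes bases from the 3' end, because the reverse read
    starts at the far end of the amplicon.
    """
    return amplicon[trim_left_f:len(amplicon) - trim_left_r]


def feature_id(seq: str) -> str:
    """MD5 hash of the sequence, used here as the feature ID."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


def bray_curtis(a: dict, b: dict) -> float:
    """Bray-Curtis dissimilarity between two samples given as {feature: count}."""
    features = set(a) | set(b)
    numerator = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    denominator = sum(a.values()) + sum(b.values())
    return numerator / denominator


# Hypothetical amplicon from the same organism, present in both runs.
amplicon = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTG"

# Trim settings from the original post: run 1 used trim-left-r 12, run 2 used 13.
seq_run1 = merged_amplicon(amplicon, trim_left_f=13, trim_left_r=12)
seq_run2 = merged_amplicon(amplicon, trim_left_f=13, trim_left_r=13)

# One sample per run, each containing only this organism (made-up counts).
sample_run1 = {feature_id(seq_run1): 100}
sample_run2 = {feature_id(seq_run2): 100}
distance_mismatched = bray_curtis(sample_run1, sample_run2)

# Same organism again, but with identical trim settings for both runs.
seq_run1_same = merged_amplicon(amplicon, trim_left_f=13, trim_left_r=13)
sample_run1_same = {feature_id(seq_run1_same): 100}
distance_matched = bray_curtis(sample_run1_same, sample_run2)

print(len(seq_run1), len(seq_run2))
print(feature_id(seq_run1) == feature_id(seq_run2))
print(distance_mismatched, distance_matched)
```

Under these assumptions the two runs never share a feature ID even when the biology is identical, so every between-run comparison looks maximally dissimilar, while within-run comparisons are unaffected. That would produce exactly the kind of run-driven clustering described above.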

I'm not a statistician, but I suspect you've just got your p-values on backwards. Generally low p-values (e.g. 0.0001) indicate significance.

Another thing to note: when working with model-based tools like Adonis, your p-value is only meaningful within the scope of the model. In other words, you might have a great p-value, but if your data doesn't meet the model assumptions, that p-value may be meaningless. The residuals plots in the Adonis visualizers can help in diagnosing this, but if you're not sure, it's always worth asking a statistician.

Happy :qiime2:-ing!
Chris

Sunil · October 26, 2020, 5:06pm

Thank you @ChrisKeefe!
From the analysis it seems that it was indeed a case of an artificial batch effect. I totally agree that we must use identical trimming and truncation parameters when denoising data from different sequencing runs.
You mentioned that I've got my p-values on backwards. Any comments or suggestions on why this is happening? Is it because of the data itself or the command I used? And how can we resolve this?
The following is the command I used:

qiime diversity adonis \
  --i-distance-matrix bray_curtis_distance_matrix.qza \
  --m-metadata-file merged_mapfile.tsv \
  --p-formula Seq.run \
  --o-visualization bray_curtis_Seq.run_20000_ADONIS.qzv

Thank you

ChrisKeefe · October 26, 2020, 5:52pm

Sorry if my joke was unclear, @Sunil.

A high p-value indicates weak evidence against the null hypothesis - we would fail to reject. In other words, assuming your model assumptions are all met, a low p-value would indicate significance.
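If it helps, here is a toy permutation test in plain Python showing where such a p-value comes from and how to read it. This is only a sketch under simplifying assumptions: it uses a difference of group means as the test statistic, not the pseudo-F that PERMANOVA computes on a distance matrix, and the function name and data are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=999, seed=0):
    """Permutation p-value for the absolute difference in group means.

    Group labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose statistic is at least as extreme as the observed one
    (with +1 in numerator and denominator to count the observed data).
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    size_a = len(group_a)

    def statistic(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if statistic(pooled[:size_a], pooled[size_a:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Clearly separated groups: shuffled labels rarely reproduce the observed gap.
p_separated = permutation_p_value([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16])

# Interleaved groups: shuffled labels often look at least as extreme.
p_interleaved = permutation_p_value([1, 3, 5, 7], [2, 4, 6, 8])

print(p_separated, p_interleaved)
```

The separated groups give a small p-value (reject the null hypothesis of no group difference), while the interleaved groups give a large one (fail to reject). A p-value near 0.38, like the PERMANOVA result above, falls in the second category.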
