Is it necessary to transform the read count data into relative abundance form before normalization when analyzing two different ecosystems (different batches)?


I am considering conducting a correlation analysis using a linear model and a heat map, as well as co-occurrence network analysis using samples from two different body sites in one host.

The NGS run for each body site was done separately, and as a result, the total read counts and abundance of the two different sites are quite different. Hence, I am considering using normalization methods (CLR) to combine the two sets of data before conducting the analysis.

However, I am confused about the starting point of normalization. Is it necessary to transform the read count data into a form of relative abundance before merging and normalizing to adjust for the total read count from different batches (body sites)?

I am not very good at statistics, so I would appreciate some advice from people who have experience with similar analyses or have seen related studies.

If anyone has used normalization before conducting a correlation analysis, please share your experience. Any advice would help me gain a better understanding.

Hi @SingeunOh!

What kind of hypotheses are you looking to test?

Good idea, and it turns out you really need CLR even within a run, as we end up with a fixed but uncontrolled sequencing depth per sample. This means that if you know the (arbitrary) sequencing depth of a sample, you only have N-1 degrees of freedom in the abundances of the N observed ASVs/OTUs.

However, there's a slightly different problem you will want to consider: since each body site became its own run, you will be unable to differentiate effects between body sites from effects between sequencing runs/preparation (i.e. batch effects). There's not a good way to overcome this. Was the sequencing/sample preparation partitioned strictly by body site, or are there hopefully some samples from different body sites within a single run?

Fortunately no, you don't need to convert to relative abundance first; the CLR handles this in a more comprehensive way.
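To illustrate why the relative-abundance step is not needed, here is a minimal sketch using plain NumPy. The `clr` helper and the count table are made up for illustration (they are not from any particular library or from your data). It shows that the CLR of raw counts and the CLR of relative abundances are identical, that two samples with the same composition but very different depths get the same CLR values, and that each sample's CLR values sum to zero (the N-1 degrees of freedom mentioned above).

```python
import numpy as np

def clr(x, pseudocount=0.0):
    """Centered log-ratio transform of each row (sample) of x.

    Hypothetical helper for illustration: log of each value minus the
    mean log value of its sample (i.e. log of value / geometric mean).
    """
    x = np.asarray(x, dtype=float) + pseudocount
    if np.any(x <= 0):
        raise ValueError("CLR needs strictly positive values; "
                         "handle zeros first (e.g. with a pseudocount)")
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Made-up counts: rows are samples, columns are taxa.
counts = np.array([
    [120, 30, 850],        # shallow sample
    [12000, 3000, 85000],  # same composition, 100x deeper
])

# Relative abundances: each row divided by its total read count.
rel_abund = counts / counts.sum(axis=1, keepdims=True)

clr_counts = clr(counts)
clr_rel = clr(rel_abund)

print(np.allclose(clr_counts, clr_rel))            # counts vs. relative abundance
print(np.allclose(clr_counts[0], clr_counts[1]))   # shallow vs. deep sample
print(np.allclose(clr_counts.sum(axis=1), 0.0))    # each row sums to zero
```

One caveat: this exact equivalence assumes there are no zeros. If you add a pseudocount to handle zeros, adding it to raw counts and adding it to relative abundances will not give the same result, so that choice should be made deliberately.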

Once you have CLR features instead of regular abundances, a plain linear model ought to be workable. However, you are still going to need to define a reasonable one and figure out how to handle the aforementioned batch effects.
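As a rough sketch of what such a linear model could look like, the snippet below regresses the CLR value of one feature at one body site on the CLR value of the same feature at the other site. Everything here is an assumption for illustration: the data are synthetic, the variable names are hypothetical, and it presumes the same hosts were sampled at both sites so the values can be paired. It does not do anything about the batch effect, which remains confounded with body site.

```python
import numpy as np

# Synthetic, paired CLR values for a single feature (NOT real data).
# Assumption: entry i of each array comes from the same host.
rng = np.random.default_rng(0)
n_hosts = 20
site_a = rng.normal(loc=0.0, scale=1.0, size=n_hosts)
site_b = 0.8 * site_a + rng.normal(loc=0.0, scale=0.5, size=n_hosts)

# Ordinary least squares fit: site_b ~ intercept + slope * site_a.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(site_a, site_b, deg=1)

# Pearson correlation between the paired CLR values.
r = np.corrcoef(site_a, site_b)[0, 1]

print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

For real data you would want a proper test of the slope (and a correction for testing many features), which a statistics package can provide; this only shows the shape of the model.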

Thank you!! I understand what you mean. Yes, the separate runs would be a problem. Nevertheless, I have found that CLR would be a good way to handle this situation. My hypothesis is that the abundance of a genus at one body site is significantly positively correlated with the abundance of the same genus at the other body site, which would be one way to show transfer from one site to the other. Thanks a lot.