The effect of abundance matrix on unweighted unifrac distance

We have an abundance matrix named “mat”, and two other matrices derived from it: mat/100 and log(mat). Since unweighted unifrac distance does not take abundance information into account, we expected the unifrac distances computed from mat, mat/100, and log(mat) to be the same. However, the three unweighted unifrac distance matrices are different. We are confused and looking forward to your help. Thank you!

Hey @guojun_wu,

That is a little surprising for the mat/100 case, but perhaps less so for the log(mat) case as there will likely be many zeros and so it could be dropping those samples/features implicitly or generating NaN.
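To make the log(mat) concern concrete, here is a small NumPy sketch. The table values are invented for illustration, and this is not the code QIIME 2 runs; it only shows what a plain natural log does to zeros and to counts of 1.

```python
import numpy as np

# Hypothetical 2-sample x 4-feature abundance table (values are made up).
mat = np.array([[0.0, 1.0, 2.0, 150.0],
                [5.0, 0.0, 1.0, 30.0]])

# log(0) is -inf; silence the divide-by-zero warning for this demo.
with np.errstate(divide="ignore"):
    logged = np.log(mat)

# Zeros in the original table become -inf after the transform.
neg_inf_mask = np.isneginf(logged)

# Presence/absence before and after the transform.
# A count of exactly 1 becomes log(1) = 0, so it now looks absent.
presence_before = mat > 0
presence_after = logged > 0

print(logged)
print("features present before:", presence_before.sum())
print("features present after: ", presence_after.sum())
```

So even before any integer conversion, a log transform can change which features look present, which is all a qualitative metric sees.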

Would it be possible to provide the code/program you are using to normalize your feature table? Or failing that, at least the 3 tables that were generated?

Hi Evan, thank you for your reply.
It seems this is about the value 1, not the value 0. In my mat/100 matrix, several values are < 1, and I think “qiime diversity beta-phylogenetic” treats all values < 1 as 0 when calculating the unweighted unifrac distance. Is that right? That would explain why the unweighted unifrac distances from mat and mat/100 are different. We had assumed the script would treat non-zero values < 1 as 1 in the unweighted unifrac distance calculation, but the cutoff seems to be 1. I also have a matrix whose minimum non-zero value is > 1, and I converted it into a 1/0 matrix. The unweighted unifrac distances from these two are the same. So a relative abundance matrix is not suitable for this calculation and we should use the downsized one, right? BTW, why is the cutoff set to 1 rather than 0 when calculating the unweighted unifrac distance?

Hi @guojun_wu,

Ah, that actually makes a lot of sense. You are correct: you should only calculate unweighted unifrac on the original table.

The reason this is happening is scipy/numpy will use a "floor" function when converting from a floating point number to an integer (this is a very common way of handling float coercion as it is very computationally cheap to do). This means values like 0.7 will become 0 and values like 1.7 will become 1. Since we're dealing with a qualitative metric, the new zeros everywhere will change the results considerably.
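A minimal NumPy sketch of that effect is below. The table is invented, and `presence_after_int_cast` is a hypothetical helper written for this post, not a QIIME 2 or scikit-bio function. Note that `astype(int)` truncates toward zero, which matches "floor" for non-negative abundances.

```python
import numpy as np

# Hypothetical 2-sample x 4-feature count table (values are made up).
mat = np.array([[0, 70, 170, 3],
                [120, 0, 50, 400]], dtype=float)
scaled = mat / 100.0  # the "mat/100" table from the question


def presence_after_int_cast(table):
    """Cast to integers first, then ask which features are non-zero."""
    return table.astype(int) > 0


original_presence = presence_after_int_cast(mat)
scaled_presence = presence_after_int_cast(scaled)

# 0.7 -> 0 and 0.03 -> 0, so those features vanish after the cast.
print(scaled.astype(int))
print("present in mat:    ", original_presence.sum())
print("present in mat/100:", scaled_presence.sum())

# Binarizing before any cast gives the same answer at either scale.
same_if_binarized_first = np.array_equal(mat > 0, scaled > 0)
print("same presence/absence if binarized first:", same_if_binarized_first)
```

If you need to work from a rescaled table, one option under these assumptions is to convert it to 1/0 yourself (for example `(table > 0).astype(int)`) before computing the distance, which is consistent with the 1/0 matrix test described above.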

So there doesn't appear to be anything wrong going on, but it certainly is surprising at first glance :slight_smile:

Thank you Evan :grinning:
