Dear all!
Me again.
I produced a taxabarplot file with grouped by niche samples:
Niche_taxabarplot.qzv
commands:
qiime feature-table group \
--i-table Infected_table.qza \
--p-axis 'sample' \
--m-metadata-file Metadata/metadata.tsv \
--m-metadata-column Niche \
--p-mode 'mean-ceiling' \
--o-grouped-table Infected_Niche_grouped.qza
qiime taxa barplot \
--i-table Infected_Niche_grouped.qza \
--i-taxonomy Data/combo_taxonomy.qza \
--m-metadata-file Metadata/Niche_metadata \
--o-visualization $taxabar Niche_taxabarplot.qzv
Then we decided to test the differences between the same Phylum in two different niches - roots and galls. But in taxabarplot, abundances already grouped by Niches. So I created another taxabarplot with original table:
Infected_taxabarplot.qzv
commands:
qiime taxa barplot \
--i-table Data/tables/Infected_table.qza \
--i-taxonomy Data/combo_taxonomy.qza \
--m-metadata-file Metadata/metadata.tsv \
--o-visualization Results/Taxa_barplots/Infected_taxabarplot.qzv
Now I want to apply Mann–Whitney U test to test differences in relative abundances of top 10 features between two niches:
The code in Python 3 I used:
import pandas as pd
from numpy import average
from itertools import combinations
from scipy.stats import mannwhitneyu as mwu
from statsmodels.stats.multitest import multipletests
def bar_unzip(qza,lev):
a = !unzip $qza
digest = a[1].split('/')[0].replace(' inflating: ','')
inf = digest + '/data/level-%s.csv'%lev
data = pd.read_csv(inf, sep=',',index_col=0)
!rm -r $digest
return data
def relative(df):
cols = [col for col in df.columns if 'D_1__' in col]
df = df.loc[:,cols]
df.columns = [col.split('D_1__')[-1] for col in df.columns]
df = df.loc[:, (df.sum(axis=0) != 0)]
df = df.div(df.sum(axis=1), axis=0)*100
df = df.append(df.agg(['mean']))
df.sort_values(inplace=True,axis=1,by='mean',ascending=False)
df.drop(inplace=True,index='mean')
return df
data = bar_unzip('Results/Taxa_barplots/Infected_taxabarplot.qzv',2)
# Root vs Gall
r_df,g_df = relative(data.loc[data.Niche=='Root'].copy()),relative(data.loc[data.Niche=='Gall'].copy())
cols = list(set(r_df.columns.tolist()[0:10]+g_df.columns.tolist()[0:10]))
mwu_results = pd.DataFrame(columns=['mean_root','mean_gall','u','p'])
for col in cols:
root,gall = r_df[col].tolist(),g_df[col].tolist()
u, p = mwu(root,gall)
mwu_results.loc[col] = [average(root),average(gall),u,p]
mwu_results['q'] = multipletests(mwu_results.p,method='fdr_bh')[1]
mwu_results['p'] = round(mwu_results['p'],4)
mwu_results['q'] = round(mwu_results['q'],4)
mwu_results.sort_values(inplace=True,axis=0,by='mean_root',ascending=False)
mwu_results.to_csv('Results/Taxa_barplots/Root_vs_Gall_top10.csv')
display(mwu_results)
In the result, I got this table:
Here, we can see, that, for example, mean relative abundance of Proteobacteria in Root samples equals to 45.86, but in my previous Niche_taxabarplot file with grouped by Niche tables it is showing, that relative abundance of Proteobacteria in root samples equals to 42.79
Could someone help me to find a mistake in my code?