alpha diversity analysis in QIIME2 vs Jupyter Notebook (python)

Hello,

I'm trying to make a table of the Kruskal-Wallis results for 35 different variables. It would take a long time to copy and paste each value from the .qzv/view files. So I created a script in python using the tsv files produced in QIIME2 of Shannon, FaithPD, Evenness and ObservedOTUs metrics.

However, I noticed they don't produce the exact same results. Could you help me explain why? Are there any nuances I should take into account?

I'm using these commands in python:

import pandas as pd
from scipy.stats import kruskal
        # Perform Kruskal-Wallis test
        H, p = kruskal(category1_values, category2_values)

And I'm using this command in QIIME2:

!qiime diversity alpha-group-significance \
  --i-alpha-diversity faith_pd_vector.qza \
  --m-metadata-file sample-metadata-v2.tsv \
  --o-visualization faith-pd-group-significance_230625.qzv

I appreciate your help!
Malena

Hi @malenaamer, You should be getting the same result. I suspect there may be a difference in how the lists of values being compared are created. Could you provide more context on how you're computing your category1_values and category2_values? Ideally you could share code showing this starting from the same artifact that you're providing to alpha-group-significance.

If you haven't used QIIME 2 artifacts from within Python before, you can load your Faith PD vector and sample metadata into pandas objects (a pd.Series and pd.DataFrame, respectively) as follows:

In [1]: import qiime2

In [3]: import pandas as pd

In [4]: faith_pd = qiime2.Artifact.load('./faith_pd_vector.qza').view(pd.Series)

In [6]: sample_metadata = qiime2.Metadata.load('./sample-metadata.tsv').to_dataframe()
3 Likes

Thank you so much for the quick response, Greg!

I'm glad I asked the question because I definitely had a mistake in the list preparation as you suggested, I was actually skipping a row:

import pandas as pd
from scipy.stats import kruskal

# Read the Excel
#MISTAKE! -->data = pd.read_excel('Manually_made_file.xls', sheet_name='AlphaGraphs', header=0, skiprows=[1])
data = pd.read_excel('Manually_made_file.xls', sheet_name='AlphaGraphs', header=0)
columns_to_analyze = data.columns[7:73].tolist()
print(columns_to_analyze)
alpha_diversity_outcomes = ['Shannon', 'Observed_OTUs', 'FaithPD', 'Evenness']

# Kruskal-Wallis
results = []
for variable in columns_to_analyze:
    for outcome in alpha_diversity_outcomes:
        category1_values = data.loc[data[variable] == 0, outcome]
        category2_values = data.loc[data[variable] == 1, outcome]
        #print(category1_values)
        #print(category2_values)
        
        H, p = kruskal(category1_values, category2_values)

        results.append({
            'Testing Variable': variable,
            'Outcome': outcome,
            'H': H,
            'p-value': p, #only two categories, no q
        })

As you can notice in my code, I used a "Manually_made_file.xls". I made it by downloading all the .TSV files made from the alpha_diveristy.qzv files and copying the data into different columns in excel. Then I also merged the metadata file.

I had not called the QIIME2 artifacts from python yet, but I am going to update my script now so I don't have to manually copy and paste the downloaded .TSV file from the view.qiime2 page. I'm having trouble with the automatic merge because the .qza file has as the index the SampleIDs and then the data as a column. So I need to find a way to merge them properly.

2 Likes

Glad that you found an error - I silently followed your post since I am also applying Kruskal-Wallis test outside of qiime2 since it is faster for me and I don't remember when I checked for the last time if results are the same with qiime2.

I just add all alpha diversity metrics to metadata file:

def add_alpha(qza):  
    a = !unzip $qza
    out = a[1].split('/')[0].replace('  inflating: ', '')
    inf = f'{out}/data/alpha-diversity.tsv'
    df = pd.read_csv(inf, sep='\t', index_col=0)
    !rm -rf $out
    return df 


META = pd.read_csv('metadata.tsv', sep='\t', index_col=0)
ALPHAS = {
    'shannon': 'Shannon entropy', 
    'observed_features': 'Observed features',
    'faith_pd': 'Faith\'s PD',
    'evenness': 'Pielou\'s evenness'}


COREM = 'Results/Core-metrics'
for alpha in ALPHAS:
    data = add_alpha(f'{COREM}/{alpha}_vector.qza')
    META[alpha] = data.iloc[:,  0]

META.to_csv('alpha.tsv', sep='\t')

This file then can be used for different tests and plots.
I use dictionary for alpha metrics instead of list just because I added metric names as values that I usually plot on figures instead of metrics itself.

3 Likes

Thank you so much for sharing this, I'll give it a try!
Have you run a Lefse analysis in python?

Hi!
Yes, I run lefse analysis couple of times on the data front Qiime2 in Jupyter lab. Now I prefer Ancom-BC that is already installed in the latest versions of Qiime2 since I like this analysis more

2 Likes

I didn't know about this analysis, why do you like it better? Can you link a tutorial?
I'm trying to make prettier graphs with my Lefse results, but it's a slow process for a beginner :slight_smile:

1 Like

I like this comparison of DA tests:

And, of course, this one:

Besides, Lefse should be installed into different environment and one will need to export feature table in Qiime2, add metadata information like class and subclass and then switch to the other environment, convert it to the right format for lefse and then run lefse.

I think Ancom-BC is more appropriate for compositional microbiome data, have an advantage of formulas for multiple factors and is better for comparisons with more than two levels.

Best,

2 Likes

Great, thank you so much for sharing this! I'll add it to my "To Learn" list :slight_smile:

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.