What's formula of The Retained Features number?

Crunk · May 15, 2023, 6:15am

Hi, I'm a new QIIME2 user. For more understanding of the Feature Table, I extracted the draw data (feature-table.biom file) come from summarize step. And try to summarize by myself in Python to comprehend how Retained Features are calculated.

The input file:
data.csv (228.4 KB)
The summary file:
ftable.qzv (468.4 KB)

Everything is smooth at the beginning.

Frequency per sample

Code

import pandas as pd
table_data = pd.read_csv('data.csv', index_col=0)
def countFeature(lista):
    count = 0
    for ele in lista:
        if ele > 0:
            count += 1
    return count

FixList = lambda lista: [i for l in lista for i in l]

CountData = lambda features, table_data: [countFeature(table_data[table_data.index==feature].transpose().astype('float64').values) for feature in features]
SumData = lambda features, table_data: [sum(FixList(table_data[table_data.index==feature].transpose().astype('float64').values)) for feature in features]

def MakeTable(features, table_data):
    count_values = CountData(features, table_data)
    sum_values = SumData(features, table_data)

    frePerFeature = pd.DataFrame(zip(sum_values, count_values), index=features, columns=['Frequency', 'Count'])
    # print(frePerFeature)
    return frePerFeature

headers = table_data.columns
sum_values = [sum(table_data[header]) for header in headers]
count_values = [countFeature(table_data[header].astype('float64').values) for header in headers]

frePerSample = pd.DataFrame(zip(sum_values, count_values), index=headers, columns=['Frequency', 'Count'])
print(frePerSample)
print(frePerSample.describe())
print(frePerSample.median())

Output:

                                      Frequency  Count
0bb5e24a.aa34.48e7.b1f9.e761ac2dc6b4     6141.0     89
101c02ce.4f8d.4394.83a3.0c406831d934     6473.0     83
104e5902.1c3e.417f.bb22.124bfca61a61     7040.0     85
11c4be47.fdce.4135.8acf.0894f2da5ede     2949.0     41
11c70d46.e0a4.4797.ad7b.3677fe93d73b     4807.0     33
...                                         ...    ...
ea68ac1a.53e3.4523.924f.32ca1d2ac988     3921.0     30
eb4ece61.6afe.4c88.8a06.d616dab1dd34     2193.0     77
fa4e5dcc.7b0a.4fb1.989d.b19aec98e047     4847.0     69
fa6fda2b.4a99.4231.9325.0d74d8ae451c     4168.0     72
fadb8582.a342.46ba.8926.98ff97240824     4284.0     84

[116 rows x 2 columns]

        Freequency       Count
count   116.000000  116.000000
mean   3921.198276   63.741379
std    1752.358208   20.189158
min     876.000000   16.000000
25%    2601.250000   51.000000
50%    3758.000000   64.500000
75%    4832.250000   77.000000
max    8273.000000  102.000000
Freequency    3758.0
Count           64.5
dtype: float64

Frequency per Features

Code

features = table_data.index
frePerFeature = MakeTable(features, table_data)
print(frePerFeature)
print(frePerFeature.describe())
print(frePerFeature.median())

Output

                                  Frequency  Count
002e78333d6cf2b11aa7a5ba03dd2c68       59.0      3
0046913ae6f9e12dbd889671ed26c09d        5.0      1
009c3d1fd56cf5682c875f959d9fee33       63.0      4
00c7ff8350c1f1874014b473b1890bf3       17.0      1
014ed0286a2651609cc4a145b7ea6788       71.0      1
...                                     ...    ...
fe05bf03bd86129ba77552ce64c521a0        9.0      1
fe7577ac70fe20955ccd2fb1a3847a59        7.0      1
fe89f7b053b400e066cbe98313774360       16.0      3
febf051ad9dbae4bb7247b25f64e7773        5.0      1
fee475b024cb0ce7531b21360168e790        9.0      1

         Frequency       Count
count    788.000000  788.000000
mean     577.232234    9.383249
std     3654.757368   17.151550
min        2.000000    1.000000
25%        8.000000    1.000000
50%       35.500000    2.000000
75%      229.750000    9.000000
max    93838.000000  103.000000
Frequency    35.5
Count         2.0
dtype: float64

The above stuff corresponds Overview tab and these are correct. But when calculating the Retained Features I failed. In my opinion, the Retained Features number is the sum of the Frequency of the remaining sample. But the result is wrong. If someone elucidates me, I'll very much appreciate it.

Interactive Sample detail

Code

sample_depth = 1206

retain_sample = frePerSample[frePerSample['Frequency']>=sample_depth]

num_retain_sample = len(retain_sample)
print(f'The number of retained samples: {num_retain_sample}')

retained_num = retain_sample['Frequency'].sum()
print(f'The number of retained features: {num_retain_sample}')

Output

The number of retained samples:  114                    # Correct
The number of retained features: 452833.0             # Wrong

Crunk · May 15, 2023, 8:54am

I have just found out my answer. Because after the subsampling step, all sample's frequencies will equal each other and equal the Sample Depth, so the right final part is:

Interactive

Code

sample_depth = 1206

retain_sample = frePerSample[frePerSample['Frequency']>=sample_depth]
num_retain_sample = len(retain_sample)
print(f'The number of samples remain: {num_retain_sample}')
print(f'The number of samples remain: {sample_depth*num_retain_sample}')

Output

The number of samples remain: 114
The number of samples remain: 137484

timanix · May 15, 2023, 8:57am

Hi and welcome to the forum!
Glad that you already found answer at your question. And thank you for sharing the solution with the community ! It can help other users as well.

Crunk · May 15, 2023, 2:35pm

Complement

This graph shows the dependence of the Percentage of Retained Samples and Retained Features according to the Sample Depth.

Code

num_sample = len(frePerSample)
num_feature = len(frePerFeature)
max_depth = max(frePerSample['Frequency'])
total_fre = sum(frePerSample['Frequency'])

def SampleDepth(sample_depth, frePerSample):
    retain_sample = frePerSample[frePerSample['Frequency']>=sample_depth]
    num_retain_sample = len(retain_sample)*100/num_sample
    num_retain_feature = sample_depth*num_retain_sample*100/total_fre
    return num_retain_feature, num_retain_sample

import numpy as np

depths = list(range(int(max_depth)+1))

retain_feature = [] 
retain_sample = []
for depth in depths:
    num1, num2 = SampleDepth(depth, frePerSample)
    # print(num1, num2)
    retain_feature.append(num1)
    retain_sample.append(num2)

import matplotlib.pyplot as plt
plt.figure(figsize=(15,8))
plt.plot(depths, retain_feature, label='Percent of features');
plt.plot(depths, retain_sample, color='y', label='Percent of samples');
plt.xlabel('Sample Depth')
plt.xticks(list(range(0,int(max_depth)+1,1000)))
plt.ylabel('Percentage')
plt.ylim([0,100])
plt.xlim([0,int(max_depth)+1])
plt.legend()
plt.show()