How to obtain machine-readable total observation counts?

Hello there!

I have recently started working with QIIME2 and so far the docs and forum have been great resources to resolve the problems that I encountered. However, now I seem to be stuck on a new problem that hopefully someone can help with.

I want to filter potential crosstalk features from my feature table before running further analysis. I have observed that ASVs originating from crosstalk between barcodes/samples on average have <0.05% relative frequency in a given sample in my sequencing setup. So I want to apply a 0.05% abundance filter to each sample in my feature table as a way of dealing with cross-contamination.
qiime feature-table filter-features only filters features by absolute frequency, but of course the total feature counts differ among my samples. If the feature table were rarefied to a fixed depth, it would be straightforward to calculate 0.05% of that depth and apply qiime feature-table filter-features --p-min-frequency to the whole table. But I don’t want to rarefy because of the drawbacks of rarefaction that have been discussed on this forum many times. So I decided to do the following, all in a bash script (currently the only scripting language I am somewhat proficient with): split the feature table into separate tables for each sample, export the biom file, make a biom summary, extract the total number of observations (3rd line of the biom file summary), use that to calculate the 0.05% threshold for each sample, filter each table individually (qiime feature-table filter-features --p-min-frequency), and then merge the filtered tables.
The problem is that the number in the biom file summary contains thousands separators (e.g. Total count: 127 524), and I can’t get bash to recognize it as a number because of the gap between the groups of digits. Is there a built-in method in QIIME2, or some hack that anyone knows of, to get the total feature count of each sample programmatically and without this formatting? I know I can read it from the feature table summary visualizations and type it in manually, but that will not work when I have hundreds of samples.

If you wanted to preserve your existing script, you could try to strip the internal whitespace in the number; see the Stack Overflow question "How to trim whitespace from a Bash variable?".

You could use the QIIME 2 Artifact API to get this information:

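import qiime2
import pandas as pd

table_fp = '/path/to/my/table.qza'
table = qiime2.Artifact.load(table_fp)
# View the feature table as a DataFrame: rows are samples, columns are features.
df = table.view(pd.DataFrame)
# Summing across the columns of each row gives one total per sample.
total_observations_per_sample = df.sum(axis='columns')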

The contents of total_observations_per_sample look like:

L1S105    7788.0
L1S140    7163.0
L1S208    8162.0
L1S257    6405.0
L1S281    6630.0
L1S57     8716.0
L1S76     7871.0
L1S8      7037.0
L2S155    3932.0
L2S175    4386.0
L2S204    3161.0
L2S222    3187.0
L2S240    5061.0
L2S309    1419.0
L2S357    2373.0
L2S382    4089.0
L3S242     898.0
L3S294    1225.0
L3S313    1103.0
L3S341     953.0
L3S360     971.0
L3S378    1249.0
L4S112    8340.0
L4S137    9820.0
L4S63     9744.0
L5S104    2227.0
L5S155    1800.0
L5S174    1953.0
L5S203    2112.0
L5S222    2525.0
L5S240    1792.0
L6S20     6857.0
L6S68     5982.0
L6S93     6953.0
dtype: float64
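
Building on per-sample totals like those above, the 0.05% per-sample filter described in the question could be sketched directly in pandas, without splitting and merging tables. This is only an illustration under assumptions: the toy table, its sample and feature IDs, and the MIN_REL_FREQ name are made up, and the layout (samples as rows, features as columns) is assumed to match what table.view(pd.DataFrame) returns.

```python
import pandas as pd

# Toy feature table (hypothetical IDs and counts), laid out like
# table.view(pd.DataFrame): rows are samples, columns are features.
df = pd.DataFrame(
    {
        "ASV1": [9000.0, 500.0],
        "ASV2": [996.0, 499.0],
        "ASV3": [4.0, 1.0],
        "ASV4": [0.0, 0.0],
    },
    index=["sampleA", "sampleB"],
)

MIN_REL_FREQ = 0.0005  # 0.05%, the cutoff described in the question

# One total and one absolute threshold per sample.
totals = df.sum(axis="columns")
thresholds = totals * MIN_REL_FREQ

# Print the totals as plain integers, one "sample<TAB>count" pair per line,
# which a bash script can read without any thousands separators.
for sample_id, total in totals.astype(int).items():
    print(f"{sample_id}\t{total}")

# Zero out every count that falls below its own sample's threshold.
filtered = df.where(df.ge(thresholds, axis="index"), 0.0)

# Drop features that are now absent from every sample.
filtered = filtered.loc[:, filtered.sum(axis="index") > 0]
```

In this toy table, ASV3 is zeroed in sampleA (4 of 10000 reads is below 0.05%) but kept in sampleB (1 of 1000 reads is above it), and ASV4 is dropped because it is empty everywhere. The filtered DataFrame can probably be turned back into an artifact with qiime2.Artifact.import_data('FeatureTable[Frequency]', filtered), but that step is untested here.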

Hope that helps!


Thank you very much @thermokarst!

I was able to tweak the Python solution that you suggested and embed it into my bash script to get a working solution. As for whitespace removal in bash, the gap between numbers appears to be some unknown character, not the regular space, so this solution did not work and I was unable to determine what character it actually is.
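
For anyone hitting the same separator problem, one way to find out what the character is, and to parse the number regardless, is sketched below. The summary line is a made-up example, and U+202F (narrow no-break space) is only an assumption for illustration; the character in a real biom summary may well be a different one.

```python
import re
import unicodedata

# Hypothetical summary line. The separator is written as an escape so it is
# visible; U+202F is an assumed example, not necessarily what biom emits.
line = "Total count: 127\u202f524"

# Report the code point and Unicode name of every non-ASCII character.
for ch in line:
    if not ch.isascii():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")

# Take the text after the colon and keep only the ASCII digits, so the
# result parses no matter which separator character was used.
value = line.split(":", 1)[1]
total = int(re.sub(r"[^0-9]", "", value))
print(total)
```

Dropping everything that is not a digit avoids having to know the separator in advance, at the cost of also discarding a decimal point or minus sign if the field ever contained one.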
