Feature-table filter-samples removes taxonomy from biom table

I noticed that when we have a biom with taxonomy and it gets converted to a qza, the taxonomic information stays in the biom. However, if that qza is filtered, the taxonomic information is erased.

Here is an example - note the "Observation Metadata Categories: taxonomy" field in the summaries:

(qiime2-2017.8) 13:29:28 11175$ curl -o biom.biom https://raw.githubusercontent.com/biocore/qiime/master/qiime_test_data/filter_otus_from_otu_table/otu_table.biom
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69265  100 69265    0     0   121k      0 --:--:-- --:--:-- --:--:--  339k
(qiime2-2017.8) 13:29:39 11175$ curl -o map.txt https://raw.githubusercontent.com/biocore/qiime/master/qiime_test_data/validate_mapping_file/Fasting_Map.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   982  100   982    0     0   6538      0 --:--:-- --:--:-- --:--:--  6590
(qiime2-2017.8) 13:29:57 tmp$ ls
biom.biom map.txt
(qiime2-2017.8) 13:29:57 tmp$ biom convert -i biom.biom -o hdf5.biom --table-type="OTU table" --to-hdf5
(qiime2-2017.8) 13:30:03 tmp$ biom summarize-table -i hdf5.biom
Num samples: 9
Num observations: 417
Total count: 1333
Table density (fraction of non-zero values): 0.168

Counts/sample summary:
 Min: 146.0
 Max: 150.0
 Median: 148.000
 Mean: 148.111
 Std. dev.: 1.449
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
PC.481: 146.0
PC.355: 146.0
PC.636: 147.0
PC.635: 148.0
PC.354: 148.0
PC.593: 149.0
PC.607: 149.0
PC.356: 150.0
PC.634: 150.0
(qiime2-2017.8) 13:30:06 tmp$ qiime tools import --input-path hdf5.biom --output-path biom.qza --type "FeatureTable[Frequency]"
(qiime2-2017.8) 13:30:19 tmp$ qiime feature-table filter-samples --p-where 'Treatment = "Control"' --m-metadata-file map.txt --i-table biom.qza --o-filtered-table filtered.qza
Saved FeatureTable[Frequency] to: filtered.qza
(qiime2-2017.8) 13:30:46 tmp$ qiime tools export filtered.qza --output-dir filtered
(qiime2-2017.8) 13:30:53 tmp$ biom summarize-table -i filtered/feature-table.biom
Num samples: 5
Num observations: 240
Total count: 739
Table density (fraction of non-zero values): 0.281

Counts/sample summary:
 Min: 146.0
 Max: 150.0
 Median: 148.000
 Mean: 147.800
 Std. dev.: 1.600
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
PC.355: 146.0
PC.481: 146.0
PC.354: 148.0
PC.593: 149.0
PC.356: 150.0

While putting the example together, I found that:

  • if I don't convert the JSON biom table to hdf5, the import command errors with: ValueError: Failed to sniff InPath('biom.biom') as BIOMV210Format. Note that the JSON file is a valid biom file - it was downloaded from QIIME.
  • by mistake I first used feature-table filter-features instead of feature-table filter-samples, which, as expected, yielded an empty table. Perhaps it would be better to raise an error, or at least a warning?
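
To illustrate the second point, here is a minimal Python sketch of the kind of check that could turn a silently empty result into an error. It is not QIIME 2 code: `select_ids_or_fail` and its parameters are hypothetical names made up for this example, the feature IDs are invented, and it only assumes that filtering boils down to intersecting the IDs on one axis of the table with the IDs selected from the metadata.

```python
def select_ids_or_fail(axis_ids, metadata_ids, axis_name):
    """Return the IDs shared by the table axis and the metadata.

    Hypothetical helper, not part of QIIME 2: it raises instead of
    silently returning an empty selection.
    """
    shared = set(axis_ids) & set(metadata_ids)
    if not shared:
        raise ValueError(
            f"None of the metadata IDs match any {axis_name} IDs in the table; "
            "was the metadata meant for the other axis?"
        )
    return shared


sample_ids = ["PC.354", "PC.355", "PC.356"]
feature_ids = ["OTU1", "OTU2"]            # made-up feature IDs
mapping_file_ids = ["PC.354", "PC.355"]   # sample IDs selected from the mapping file

# Filtering the sample axis with sample metadata finds matches.
print(sorted(select_ids_or_fail(sample_ids, mapping_file_ids, "sample")))

# Filtering the feature axis with sample metadata matches nothing, so it raises.
try:
    select_ids_or_fail(feature_ids, mapping_file_ids, "feature")
except ValueError as err:
    print(f"error: {err}")
```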

Thanks!

Hi @antgonza!

Looks like you found a bug! Our general strategy with BIOM tables is to normalize them on read/write, so that these biom metadata fields are stripped out (more on that below). It looks like this normalization is skipped when importing BIOMV210Format files (the normalization is applied when importing BIOMV100Format, though).

As far as "stripping out biom metadata", the idea here is that we can represent these data using other QIIME 2 semantic types, for example the taxonomy metadata can be represented as FeatureData[Taxonomy]!

We support importing these types of "fat" biom tables in QIIME 2 by running two (or more) separate import commands:

$ qiime tools import \
  --input-path hdf5.biom \
  --output-path feature-table.qza \
  --type "FeatureTable[Frequency]"
$ qiime tools import \
  --input-path hdf5.biom \
  --output-path taxonomy.qza \
  --source-format BIOMV210Format \
  --type "FeatureData[Taxonomy]"
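
To make the "one fat table, two artifacts" idea concrete, here is a rough Python sketch that splits a JSON-style (BIOM 1.0) table, loaded as a plain dict, into a table without observation metadata and a separate feature-ID-to-taxonomy mapping. This is not how QIIME 2 implements the import; `split_fat_table` is a hypothetical helper, the tiny table is made up, and joining the ranks with "; " is an assumption made for readability.

```python
import copy


def split_fat_table(biom_json):
    """Return (table without observation metadata, {feature id: taxonomy}).

    Hypothetical helper operating on a BIOM 1.0 style dict; the input
    is left unmodified.
    """
    stripped = copy.deepcopy(biom_json)
    taxonomy = {}
    for row in stripped["rows"]:
        metadata = row.get("metadata") or {}
        taxon = metadata.get("taxonomy")
        if taxon is not None:
            # Taxonomy is commonly stored as a list of ranks; join it
            # into one string (separator chosen for this example only).
            if isinstance(taxon, list):
                taxon = "; ".join(taxon)
            taxonomy[row["id"]] = str(taxon)
        # Drop all observation metadata from the table copy.
        row["metadata"] = None
    return stripped, taxonomy


# Made-up two-feature, one-sample table in BIOM 1.0 layout.
fat_table = {
    "rows": [
        {"id": "OTU1", "metadata": {"taxonomy": ["k__Bacteria", "p__Firmicutes"]}},
        {"id": "OTU2", "metadata": None},
    ],
    "columns": [{"id": "PC.354", "metadata": None}],
    "matrix_type": "sparse",
    "shape": [2, 1],
    "data": [[0, 0, 5], [1, 0, 3]],
}

table_only, taxonomy = split_fat_table(fat_table)
print(taxonomy)
```

The counts in `data` are untouched; only the per-feature annotations move out of the table, which mirrors the two separate imports above.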

The ValueError you saw when importing the JSON table occurs because the default QIIME 2 source format for FeatureTable[Frequency] is BIOMV210Format, which isn't compatible with your JSON-style BIOM table. You can import JSON-variant BIOM files by specifying the source format as BIOMV100Format:

$ qiime tools import \
  --input-path biom.biom \
  --output-path feature-table.qza \
  --source-format BIOMV100Format \
  --type "FeatureTable[Frequency]"
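
If you need to decide between the two source formats in a script, a rough Python sketch like the following can tell the file variants apart by their leading bytes. It is only an illustration, not how QIIME 2 sniffs formats: `guess_biom_source_format` and `read_head` are hypothetical helpers, and the sketch assumes an HDF5 file starts with the standard 8-byte HDF5 signature and a JSON table starts with `{`.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # standard 8-byte HDF5 file signature


def read_head(path, size=len(HDF5_SIGNATURE)):
    """Read the first few bytes of a file."""
    with open(path, "rb") as handle:
        return handle.read(size)


def guess_biom_source_format(head):
    """Guess a --source-format value from a file's leading bytes.

    Hypothetical helper: returns None when neither variant is recognized.
    """
    if head.startswith(HDF5_SIGNATURE):
        return "BIOMV210Format"  # HDF5-based BIOM
    if head.lstrip().startswith(b"{"):
        return "BIOMV100Format"  # JSON-based BIOM
    return None


print(guess_biom_source_format(b'{"id": null, "format": "..."}'))
print(guess_biom_source_format(HDF5_SIGNATURE + b"\x00" * 8))
```

With real files this would be called as `guess_biom_source_format(read_head("biom.biom"))`.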

This issue just came up last week in an internal discussion; the open issue can be found here.

Thanks!

Got it, thanks! Just FYI, I wasn’t sure about the intended behavior, so I wasn’t sure whether the bug was that the taxonomies were removed or kept; sounds like the latter.

Now, the idea of the biom-table is to store both sample and observation metadata so I’m not sure stripping/normalizing is the best move. I guess the idea of having all (table, qiime mapping file, taxonomies) in a single file is to avoid proliferation of files.

We had some discussion offline and in a recent developer call about including taxonomy in .biom files in QIIME 2, but have decided that that is not something we'll support. Here's the justification I gave @antgonza about this (there is some overlap with other content in this topic so just noting that the following wasn't originally written as a reply to this post, but we decided it might help to share this here):

@antgonza, following up on your question about supporting taxonomy in biom files in QIIME 2. The short answer is that we’re not going to support this in the FeatureTable semantic types. The reason is that this violates the core idea of the semantic type system, as the FeatureTable semantic type would no longer unambiguously describe a type of data (it would mean either a feature table, or a feature table with taxonomic annotations). This means that plugin developers couldn’t be sure what they were getting when they request a FeatureTable, which would set us up for a lot of QIIME 1-like problems (e.g., users getting tracebacks from methods, rather than error messages that can provide detail about what they did incorrectly and how to correct it). That becomes a problem that basic users probably can’t solve on their own (they need to post to the forum). There are a lot of other reasons to keep these data separate - we mentioned a few of these on the call. Your use case (as I understand it, avoiding having to add taxonomic information to a biom file following export from QIIME 2, when you’re developing an automated bioinformatics workflow that uses QIIME 2 and other tools) is straightforward for an advanced user who would be developing that type of system (at worst, it’s 1-2 extra commands in your code).

Note that in your own plugins, you’re free to define a new semantic type (e.g., FeatureTableWithTaxonomy), which you could use in methods in that plugin or plugins that depend on it. That would no longer violate the idea of the semantic type system, since the type you define would unambiguously describe a type of data.

Also, I mentioned on the call that we support importing FeatureData[Taxonomy] from a biom file that has that information. You can do that as follows:

qiime tools import \
  --input-path my-file.biom \
  --output-path my-taxonomy.qza \
  --source-format BIOMV210Format \
  --type "FeatureData[Taxonomy]"

Thanks @gregcaporaso.
