Querying for public microbiome data in Qiita using redbiom

redbiom (canonically pronounced “red biome”) is a caching layer that facilitates the rapid retrieval of existing processed microbiome datasets. It allows for querying the presence of exact sequence variants, retrieving sample data in BIOM format and sample metadata in TSV, searching for samples by metadata, and more. These queries are fast because the cache is built on Redis, an in-memory key-value database. By default, redbiom issues queries against Qiita. The project takes its name from its foundations, Redis and BIOM-Format. More information about redbiom can be found on GitHub and in our mSystems article.

tl;dr: redbiom can be used to rapidly search for and obtain processed data and metadata from existing studies.

In the tutorial below, we’ll first work through some of the concepts in redbiom, and then show some practical examples, including retrieving all of the sample data and metadata from an existing study. redbiom can be used on the command line or through its Python API. In this tutorial, we’ll focus on command line use.

In brief, redbiom allows for:

  • finding samples that contain a specific feature or set of features
  • finding samples by arbitrary metadata searches
  • summarizing samples over a metadata category
  • retrieval of sample data in BIOM-Format
  • discovering metadata categories that exist in the cache
  • pulling out sample data from different processing types (e.g., search for closed reference, retrieve Deblur)

By default, the features from Qiita that are produced by Deblur are represented as the actual sequence variants, and an example at the end of the tutorial shows how these can be extracted for downstream use (e.g., fragment insertion and phylogenetic analyses).


redbiom is not part of the default QIIME 2 distribution. It can be easily installed by running pip install redbiom or by running conda install -c conda-forge redbiom. More information can be found in the Install section of the README.

Technical and biological replicates

redbiom is designed to handle biological and technical replicates. Specifically, it allows for a one-to-many relationship between a sample’s metadata and its data. To support this, sample IDs are tagged when loaded into the cache so that sample data can be differentiated between preparations. Internally, redbiom keeps track of these relationships, and it reports “ambiguous” samples when writing out BIOM tables and mapping files. An ambiguous association is one where the same physical sample is associated with multiple processing runs, as would happen with technical replicates.

Command line help

redbiom can be explored on the command line, and uses a nested command line structure similar to QIIME 2.

$ redbiom --help
Usage: redbiom [OPTIONS] COMMAND [ARGS]...

  --version  Show the version and exit.
  --help     Show this message and exit.

  admin      Update database, etc.
  fetch      Sample data and metadata retrieval.
  search     Feature and sample search support.
  select     Select items based on metadata
  summarize  Summarize things.

Many of the redbiom commands are designed to consume data over standard input, and to dump data to standard output, allowing commands to be “piped” together. An example of this type of operation is at the end of the tutorial.


The first step with redbiom is to determine which context or contexts you’d like to use. redbiom organizes data into “contexts” to group data processed in a common way. For example, one context might be composed of Illumina 16S v4 data processed by closed reference OTU picking against Greengenes. The motivation for contexts is to group similar data together in order to reduce biases when comparing results. Data are loaded into contexts, and searches for samples by feature happen within contexts.

It is handy to list the available contexts (note that the context names are not assured to be stable at this time). In the first column of output, we have the context name which is hopefully human readable. In the second column, we can find the number of samples represented in the context, and the third column describes the number of features. The fourth column is a description that is presently unused.

As an example, the first context listed below is "Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1". The human interpretation is that the samples in this context were processed by closed reference OTU picking against Greengenes, all of the samples were run on an Illumina platform, all targeted the v4 region, and all of the sequences were trimmed to 100 nucleotides. (The “a243a1” is an arbitrary tag that can be ignored.)

$ redbiom summarize contexts
ContextName	SamplesWithData	FeaturesWithData	Description
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1	129596	74983	Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-41ebc6	3034	24839	Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-18S-v9-150nt-bd7d4d	153	72	Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v45-100nt-a243a1	22	8178	Qiita context
Deblur-NA-illumina-16S-v4-90nt-99d1d8	119538	4460311	Qiita context
Pick_closed-reference_OTUs-Greengenes-titanium-16S-v46-90nt-44feac	215	4328	Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v4-90nt-44feac	116	3109	Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v6-8-150nt-bd7d4d	110	4985	Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-100nt-a243a1	3035	24833	Qiita context

$ redbiom summarize contexts | wc -l
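Because the listing is tab-separated, ordinary shell tools can narrow it down. The sketch below is one possible way to keep only contexts whose name starts with a given prefix and to order them by sample count, largest first. The name top_contexts is a hypothetical helper, not part of redbiom, and the sample rows are copied from the listing above so the block runs without a redbiom install.

```shell
# Hypothetical helper (not part of redbiom): read the tab-separated output
# of `redbiom summarize contexts` on stdin, keep contexts whose name starts
# with the given prefix, and sort by SamplesWithData (column 2), descending.
top_contexts() {
    prefix="$1"
    awk -F'\t' -v p="$prefix" 'NR > 1 && index($1, p) == 1 { print $1 "\t" $2 }' |
        sort -t "$(printf '\t')" -k2,2nr
}

# Stand-in rows copied from the listing above. With redbiom installed this
# could instead be: redbiom summarize contexts | top_contexts Pick
printf '%s\t%s\t%s\t%s\n' \
    ContextName SamplesWithData FeaturesWithData Description \
    Pick_closed-reference_OTUs-Greengenes-illumina-18S-v9-150nt-bd7d4d 153 72 'Qiita context' \
    Deblur-NA-illumina-16S-v4-90nt-99d1d8 119538 4460311 'Qiita context' \
    Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-41ebc6 3034 24839 'Qiita context' |
    top_contexts Pick
```

The header line is skipped with NR > 1, so the helper assumes the first row is always the column header, as it is in the output shown above.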

Any sample data query to redbiom must specify a context in order to obtain the desired data, so it is useful to define an environment variable. Let’s use the Deblur context for Illumina 16S v4 data trimmed at 150 nucleotides:

export CTX=Deblur-Illumina-16S-V4-150nt-780653
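Since context names are not assured to be stable, it can be worth confirming that a name is still present before running longer queries. The following is a minimal sketch of such a check; context_exists is a hypothetical helper, and the stand-in listing reuses a row from the output above rather than live data.

```shell
# Hypothetical helper (not part of redbiom): succeed only if the context
# name given as $1 appears exactly in the first column of the
# `redbiom summarize contexts` output supplied on stdin.
context_exists() {
    awk -F'\t' -v ctx="$1" 'NR > 1 && $1 == ctx { found = 1 } END { exit !found }'
}

# Stand-in listing built from a row shown earlier in this tutorial.
listing="$(printf '%s\t%s\t%s\t%s\n' \
    ContextName SamplesWithData FeaturesWithData Description \
    Deblur-NA-illumina-16S-v4-90nt-99d1d8 119538 4460311 'Qiita context')"

name=Deblur-NA-illumina-16S-v4-90nt-99d1d8
if printf '%s\n' "$listing" | context_exists "$name"; then
    echo "context found: $name"
else
    echo "unknown context: $name (re-check the listing)" >&2
fi

# With redbiom installed, the same check against live data would be:
#   redbiom summarize contexts | context_exists "$CTX"
```

The comparison is an exact string match, so a change in capitalization in a context name counts as a different context.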

Fetching samples based on the features they contain, metadata, and more

We will be downloading several files, so before we go further, let’s create a directory to contain them.

mkdir querying-redbiom
cd querying-redbiom

First, let’s take a look at how many indexed samples also have the “qiita_empo_3” metadata category. This category represents the Earth Microbiome Project ontology, and includes the data from the EMP manuscript as well as EMPO3 category values inferred, where possible, from existing sample data (e.g., some samples had metadata describing them as fecal, so those samples were inferred to have an “Animal distal gut” EMPO3 value). redbiom allows us to investigate and summarize the available metadata in a category:

$ redbiom summarize metadata-category --category "qiita_empo_3" --counter
Category value	count
Hypersaline (saline)	13
Mock community	110
Surface (saline)	123
Sediment (non-saline)	442
Plant surface	532
Plant rhizosphere	676
Plant corpus	678
Sediment (saline)	705
Animal proximal gut	1194
Water (saline)	1505
anthropogenic sample	1652
Sterile water blank	2544
Water (non-saline)	4189
Animal secretion	6507
Soil (non-saline)	7376
Surface (non-saline)	8277
Animal corpus	12616
Animal surface	12625
Animal distal gut	36519
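The counter output is a two-column, tab-separated table, so it is easy to post-process. As a rough sketch, the hypothetical helper below (category_shares, not a redbiom command) converts the counts into percentages of the rows it is given. The two example rows are taken from the output above, so the percentages it prints are relative to those two rows only, not to the full category.

```shell
# Hypothetical helper (not part of redbiom): read "value<TAB>count" rows
# (with a header on line 1) from stdin and print each value with its
# percentage of the summed counts. Assumes at least one non-zero count.
category_shares() {
    awk -F'\t' 'NR > 1 { name[NR] = $1; n[NR] = $2; total += $2 }
        END { for (i = 2; i <= NR; i++)
                  printf "%s\t%.1f%%\n", name[i], 100 * n[i] / total }'
}

# Two rows copied from the output above; with redbiom installed the real
# command output could be piped in instead.
printf 'Category value\tcount\nAnimal surface\t12625\nAnimal distal gut\t36519\n' |
    category_shares
```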

Now let’s have some fun. We can search for all of the samples containing a given set of features. For instance, let’s start by defining a set of interesting sequence variants:

$ cat > features << END

We can now search for all samples which contain these same features (one or all is supported), and summarize the found samples by their metadata category. In this example, we found at least 10 samples which contain our features of interest, 10 of those samples have the “qiita_empo_3” metadata category described, and all 10 of those samples report as “Animal distal gut” within the “qiita_empo_3” metadata category:

$ redbiom summarize features --from features --context $CTX --category qiita_empo_3
Animal distal gut	10

Total samples	10

We can also get the sample identifiers back:

$ redbiom search features --from features --context $CTX > samples
$ cat samples

IMPORTANT: The identifiers above are delimited by an underscore ("_"), a reserved character for redbiom. IDs of this form in redbiom are used internally so that sample ambiguities can be tracked. The structure is <tag>_<sample-id>. So for example, “31192_1448.HCO07” has the tag “31192”, and the actual sample identifier is “1448.HCO07”. This will become important in a moment when we obtain sample data and metadata.
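To make the <tag>_<sample-id> structure concrete, here is a small sketch that splits an internal ID on its first underscore. The function name split_redbiom_id is hypothetical, and splitting on the first underscore is an assumption based on the structure described above, where the tag comes first.

```shell
# Hypothetical helper: split a redbiom-internal ID of the form
# <tag>_<sample-id> on the FIRST underscore, using POSIX parameter expansion.
split_redbiom_id() {
    id="$1"
    tag="${id%%_*}"        # everything before the first underscore
    sample_id="${id#*_}"   # everything after the first underscore
    printf '%s\t%s\n' "$tag" "$sample_id"
}

# Prints the tag (31192) and the sample ID (1448.HCO07), tab-separated.
split_redbiom_id 31192_1448.HCO07
```

An ID with no underscore passes through unchanged in both fields, so the helper is only meaningful for identifiers that redbiom has tagged.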

We could also restrict our observed samples to only those associated with humans by selecting host_taxid 9606:

$ redbiom select samples-from-metadata --from samples --context $CTX "where host_taxid==9606"

Finally we can fetch the data for a set of samples across all features. Note that we’re pulling all of the samples we found, not just the ones restricted to humans.

$ redbiom fetch samples --from samples --context $CTX --output data.biom
1 sample ambiguities observed. Writing ambiguity mappings to: data.biom.ambiguities

And let’s quickly take a look at the data we obtained. Take note of the sample identifiers: these are not structured like the internal redbiom IDs, but are instead reordered into the form <sample-id>.<tag>.

$ biom summarize-table -i data.biom
Num samples: 11
Num observations: 4,373
Total count: 226,230
Table density (fraction of non-zero values): 0.117

Counts/sample summary:
 Min: 6,214.000
 Max: 37,983.000
 Median: 20,013.000
 Mean: 20,566.364
 Std. dev.: 10,268.189
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
10317.000020624.30756: 6,214.000
10317.000005692.27052: 7,931.000
10317.000053457.29082: 9,267.000
10508.BC114.6.fec.30968: 11,789.000
1924.Sadowsky.19.26743: 17,564.000
1448.HCO07.31192: 20,013.000
10317.000014551.27507: 28,102.000
2086.1325901136.36211: 28,873.000
2086.1325901136.28104: 28,873.000
10323.P51Y69.F.0911C.30841: 29,621.000
10323.O41G42.F.0611A.30841: 37,983.000

When we obtained the data, redbiom reported a single ambiguity. Any ambiguities observed are stored in a JSON object. The interpretation here is that the physical sample “2086.1325901136” corresponds to two samples with data, possibly biological or technical replicates. Both of the samples are represented in our output BIOM table (it appears the sample may have been duplicated across preparations, as the read count is identical).

$ cat data.biom.ambiguities 
{"2086.1325901136": ["28104_2086.1325901136", "36211_2086.1325901136"]}

Ambiguities arise where multiple data sets exist for the same physical sample, for instance where the same sample was run on multiple preparations (e.g., resequencing). The details are saved in the .ambiguities file.
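The same information can be recovered from the sample IDs in the output table. The sketch below assumes that the tag is always the final dot-delimited field of a <sample-id>.<tag> identifier, as described above; ambiguous_samples is a hypothetical helper name.

```shell
# Hypothetical helper: read BIOM-style IDs (<sample-id>.<tag>) on stdin,
# drop the final dot-delimited field (assumed to be the tag), and print
# each sample ID that occurs under more than one tag.
ambiguous_samples() {
    sed 's/\.[^.]*$//' | sort | uniq -d
}

# Three IDs copied from the table summary above; prints 2086.1325901136.
printf '%s\n' 1448.HCO07.31192 2086.1325901136.36211 2086.1325901136.28104 |
    ambiguous_samples

# With the biom CLI available, the sample IDs could be piped in directly:
#   biom table-ids -i data.biom | ambiguous_samples
```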

Fetching metadata

In addition to the sample data, we can also fetch the metadata for a set of samples:

$ redbiom fetch sample-metadata --from samples --context $CTX --output metadata
1 sample ambiguities observed. Writing ambiguity mappings to: metadata.ambiguities
$ cat metadata
#SampleID	body_habitat	body_product	body_site	description	dna_extracted	elevation	env_biome	env_feature	env_package	host_subject_id	host_taxid	latitude	longitude	physical_specimen_location	physical_specimen_remaining	qiita_study_id	sample_type	scientific_name
1924.Sadowsky.19.26743	UBERON:feces	UBERON:feces	UBERON:feces	Day 17 CD1	TRUE	443.82907	urban biome	human-associated habitat	human-gut	CD1	9606	46.72955	-94.6859	UCSD	TRUE	1924	Stool	human gut metagenome
10317.000005692.27052	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	229.4	dense settlement biome	human-associated habitat	human-gut	5a3021c536bd4597726ed37b71d8821ff6a389fa433510b51d2b03fa812fb0f41aac2eb345c7d63554842689d21efc440512f1841f1564543727fc20a6314aaf	9606	-36.8	144.3	UCSDMI	true	10317	Stool	human gut metagenome
10317.000014551.27507	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	78.6	dense settlement biome	human-associated habitat	human-gut	4a41054eaea250cae18ad8e1afbec0711398cfc74cd28af327b61e5aeeabd00ee7f67e08232b72becb27092e107f12da18ae9c9b373a9d001078643ef2ab8811	9606	41.2	-73.9	UCSDMI	true	10317	Stool	human gut metagenome
2086.1325901136.28104	UBERON:feces	UBERON:feces	UBERON:feces	No Additive___Day 0	True	304.32	urban biome	human-associated habitat	human-gut	1006 MMC	9606	44.02	-92.4699	Mayo Clinic	False	2086	stool	human gut metagenome
10317.000053457.29082	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	6.2	dense settlement biome	human-associated habitat	human-gut	66757f949a0e9def83ab0a1537113e585b6fced1aa84d2faf72565922d4fe1cf63dc7f2fcd49866a7c5ac036f37f5f35657439cad2342ed1ef51784171db2b6a	9606	50.7	-3.1	UCSDMI	true	10317	Stool	human gut metagenome
10317.000020624.30756	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	35.1	dense settlement biome	human-associated habitat	human-gut	258176caaa5a7a3b549c6ef14d3b6cb1261eed0568bd51aad1bb504ec0e9b5e35427e00703f85f55599324f763fc4039e5cc199a48cac0f3c623b2e6d9fc5c25	9606	51.5	-0.1	UCSDMI	true	10317	Stool	human gut metagenome
10323.O41G42.F.0611A.30841	UBERON:feces	UBERON:feces	UBERON:feces	gazelle fecal O41G42-F	true	822.97	tropical grassland biome	animal-associated habitat	host-associated	O41G42-F	27591	-0.02356	37.906	U Georgia	false	10323	stool	gut metagenome
10323.P51Y69.F.0911C.30841	UBERON:feces	UBERON:feces	UBERON:feces	gazelle fecal P51Y69-F	true	822.97	tropical grassland biome	animal-associated habitat	host-associated	P51Y69-F	27591	-0.02356	37.906	U Georgia	false	10323	stool	gut metagenome
10508.BC114.6.fec.30968	UBERON:feces	UBERON:feces	UBERON:feces	BC114.6.fec	true	33	urban biome	animal-associated habitat	host-associated	BC114	10090	40.742	-73.97399999999998	NYUMC	true	10508	stool	mouse gut metagenome
1448.HCO07.31192	UBERON:feces	UBERON:feces	UBERON:feces	HCO07	true	5102.3	village biome	human-associated habitat	human-gut	HCO07	9606	-12.0	-76.0	OU Lewis lab	false	1448	stool	human gut metagenome
2086.1325901136.36211	UBERON:feces	UBERON:feces	UBERON:feces	No Additive___Day 0	True	304.32	urban biome	human-associated habitat	human-gut	1006 MMC	9606	44.02	-92.4699	Mayo Clinic	False	2086	stool	human gut metagenome
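Since the metadata file is plain TSV, a quick tally of any column is a useful sanity check. The following sketch looks a column up by its header name and counts its values; count_by_column is a hypothetical helper, and the three-row example is a cut-down stand-in for the file above.

```shell
# Hypothetical helper: count the values of one named column in a
# tab-separated metadata file read from stdin (header on line 1).
count_by_column() {
    awk -F'\t' -v col="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i
                  if (!c) { print "no such column: " col > "/dev/stderr"; exit 1 }
                  next }
        { n[$c]++ }
        END { for (v in n) print v "\t" n[v] }' | sort
}

# Cut-down stand-in for the metadata file; with the real file this would be:
#   count_by_column host_taxid < metadata
printf '#SampleID\thost_taxid\n1448.HCO07.31192\t9606\n10508.BC114.6.fec.30968\t10090\n2086.1325901136.28104\t9606\n' |
    count_by_column host_taxid
```

Looking the column up by name rather than by position keeps the helper usable when different sample sets return different metadata columns.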

Importing data from redbiom into QIIME 2

Finally, it is straightforward to import a downloaded biom file into QIIME 2 for subsequent analysis or integration:

$ qiime tools import --input-path data.biom --output-path data.qza --type FeatureTable[Frequency]
$ qiime tools peek data.qza
UUID:        bba4f7df-09a7-4955-8572-b20d5d28e175
Type:        FeatureTable[Frequency]
Data format: BIOMV210DirFmt

Searching for samples, and fetching data from a specific study

redbiom indexes the sample metadata on load, and uses natural language processing techniques to allow for arbitrary queries for samples. Perhaps the most useful type of search is to just obtain all the sample data in BIOM from an existing study. In this example, we’re obtaining Qiita study 2136:

$ redbiom search metadata "where qiita_study_id == 2136" | redbiom fetch samples --context $CTX --output study.biom

We could also search for all samples in which the word “soil” is used somewhere in their metadata, and further refine the search to return only samples which have a “ph” column with a value greater than 7:

$ redbiom search metadata "soil where ph > 7" | wc -l

We can of course also get the BIOM table data from this result. So let’s do that, and in doing so, we’re also going to highlight how redbiom commands can be “piped” together:

$ redbiom search metadata "soil where ph > 7" | redbiom fetch samples --context $CTX --output soil_data.biom
$ biom summarize-table -i soil_data.biom | head
Num samples: 300
Num observations: 175,652
Total count: 7,806,041
Table density (fraction of non-zero values): 0.016

Counts/sample summary:
 Min: 9.000
 Max: 63,526.000
 Median: 29,839.500
 Mean: 26,020.137

You may have noticed that the number of samples obtained is much smaller than our initial query. This is most likely because not all of the 826 samples were run on Illumina, not all of them were 16S v4, or not all of them were long enough to be included in a 150nt trim.

Extracting features for phylogenetic analyses

In the Qiita contexts, the features represented in the closed reference contexts map directly into the Greengenes 13_8 tree. In the Deblur contexts, the features contained correspond to the sequence variants themselves. Let’s take the soil table we produced above, extract the features, and create a QIIME 2 artifact with a FeatureData[Sequence] semantic type. First, let’s take a look at the first 3 features. Note that we’re using the “table-ids” subcommand of biom, and we’re specifying the --observations parameter to obtain the features:

$ biom table-ids -i soil_data.biom --observations | head -n 3

The trick here is that we need to create a FASTA file so the data can be easily understood by QIIME 2. One way we can do this is by writing a very small awk program where we use the feature itself as both the identifier and the sequence. Because the command is getting a little long, we’re going to break it up over multiple lines using \:

$ biom table-ids -i soil_data.biom --observations | \
    head -n 3 | \
    awk '{ print ">" $1 "\n" $1 }' 

Finally, let’s do the same but redirect to a file, and import it into QIIME 2. Note that the head command has been removed so that we can get all of the features:

$ biom table-ids -i soil_data.biom --observations | \
    awk '{ print ">" $1 "\n" $1 }' > soil_data.fa
$ qiime tools import --input-path soil_data.fa \
    --output-path soil_data_rep_seqs.qza \
    --type FeatureData[Sequence]
$ qiime tools peek soil_data_rep_seqs.qza
UUID:        c2f233eb-e777-4fd2-a636-49a6431b0aaa
Type:        FeatureData[Sequence]
Data format: DNASequencesDirectoryFormat
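Before importing, it can be reassuring to check the FASTA file we just wrote. The sketch below assumes the two-line-per-record layout produced by the awk command above, and reports the number of records along with the distinct sequence lengths; fasta_lengths is a hypothetical helper name.

```shell
# Hypothetical helper: read two-line FASTA records (header line, then
# sequence line) on stdin and report the record count and the distinct
# sequence lengths observed.
fasta_lengths() {
    awk 'NR % 2 == 1 { if (substr($0, 1, 1) != ">") { bad = 1; exit } next }
         { records++; seen[length($0)]++ }
         END { if (bad) { print "not two-line FASTA"; exit 1 }
               printf "records=%d lengths=", records
               sep = ""
               for (len in seen) { printf "%s%d", sep, len; sep = "," }
               print "" }'
}

# Toy input with made-up 4 nt sequences, purely for illustration; with the
# real file this would be: fasta_lengths < soil_data.fa
printf '>ACGT\nACGT\n>TTGA\nTTGA\n' | fasta_lengths
```

For a context trimmed at 150 nucleotides we would expect a single length to be reported. If several lengths appear, their order in the output is not guaranteed.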


Thank you for taking the time to read through this community tutorial. It was put together by @BenKaehler and @wasade. We hope you find this tool useful, and please do not hesitate to report issues or feature requests so that we can continue to improve it!

Edit history

Oct. 28 2019 - added a mention of using conda for install
Aug. 24 2020 - updated the context used (thanks @hotblast!)


This tutorial is not working starting from this step.
(base) pn1933734:querying-redbiom garyxie$ redbiom summarize features --from features --context $CTX --category qiita_empo_3
Traceback (most recent call last):
  File "/Users/garyxie/opt/anaconda3/bin/redbiom", line 10, in
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/redbiom/commands/summarize.py", line 158, in summarize_features
    iterable, exact)
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/redbiom/summarize.py", line 65, in category_from_features
  File "/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/redbiom/_requests.py", line 179, in valid
    raise ValueError("Unknown context: %s" % context)
ValueError: Unknown context: Deblur-NA-illumina-16S-v4-150nt-780653


Thanks, @hotblast! The context names have been revised since this tutorial was written. Deblur-Illumina-16S-V4-150nt-780653 should be the correct one now.