redbiom (canonically pronounced "red biome") is a caching layer that facilitates the rapid retrieval of existing processed microbiome datasets. It allows for querying the presence of exact sequence variants, retrieving sample data in BIOM format and sample metadata in TSV, searching for samples by metadata, and more. These queries are fast because the cache is built on Redis, an in-memory key-value database. By default, redbiom issues queries against Qiita. The name of the project is based on its foundations, Redis and BIOM-Format. More information about redbiom can be found on GitHub and in our mSystems article.
tl;dr: redbiom can be used to rapidly search for and obtain processed data and metadata from existing studies.
In the tutorial below, we'll first work through some of the concepts in redbiom, and then show some practical examples, including retrieving all of the sample data and metadata from an existing study. redbiom can be used on the command line or through its Python API. In this tutorial, we'll focus on command line use.
In brief, redbiom allows for:
- finding samples that contain a specific feature or set of features
- finding samples by arbitrary metadata searches
- summarizing samples over a metadata category
- retrieval of sample data in BIOM-Format
- discovering metadata categories that exist in the cache
- pulling out sample data from different processing types (e.g., search for closed reference, retrieve Deblur)
By default, the features from Qiita that are produced by Deblur are represented as the actual sequence variants, and an example at the end of the tutorial shows how these can be extracted for downstream use (e.g., fragment insertion and phylogenetic analyses).
Install
redbiom is not part of the default QIIME 2 distribution. It can be easily installed by running pip install redbiom or by running conda install -c conda-forge redbiom. More information can be found in the Install section of the README.
Technical and biological replicates
redbiom is designed to handle biological and technical replicates. Specifically, it allows for a one-to-many relationship between a sample's metadata and its data. To support this, sample IDs are tagged when loaded into the cache so that the sample data can be differentiated between preparations. Internally, redbiom keeps track of these relationships, and it reports "ambiguous" samples when writing out BIOM tables and mapping files. An ambiguous association is one where the same physical sample is associated with multiple processing runs, as would happen with technical replicates.
Command line help
redbiom can be explored on the command line, and uses a nested command line structure similar to QIIME 2.
$ redbiom --help
Usage: redbiom [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
admin Update database, etc.
fetch Sample data and metadata retrieval.
search Feature and sample search support.
select Select items based on metadata
summarize Summarize things.
Many of the redbiom commands are designed to consume data over standard input, and to dump data to standard output, allowing commands to be "piped" together. An example of this type of operation is at the end of the tutorial.
Contexts
The first step with redbiom is to determine what context or contexts you'd like to use. redbiom organizes data into "contexts" to group data processed in a common way. For example, one context might be composed of Illumina 16S v4 data processed by closed reference OTU picking against Greengenes. The motivation for contexts is to group similar data together in order to reduce biases when comparing results. Data are loaded into contexts, and searches for samples by feature happen within contexts.
It is handy to list the available contexts (note that the context names are not assured to be stable at this time). In the first column of output, we have the context name which is hopefully human readable. In the second column, we can find the number of samples represented in the context, and the third column describes the number of features. The fourth column is a description that is presently unused.
As an example, the first context listed below is "Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1". The human interpretation is that the samples in this context were processed by closed reference OTU picking against Greengenes, all of the samples were run on an Illumina platform, all targeted the v4 region, and all of the sequences were trimmed such that they are 100 nucleotides long. (The "a243a1" is an arbitrary tag that can be ignored.)
$ redbiom summarize contexts
ContextName SamplesWithData FeaturesWithData Description
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1 129596 74983 Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-41ebc6 3034 24839 Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-18S-v9-150nt-bd7d4d 153 72 Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v45-100nt-a243a1 22 8178 Qiita context
Deblur-NA-illumina-16S-v4-90nt-99d1d8 119538 4460311 Qiita context
Pick_closed-reference_OTUs-Greengenes-titanium-16S-v46-90nt-44feac 215 4328 Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v4-90nt-44feac 116 3109 Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v6-8-150nt-bd7d4d 110 4985 Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-100nt-a243a1 3035 24833 Qiita context
...
$ redbiom summarize contexts | wc -l
106
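With over a hundred contexts, it can help to filter the listing rather than read it by eye. The following is a minimal sketch, not part of redbiom itself: `filter_contexts` is a hypothetical helper name, and it assumes the columns are whitespace separated with the context name first and the sample count second, as in the output above.

```shell
# filter_contexts PATTERN: read `redbiom summarize contexts` output on stdin,
# keep rows whose context name matches PATTERN (an awk regular expression),
# and print "SamplesWithData<TAB>ContextName", largest contexts first.
# Assumption: column 1 is the name, column 2 is the sample count.
filter_contexts() {
    awk -v pat="$1" 'NR > 1 && $1 ~ pat { print $2 "\t" $1 }' | sort -k1,1nr
}

# Demo on a few rows copied from the listing above.
filter_contexts '^Deblur' << 'END'
ContextName SamplesWithData FeaturesWithData Description
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1 129596 74983 Qiita context
Deblur-NA-illumina-16S-v4-90nt-99d1d8 119538 4460311 Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-41ebc6 3034 24839 Qiita context
END
```

Against the live cache, the same helper would be used as `redbiom summarize contexts | filter_contexts 'illumina-16S-v4'`.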
Any sample data query to redbiom must specify a context in order to obtain the desired data, so it is useful to define an environment variable. Let's use the Deblur context for Illumina 16S v4 data trimmed at 150 nucleotides:
export CTX=Deblur-Illumina-16S-V4-150nt-780653
Fetching samples based on the features they contain, metadata, and more
We will be downloading several files, so before we go further, let's create a directory to contain them.
mkdir querying-redbiom
cd querying-redbiom
First, let's take a look at how many samples are indexed that also have the "qiita_empo_3" metadata category. This category represents the Earth Microbiome Project ontology, and includes the data from the EMP manuscript as well as inferred EMPO3 category values where possible using existing sample data (e.g., some samples had metadata describing them as fecal, so those samples were inferred to have an "Animal distal gut" EMPO3 value). redbiom allows us to investigate and summarize the available metadata in a category:
$ redbiom summarize metadata-category --category "qiita_empo_3" --counter
Category value count
Hypersaline (saline) 13
Mock community 110
Surface (saline) 123
Sediment (non-saline) 442
Plant surface 532
Plant rhizosphere 676
Plant corpus 678
Sediment (saline) 705
Animal proximal gut 1194
Water (saline) 1505
anthropogenic sample 1652
Sterile water blank 2544
Water (non-saline) 4189
Animal secretion 6507
Soil (non-saline) 7376
Surface (non-saline) 8277
Animal corpus 12616
Animal surface 12625
Animal distal gut 36519
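The counter output is easy to post-process. As a rough sketch (the helper name `category_fraction` is made up for illustration), the snippet below reports what share of the counted samples fall under one category value. It assumes the count is the last field on each line, since the labels themselves can contain spaces.

```shell
# category_fraction VALUE: read `--counter` output on stdin and print
# "<count> <total> <percent>" for the row whose label equals VALUE.
# Rows whose last field is not a number (e.g. the header) are skipped.
category_fraction() {
    awk -v want="$1" '
        $NF ~ /^[0-9]+$/ {
            count = $NF
            label = $0
            sub(/[ \t]+[0-9]+$/, "", label)   # strip the trailing count
            total += count
            if (label == want) hit = count
        }
        END { if (total > 0) printf "%d %d %.1f\n", hit, total, 100 * hit / total }
    '
}

# Demo on three rows copied from the output above.
category_fraction 'Animal distal gut' << 'END'
Category value count
Soil (non-saline) 7376
Animal surface 12625
Animal distal gut 36519
END
```

With the real command, this would be `redbiom summarize metadata-category --category "qiita_empo_3" --counter | category_fraction 'Animal distal gut'`.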
Now let's have some fun. We can search for all of the samples containing a given set of features. For instance, let's start by defining a set of interesting sequence variants:
$ cat > features << END
TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGTAGGCGGATTGGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGG
TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAGGGAGCGCAGGCGGTCTGGCAAGTCTGATGTGAAATACCGGGGCTTAACCCCGGAGCTGCATCCAAAACTGTAGTTCTTGAGTGGAGTAGAGGTAAGCGGAATTCCGAGT
TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGAAGTGAAATCTATGGGCTTAACCCATAAACTGCTTTCAAAACTGCTGGTCTTGAGTGATGGAGAGGCAGGCGGAATTCCGTG
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGCGGTAAGACAAGTCAGAAGTGAAAACCCAGGGCTTAACTCTGGGACTGCTTTTGAAACTGTCAGACTGGAGTGCAGGAGAGGTAAGCGGAATTCCTAG
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTGTAAAGGGTGCGTAGACGGGAGAACAAGTTAGTTGTGAAATACCTCGGCTCAACTGAGGAACTGCAACTAAAACTGTACTTCTTGAGTGCAGGAGAGGTAAGTGGAATTACTAG
TACGTAGGTGACAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGCGTAGGCGGGATAGCAAGTCAGTCGTGAAATACCGGAGCTCAACTCCGGGGCTGCGATTGAAACTGTTATTCTTGAGTATCGGAGAGGAAAGCGGAATTCCTGG
TACGTAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGTAGACGGTGATGTAAGTCTGATGTGAAAGCCTCCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCTATG
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAGGGCGTGTAGCCGGGAAGGCAAGTCAGATGTGAAATCCATGGGCTCAACCTCCAGCCTGCATTTGAAACTGTAGTTCTTGAGTGTCGGAGAGGCAATCGGAATTCCGTGT
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGTGCAGCCGGGCTGACAAGTCAGATGTGAAATCCCGAGGCTTAACCTCGGAACTGCATTTGAAACTGTTAGTCTTGAGTATCGGAGAGGTCATCGGAATTCCTTG
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGAGCGTAGACGGCTGTGCAAGTCTGAAGTGAAAGCCCGGGGCTCAACCCCGGGACTGCTTTGGAAACTGTGCAGCTAGAGTGTCGGAGAGGTAAGCGGAATTCCTAG
END
We can now search for all samples which contain these same features (one or all is supported), and summarize the found samples by their metadata category. In this example, we found at least 10 samples which contain our features of interest, 10 of those samples have the "qiita_empo_3" metadata category described, and all 10 of those samples report as "Animal distal gut" within the "qiita_empo_3" metadata category:
$ redbiom summarize features --from features --context $CTX --category qiita_empo_3
Animal distal gut 10
Total samples 10
We can also get the sample identifiers back:
$ redbiom search features --from features --context $CTX > samples
$ cat samples
36211_2086.1325901136
27052_10317.000005692
31192_1448.HCO07
28104_2086.1325901136
27507_10317.000014551
30841_10323.O41G42.F.0611A
30841_10323.P51Y69.F.0911C
30756_10317.000020624
29082_10317.000053457
26743_1924.Sadowsky.19
30968_10508.BC114.6.fec
IMPORTANT: The identifiers above are delimited by an underscore ("_"), a reserved character for redbiom. IDs of this form in redbiom are used internally so that sample ambiguities can be tracked. The structure is <tag>_<sample-id>. So for example, "31192_1448.HCO07" has the tag "31192", and the actual sample identifier is "1448.HCO07". This will become important in a moment when we obtain sample data and metadata.
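If you need the two parts separately in a script, shell parameter expansion is enough. This is a small sketch with made-up helper names (`id_tag`, `id_sample`); it assumes the tag never contains an underscore, which follows from the underscore being reserved.

```shell
# The tag is everything before the first underscore;
# the sample ID is everything after it.
id_tag()    { printf '%s\n' "${1%%_*}"; }
id_sample() { printf '%s\n' "${1#*_}"; }

id_tag 31192_1448.HCO07      # prints 31192
id_sample 31192_1448.HCO07   # prints 1448.HCO07
```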
We could also restrict our observed samples to only those associated with humans by selecting on host_taxid 9606 (the NCBI taxonomy ID for Homo sapiens):
$ redbiom select samples-from-metadata --from samples --context $CTX "where host_taxid==9606"
31192_1448.HCO07
27052_10317.000005692
27507_10317.000014551
29082_10317.000053457
30756_10317.000020624
28104_2086.1325901136
36211_2086.1325901136
26743_1924.Sadowsky.19
Finally, we can fetch the data for a set of samples across all features. Note that we're pulling all of the samples we found, not just the ones restricted to humans.
$ redbiom fetch samples --from samples --context $CTX --output data.biom
1 sample ambiguities observed. Writing ambiguity mappings to: data.biom.ambiguities
And let's quickly take a look at the data we obtained. Take note of the sample identifiers: these are not structured like the internal redbiom IDs but are instead re-ordered into the form <sample-id>.<tag>.
$ biom summarize-table -i data.biom
Num samples: 11
Num observations: 4,373
Total count: 226,230
Table density (fraction of non-zero values): 0.117
Counts/sample summary:
Min: 6,214.000
Max: 37,983.000
Median: 20,013.000
Mean: 20,566.364
Std. dev.: 10,268.189
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Counts/sample detail:
10317.000020624.30756: 6,214.000
10317.000005692.27052: 7,931.000
10317.000053457.29082: 9,267.000
10508.BC114.6.fec.30968: 11,789.000
1924.Sadowsky.19.26743: 17,564.000
1448.HCO07.31192: 20,013.000
10317.000014551.27507: 28,102.000
2086.1325901136.36211: 28,873.000
2086.1325901136.28104: 28,873.000
10323.P51Y69.F.0911C.30841: 29,621.000
10323.O41G42.F.0611A.30841: 37,983.000
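To line up the IDs in the `samples` file with the IDs in the BIOM table, the two forms can be converted with `sed`. The helpers below are a sketch with hypothetical names; the reverse direction assumes the tag is the text after the last dot, which holds for the IDs shown above but is not something we verify here.

```shell
# to_table_id: <tag>_<sample-id>  ->  <sample-id>.<tag>
to_table_id() {
    sed -E 's/^([^_]+)_(.*)$/\2.\1/'
}

# to_internal_id: <sample-id>.<tag>  ->  <tag>_<sample-id>
# (assumes the tag is whatever follows the last dot)
to_internal_id() {
    sed -E 's/^(.*)\.([^.]+)$/\2_\1/'
}

echo 30968_10508.BC114.6.fec | to_table_id      # prints 10508.BC114.6.fec.30968
echo 10508.BC114.6.fec.30968 | to_internal_id   # prints 30968_10508.BC114.6.fec
```

Both helpers read standard input, so a whole file can be converted with, for example, `to_table_id < samples`.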
When we obtained the data, redbiom reported a single ambiguity. Any ambiguities observed are stored in a JSON object. The interpretation here is that the physical sample "2086.1325901136" corresponds to two samples with data, possibly biological or technical replicates. Both of the samples are represented in our output BIOM table (it appears the sample may be duplicated across preparations, as the read count is identical).
$ cat data.biom.ambiguities
{"2086.1325901136": ["28104_2086.1325901136", "36211_2086.1325901136"]}
Ambiguities arise where multiple data sets exist for the same physical sample, for instance where the same sample was run on multiple preparations (e.g., resequencing). The details are saved in the .ambiguities file.
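For a quick look at which tagged IDs are involved in ambiguities, the file can be scanned without a JSON parser. This is a deliberately simple sketch: `ambiguous_ids` is a hypothetical helper, and it assumes the flat layout shown above, where only the tagged IDs contain an underscore. For anything more involved, a real JSON tool would be the safer choice.

```shell
# ambiguous_ids: read an .ambiguities file on stdin and print every
# quoted string of the form <tag>_<sample-id>, one per line.
ambiguous_ids() {
    grep -oE '"[^"_]+_[^"]+"' | tr -d '"'
}

# Demo on the content shown above.
ambiguous_ids << 'END'
{"2086.1325901136": ["28104_2086.1325901136", "36211_2086.1325901136"]}
END
```

With the file produced earlier, this would be run as `ambiguous_ids < data.biom.ambiguities`.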
Fetching metadata
In addition to the sample data, we can also fetch the metadata for a set of samples:
$ redbiom fetch sample-metadata --from samples --context $CTX --output metadata
1 sample ambiguities observed. Writing ambiguity mappings to: metadata.ambiguities
$ cat metadata
#SampleID body_habitat body_product body_site description dna_extracted elevation env_biome env_feature env_package host_subject_id host_taxid latitude longitude physical_specimen_location physical_specimen_remaining qiita_study_id sample_type scientific_name
1924.Sadowsky.19.26743 UBERON:feces UBERON:feces UBERON:feces Day 17 CD1 TRUE 443.82907 urban biome human-associated habitat human-gut CD1 9606 46.72955 -94.6859 UCSD TRUE 1924 Stool human gut metagenome
10317.000005692.27052 UBERON:feces UBERON:feces UBERON:feces American Gut Project Stool Sample true 229.4 dense settlement biome human-associated habitat human-gut 5a3021c536bd4597726ed37b71d8821ff6a389fa433510b51d2b03fa812fb0f41aac2eb345c7d63554842689d21efc440512f1841f1564543727fc20a6314aaf 9606 -36.8 144.3 UCSDMI true 10317 Stool human gut metagenome
10317.000014551.27507 UBERON:feces UBERON:feces UBERON:feces American Gut Project Stool Sample true 78.6 dense settlement biome human-associated habitat human-gut 4a41054eaea250cae18ad8e1afbec0711398cfc74cd28af327b61e5aeeabd00ee7f67e08232b72becb27092e107f12da18ae9c9b373a9d001078643ef2ab8811 9606 41.2 -73.9 UCSDMI true 10317 Stool human gut metagenome
2086.1325901136.28104 UBERON:feces UBERON:feces UBERON:feces No Additive___Day 0 True 304.32 urban biome human-associated habitat human-gut 1006 MMC 9606 44.02 -92.4699 Mayo Clinic False 2086 stool human gut metagenome
10317.000053457.29082 UBERON:feces UBERON:feces UBERON:feces American Gut Project Stool Sample true 6.2 dense settlement biome human-associated habitat human-gut 66757f949a0e9def83ab0a1537113e585b6fced1aa84d2faf72565922d4fe1cf63dc7f2fcd49866a7c5ac036f37f5f35657439cad2342ed1ef51784171db2b6a 9606 50.7 -3.1 UCSDMI true 10317 Stool human gut metagenome
10317.000020624.30756 UBERON:feces UBERON:feces UBERON:feces American Gut Project Stool Sample true 35.1 dense settlement biome human-associated habitat human-gut 258176caaa5a7a3b549c6ef14d3b6cb1261eed0568bd51aad1bb504ec0e9b5e35427e00703f85f55599324f763fc4039e5cc199a48cac0f3c623b2e6d9fc5c25 9606 51.5 -0.1 UCSDMI true 10317 Stool human gut metagenome
10323.O41G42.F.0611A.30841 UBERON:feces UBERON:feces UBERON:feces gazelle fecal O41G42-F true 822.97 tropical grassland biome animal-associated habitat host-associateO41G42-F 27591 -0.02356 37.906 U Georgia false 10323 stool gut metagenome
10323.P51Y69.F.0911C.30841 UBERON:feces UBERON:feces UBERON:feces gazelle fecal P51Y69-F true 822.97 tropical grassland biome animal-associated habitat host-associateP51Y69-F 27591 -0.02356 37.906 U Georgia false 10323 stool gut metagenome
10508.BC114.6.fec.30968 UBERON:feces UBERON:feces UBERON:feces BC114.6.fec true 33 urban biome animal-associated habitat host-associated BC114 10090 40.742 -73.97399999999998 NYUMC true 10508 stool mouse gut metagenome
1448.HCO07.31192 UBERON:feces UBERON:feces UBERON:feces HCO07 true 5102.3 village biome human-associated habitat human-gut HCO07 9606 -12.0 -76.0 OU Lewis lab false 1448 stool human gut metagenome
2086.1325901136.36211 UBERON:feces UBERON:feces UBERON:feces No Additive___Day 0 True 304.32 urban biome human-associated habitat human-gut 1006 MMC 9606 44.02 -92.4699 Mayo Clinic False 2086 stool human gut metagenome
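Once the metadata is on disk, it can also be filtered locally. The sketch below assumes the file is tab separated with a single header row; `rows_where` is a made-up helper name, and the demo rows are a cut-down version of the table above (three columns only).

```shell
# rows_where COLUMN VALUE: read a TSV on stdin and print the header plus
# every row whose COLUMN equals VALUE. Exits non-zero if COLUMN is missing.
rows_where() {
    awk -F '\t' -v col="$1" -v val="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) exit 1          # no such column in the header
            print
            next
        }
        $idx == val
    '
}

# Demo: keep only the human-associated row.
printf '#SampleID\thost_taxid\tsample_type\n1448.HCO07.31192\t9606\tstool\n10508.BC114.6.fec.30968\t10090\tstool\n' |
    rows_where host_taxid 9606
```

On the downloaded file, the equivalent call would be `rows_where host_taxid 9606 < metadata`.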
Importing data from redbiom into QIIME 2
Finally, it is straightforward to import a downloaded biom file into QIIME 2 for subsequent analysis or integration:
$ qiime tools import --input-path data.biom --output-path data.qza --type FeatureTable[Frequency]
$ qiime tools peek data.qza
UUID: bba4f7df-09a7-4955-8572-b20d5d28e175
Type: FeatureTable[Frequency]
Data format: BIOMV210DirFmt
Searching for samples, and fetching data from a specific study
redbiom indexes the sample metadata on load, and uses natural language processing techniques to allow for arbitrary queries for samples. Perhaps the most useful type of search is to just obtain all the sample data in BIOM from an existing study. In this example, we're obtaining Qiita study 2136:
$ redbiom search metadata "where qiita_study_id == 2136" | redbiom fetch samples --context $CTX --output study.biom
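If you want to pull several studies, the same pipe can be wrapped in a small function. This is only a sketch: the function names and the `study_<id>.biom` output naming are our own choices for illustration, and `fetch_study` needs redbiom, network access, and the `CTX` variable defined earlier.

```shell
# study_query ID: build the metadata query string used in the command above.
study_query() {
    printf 'where qiita_study_id == %s\n' "$1"
}

# fetch_study ID: search for the study's samples and fetch their data
# into study_<ID>.biom (hypothetical file naming), using the context in $CTX.
fetch_study() {
    redbiom search metadata "$(study_query "$1")" |
        redbiom fetch samples --context "$CTX" --output "study_$1.biom"
}

# Example, not run here because it contacts the cache:
#   for study in 2136 10317; do fetch_study "$study"; done
```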
We could also search for all samples in which the word "soil" is used somewhere in their metadata, and further refine the search to return only samples which have a "ph" column where the value is greater than 7:
$ redbiom search metadata "soil where ph > 7" | wc -l
826
We can of course also get the BIOM table data from this result. So let's do that, and in doing so, we're also going to highlight how redbiom commands can be "piped" together:
$ redbiom search metadata "soil where ph > 7" | redbiom fetch samples --context $CTX --output soil_data.biom
$ biom summarize-table -i soil_data.biom | head
Num samples: 300
Num observations: 175,652
Total count: 7,806,041
Table density (fraction of non-zero values): 0.016
Counts/sample summary:
Min: 9.000
Max: 63,526.000
Median: 29,839.500
Mean: 26,020.137
You may have noticed that the number of samples obtained is much smaller than our initial query. This is most likely because not all of the 826 samples were run on Illumina, not all of them were 16S v4, or not all of them were long enough to be included in a 150nt trim.
Extracting features for phylogenetic analyses
In the Qiita contexts, the features represented in the closed reference contexts map directly into the Greengenes 13_8 tree. In the Deblur contexts, the features contained correspond to the sequence variants themselves. Let's take the soil table we produced above, extract the features, and create a QIIME 2 artifact with a FeatureData[Sequence] semantic type. First, let's take a look at the first 3 features. Note that we're using the "table-ids" subcommand of biom, and we're specifying the --observations parameter to obtain the features:
$ biom table-ids -i soil_data.biom --observations | head -n 3
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTCTGGTGTTTAATCCTGGGGCTCAACTCCGGGTCGCACTGGAAACTGGTAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGT
GTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGCGCAGGCTGCGCGGACAGTCAAATGTGAAATTCAGGGGCTCAACCCCTGCATTGCGCTTGATACTTCCGCGCTCGAGCCTTGG
TACGTAGGGACCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGTGGTAAGTCACTTGTGAAATCTCTGAGCTTAACTCAGAACGGCCAAGTGATACTGCTGTGCTCGAGTGTGGAAGGGGCAATCGGAATTCTTGG
The trick here is that we need to create a FASTA file so the data can be easily understood by QIIME 2. One way we can do this is by writing a very small awk program where we use the feature itself as both the identifier and the sequence. Because the command is getting a little long, we're going to break it up over multiple lines using \:
$ biom table-ids -i soil_data.biom --observations | \
head -n 3 | \
awk '{ print ">" $1 "\n" $1 }'
>TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTCTGGTGTTTAATCCTGGGGCTCAACTCCGGGTCGCACTGGAAACTGGTAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGT
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTCTGGTGTTTAATCCTGGGGCTCAACTCCGGGTCGCACTGGAAACTGGTAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGT
>GTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGCGCAGGCTGCGCGGACAGTCAAATGTGAAATTCAGGGGCTCAACCCCTGCATTGCGCTTGATACTTCCGCGCTCGAGCCTTGG
GTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGCGCAGGCTGCGCGGACAGTCAAATGTGAAATTCAGGGGCTCAACCCCTGCATTGCGCTTGATACTTCCGCGCTCGAGCCTTGG
>TACGTAGGGACCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGTGGTAAGTCACTTGTGAAATCTCTGAGCTTAACTCAGAACGGCCAAGTGATACTGCTGTGCTCGAGTGTGGAAGGGGCAATCGGAATTCTTGG
TACGTAGGGACCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGTGGTAAGTCACTTGTGAAATCTCTGAGCTTAACTCAGAACGGCCAAGTGATACTGCTGTGCTCGAGTGTGGAAGGGGCAATCGGAATTCTTGG
Finally, let's do the same but redirect to a file, and import it into QIIME 2. Note that the head command has been removed so that we can get all of the features:
$ biom table-ids -i soil_data.biom --observations | \
awk '{ print ">" $1 "\n" $1 }' > soil_data.fa
$ qiime tools import --input-path soil_data.fa \
--output-path soil_data_rep_seqs.qza \
--type FeatureData[Sequence]
$ qiime tools peek soil_data_rep_seqs.qza
UUID: c2f233eb-e777-4fd2-a636-49a6431b0aaa
Type: FeatureData[Sequence]
Data format: DNASequencesDirectoryFormat
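A quick sanity check on the FASTA file can be run before the import step above. The sketch below is our own addition rather than anything redbiom or QIIME 2 provides: `check_fasta` is a hypothetical helper, and it assumes the features are plain A/C/G/T strings with one sequence line per record, as produced by the awk program.

```shell
# check_fasta: read FASTA on stdin and exit non-zero if any sequence line
# contains characters other than A, C, G, T, or if a sequence differs from
# the header that precedes it.
check_fasta() {
    awk '
        /^>/ { id = substr($0, 2); next }
        $0 !~ /^[ACGT]+$/ { print "bad sequence at line " NR; bad = 1 }
        $0 != id          { print "header/sequence mismatch at line " NR; bad = 1 }
        END { exit bad + 0 }
    '
}

# Demo: two toy features run through the same awk program used above.
printf 'ACGT\nGGCC\n' | awk '{ print ">" $1 "\n" $1 }' | check_fasta &&
    echo "FASTA looks consistent"
```

On the real file, this would be `check_fasta < soil_data.fa`.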
Conclusion
Thank you for taking the time to read through this community tutorial. It was put together by @BenKaehler and @wasade. We hope you find this tool useful, and please do not hesitate to report issues or feature requests so that we can continue to improve it!
Edit history
Oct. 28 2019 - added a mention of using conda for install
Aug. 24 2020 - updated the context used (thanks @hotblast!)