Querying for public microbiome data in Qiita using redbiom

redbiom (canonically pronounced "red biome") is a caching layer that facilitates the rapid retrieval of existing processed microbiome datasets. It supports querying for the presence of exact sequence variants, retrieving sample data in BIOM format and sample metadata in TSV, searching for samples by metadata, and more. These queries are fast because the cache is built on Redis, an in-memory key-value database. By default, redbiom issues queries against Qiita. The project's name comes from its foundations, Redis and BIOM-Format. More information about redbiom can be found on GitHub and in our mSystems article.

tl;dr: redbiom can be used to rapidly search and obtain processed data and metadata from existing studies.

In the tutorial below, we'll first work through some of the concepts in redbiom, and then show some practical examples, including retrieving all of the sample data and metadata from an existing study. redbiom can be used on the command line or through its Python API. In this tutorial, we'll focus on command-line use.

In brief, redbiom allows for:

  • finding samples that contain a specific feature or set of features
  • finding samples by arbitrary metadata searches
  • summarizing samples over metadata category
  • retrieval of sample data in BIOM-Format
  • discovering metadata categories that exist in the cache
  • pulling out sample data from different processing types (e.g., search for closed reference, retrieve Deblur)

By default, the features from Qiita that are produced by Deblur are represented as the actual sequence variants, and an example at the end of the tutorial shows how these can be extracted for downstream use (e.g., fragment insertion and phylogenetic analyses).

Install

redbiom is not part of the default QIIME 2 distribution. It can be easily installed by running pip install redbiom or by running conda install -c conda-forge redbiom. More information can be found in the Install section of the README.

Technical and biological replicates

redbiom is designed to handle biological and technical replicates. Specifically, it allows for a one-to-many relationship between a sample's metadata and its data. In order to support this, sample IDs are tagged when loaded into the cache so the sample data can be differentiated between preparations. Internally, redbiom keeps track of these relationships, and it reports "ambiguous" samples when writing out BIOM tables and mapping files. An ambiguous association is one where the same physical sample is associated with multiple processing runs, as would happen with technical replicates.

Command line help

redbiom can be explored on the command line, and uses a nested command line structure similar to QIIME 2.

$ redbiom --help
Usage: redbiom [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  admin      Update database, etc.
  fetch      Sample data and metadata retrieval.
  search     Feature and sample search support.
  select     Select items based on metadata
  summarize  Summarize things.

Many of the redbiom commands are designed to consume data over standard input, and to dump data to standard output, allowing commands to be "piped" together. An example of this type of operation is at the end of the tutorial.

Contexts

The first step with redbiom is to determine which context or contexts you'd like to use. redbiom organizes data into "contexts" to group data processed in a common way. For example, one context might be composed of Illumina 16S v4 data processed by closed reference OTU picking against Greengenes. The motivation for contexts is to group similar data together in order to reduce biases when comparing results. Data are loaded into contexts, and searches for samples by feature happen within contexts.

It is handy to list the available contexts (note that the context names are not assured to be stable at this time). In the first column of output, we have the context name which is hopefully human readable. In the second column, we can find the number of samples represented in the context, and the third column describes the number of features. The fourth column is a description that is presently unused.

As an example, the first context listed below is "Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1". The human interpretation is that the samples in this context were picked closed reference against Greengenes, all of the samples were run on an Illumina platform, all targeted the v4 region, and all of the sequences were trimmed such that they are all 100 nucleotides long. (The "a243a1" is an arbitrary tag that can be ignored.)

$ redbiom summarize contexts
ContextName	SamplesWithData	FeaturesWithData	Description
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1	129596	74983	Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-41ebc6	3034	24839	Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-18S-v9-150nt-bd7d4d	153	72	Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v45-100nt-a243a1	22	8178	Qiita context
Deblur-NA-illumina-16S-v4-90nt-99d1d8	119538	4460311	Qiita context
Pick_closed-reference_OTUs-Greengenes-titanium-16S-v46-90nt-44feac	215	4328	Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v4-90nt-44feac	116	3109	Qiita context
Pick_closed-reference_OTUs-Greengenes-illumina-16S-v6-8-150nt-bd7d4d	110	4985	Qiita context
Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-100nt-a243a1	3035	24833	Qiita context
...

$ redbiom summarize contexts | wc -l
     106

Any sample data query to redbiom must specify a context in order to obtain the desired data, so it is useful to define an environment variable. Let's use the Deblur context for Illumina 16S v4 data trimmed at 150 nucleotides:

export CTX=Deblur-Illumina-16S-V4-150nt-780653
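
Because context names are not assured to be stable, it can be worth confirming that the name stored in CTX still appears in the listing before running any queries. The following is a minimal sketch, not part of redbiom itself: context_exists is a hypothetical helper name, and it assumes the listing is tab-separated with the context name in the first column, as in the output shown above.

```shell
# Check whether a context name appears in the first column of the
# tab-separated listing printed by `redbiom summarize contexts`.
# The listing is read from standard input so the function can sit in a pipe.
# (context_exists is a hypothetical helper, not a redbiom command.)
context_exists() {
    cut -f1 | grep -Fxq -- "$1"
}

# Demonstration on two rows copied from the listing above (no network needed).
listing=$(printf '%s\t%s\t%s\t%s\n' \
    'Deblur-NA-illumina-16S-v4-90nt-99d1d8' 119538 4460311 'Qiita context' \
    'Pick_closed-reference_OTUs-Greengenes-flx-16S-v2-41ebc6' 3034 24839 'Qiita context')

if printf '%s\n' "$listing" | context_exists 'Deblur-NA-illumina-16S-v4-90nt-99d1d8'; then
    echo "context found"
else
    echo "context missing"
fi
```

Against the live cache, the same check would read: redbiom summarize contexts | context_exists "$CTX"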

Fetching samples based on the features they contain, metadata, and more

We will be downloading several files, so before we go further, let's create a directory to contain them.

mkdir querying-redbiom
cd querying-redbiom

First, let's take a look at how many samples are indexed that also have the "qiita_empo_3" metadata category. This category represents the Earth Microbiome Project ontology, and includes the data from the EMP manuscript as well as inferred EMPO3 category values where possible using existing sample data (e.g., some samples had metadata describing them as fecal, so that sample was inferred to be of an "Animal distal gut" EMPO3 value). redbiom allows us to investigate and summarize the available metadata in a category:

$ redbiom summarize metadata-category --category "qiita_empo_3" --counter
Category value	count
Hypersaline (saline)	13
Mock community	110
Surface (saline)	123
Sediment (non-saline)	442
Plant surface	532
Plant rhizosphere	676
Plant corpus	678
Sediment (saline)	705
Animal proximal gut	1194
Water (saline)	1505
anthropogenic sample	1652
Sterile water blank	2544
Water (non-saline)	4189
Animal secretion	6507
Soil (non-saline)	7376
Surface (non-saline)	8277
Animal corpus	12616
Animal surface	12625
Animal distal gut	36519

Now let's have some fun. We can search for all of the samples containing a given set of features. For instance, let's start by defining a set of interesting sequence variants:

$ cat > features << END
TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGTAGGCGGATTGGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGG
TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAGGGAGCGCAGGCGGTCTGGCAAGTCTGATGTGAAATACCGGGGCTTAACCCCGGAGCTGCATCCAAAACTGTAGTTCTTGAGTGGAGTAGAGGTAAGCGGAATTCCGAGT
TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGAAGTGAAATCTATGGGCTTAACCCATAAACTGCTTTCAAAACTGCTGGTCTTGAGTGATGGAGAGGCAGGCGGAATTCCGTG
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGCGGTAAGACAAGTCAGAAGTGAAAACCCAGGGCTTAACTCTGGGACTGCTTTTGAAACTGTCAGACTGGAGTGCAGGAGAGGTAAGCGGAATTCCTAG
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTGTAAAGGGTGCGTAGACGGGAGAACAAGTTAGTTGTGAAATACCTCGGCTCAACTGAGGAACTGCAACTAAAACTGTACTTCTTGAGTGCAGGAGAGGTAAGTGGAATTACTAG
TACGTAGGTGACAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGCGTAGGCGGGATAGCAAGTCAGTCGTGAAATACCGGAGCTCAACTCCGGGGCTGCGATTGAAACTGTTATTCTTGAGTATCGGAGAGGAAAGCGGAATTCCTGG
TACGTAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGTAGACGGTGATGTAAGTCTGATGTGAAAGCCTCCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCTATG
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAGGGCGTGTAGCCGGGAAGGCAAGTCAGATGTGAAATCCATGGGCTCAACCTCCAGCCTGCATTTGAAACTGTAGTTCTTGAGTGTCGGAGAGGCAATCGGAATTCCGTGT
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGTGCAGCCGGGCTGACAAGTCAGATGTGAAATCCCGAGGCTTAACCTCGGAACTGCATTTGAAACTGTTAGTCTTGAGTATCGGAGAGGTCATCGGAATTCCTTG
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGAGCGTAGACGGCTGTGCAAGTCTGAAGTGAAAGCCCGGGGCTCAACCCCGGGACTGCTTTGGAAACTGTGCAGCTAGAGTGTCGGAGAGGTAAGCGGAATTCCTAG
END
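
Since the Deblur features are the sequence variants themselves, a query sequence presumably has to match a stored feature exactly, including its length, so it may help to confirm that every line in the feature file has the trim length of the chosen context. This is a small sketch under that assumption; check_feature_lengths is a hypothetical helper name.

```shell
# Print every line of a feature file whose length differs from the expected
# trim length, prefixed by its line number. No output means all lines match.
# (check_feature_lengths is a hypothetical helper, not a redbiom command.)
check_feature_lengths() {
    # $1 = feature file, $2 = expected length in nucleotides
    awk -v n="$2" 'length($0) != n { print NR ": " length($0) " nt" }' "$1"
}

# Demonstration on a toy file: the second line is one nucleotide short.
tmp=$(mktemp)
printf 'ACGT\nACG\n' > "$tmp"
check_feature_lengths "$tmp" 4
rm -f "$tmp"
```

For the file created above and a 150 nt context, the call would be: check_feature_lengths features 150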

We can now search for all samples which contain these same features (one or all is supported), and summarize the found samples by their metadata category. In this example, we found at least 10 samples which contain our features of interest, 10 of those samples have the "qiita_empo_3" metadata category described, and all 10 of those samples report as "Animal distal gut" within the "qiita_empo_3" metadata category:

$ redbiom summarize features --from features --context $CTX --category qiita_empo_3
Animal distal gut	10

Total samples	10

We can also get the sample identifiers back:

$ redbiom search features --from features --context $CTX > samples
$ cat samples
36211_2086.1325901136
27052_10317.000005692
31192_1448.HCO07
28104_2086.1325901136
27507_10317.000014551
30841_10323.O41G42.F.0611A
30841_10323.P51Y69.F.0911C
30756_10317.000020624
29082_10317.000053457
26743_1924.Sadowsky.19
30968_10508.BC114.6.fec

IMPORTANT: The identifiers above are delimited by an underscore ("_"), a reserved character for redbiom. IDs of this form in redbiom are used internally so that sample ambiguities can be tracked. The structure is <tag>_<sample-id>. So for example, "31192_1448.HCO07" has the tag "31192", and the actual sample identifier is "1448.HCO07". This will become important in a moment when we obtain sample data and metadata.
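
To make the <tag>_<sample-id> structure concrete, here is a minimal sketch that splits an internal ID into its two parts. split_redbiom_id is a hypothetical helper name, and the sketch assumes the tag never contains an underscore, so the split happens at the first one.

```shell
# Split a redbiom internal ID of the form <tag>_<sample-id> at the first
# underscore and print the tag and the sample ID separated by a tab.
# (split_redbiom_id is a hypothetical helper, not a redbiom command.)
split_redbiom_id() {
    # ${1%%_*} keeps everything before the first underscore (the tag);
    # ${1#*_} keeps everything after it (the sample ID).
    printf '%s\t%s\n' "${1%%_*}" "${1#*_}"
}

split_redbiom_id '31192_1448.HCO07'
```

For the ID used in the paragraph above, this prints the tag 31192 followed by the sample identifier 1448.HCO07.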

We could also restrict our observed samples to only those associated with humans by selecting host_taxid 9606:

$ redbiom select samples-from-metadata --from samples --context $CTX "where host_taxid==9606"
31192_1448.HCO07
27052_10317.000005692
27507_10317.000014551
29082_10317.000053457
30756_10317.000020624
28104_2086.1325901136
36211_2086.1325901136
26743_1924.Sadowsky.19

Finally we can fetch the data for a set of samples across all features. Note that we're pulling all of the samples we found, not just the ones restricted to humans.

$ redbiom fetch samples --from samples --context $CTX --output data.biom
1 sample ambiguities observed. Writing ambiguity mappings to: data.biom.ambiguities

And let's quickly take a look at the data we obtained. Take note of the sample identifiers: these are not structured like the internal redbiom IDs but are instead re-ordered into the form <sample-id>.<tag>.

$ biom summarize-table -i data.biom
Num samples: 11
Num observations: 4,373
Total count: 226,230
Table density (fraction of non-zero values): 0.117

Counts/sample summary:
 Min: 6,214.000
 Max: 37,983.000
 Median: 20,013.000
 Mean: 20,566.364
 Std. dev.: 10,268.189
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
10317.000020624.30756: 6,214.000
10317.000005692.27052: 7,931.000
10317.000053457.29082: 9,267.000
10508.BC114.6.fec.30968: 11,789.000
1924.Sadowsky.19.26743: 17,564.000
1448.HCO07.31192: 20,013.000
10317.000014551.27507: 28,102.000
2086.1325901136.36211: 28,873.000
2086.1325901136.28104: 28,873.000
10323.P51Y69.F.0911C.30841: 29,621.000
10323.O41G42.F.0611A.30841: 37,983.000
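
If you need to match the internal IDs in the samples file against the IDs in the table, the re-ordering can be reproduced with a one-line sed expression. This is a sketch based only on the two ID forms shown in this tutorial; to_table_ids is a hypothetical helper name.

```shell
# Rewrite internal IDs (<tag>_<sample-id>), one per line on standard input,
# into the <sample-id>.<tag> form seen in the BIOM table.
# Lines without an underscore are passed through unchanged.
# (to_table_ids is a hypothetical helper, not a redbiom command.)
to_table_ids() {
    sed 's/^\([^_]*\)_\(.*\)$/\2.\1/'
}

printf '%s\n' '31192_1448.HCO07' '36211_2086.1325901136' | to_table_ids
```

The two lines printed are 1448.HCO07.31192 and 2086.1325901136.36211, which are the forms that appear in the counts listing above.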

When we obtained the data, redbiom reported a single ambiguity. Any ambiguities observed are stored in a JSON object. The interpretation here is that the physical sample "2086.1325901136" corresponds to two samples with data, possibly biological or technical replicates. Both of the samples are represented in our output BIOM table (it appears the sample may have been duplicated across preparations, as the read count is identical).

$ cat data.biom.ambiguities 
{"2086.1325901136": ["28104_2086.1325901136", "36211_2086.1325901136"]}

Ambiguities arise where multiple data sets exist for the same physical sample, for instance where the same sample was run on multiple preparations (e.g., resequencing). The details are saved in the .ambiguities file.
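
The same ambiguities can be spotted directly in a list of internal IDs, before fetching anything, by dropping the tag and looking for sample IDs that occur more than once. This is a rough sketch rather than how redbiom computes them; ambiguous_samples is a hypothetical helper name.

```shell
# List sample IDs that appear under more than one tag in a file of internal
# IDs (<tag>_<sample-id>, one per line).
# (ambiguous_samples is a hypothetical helper, not a redbiom command.)
ambiguous_samples() {
    # Drop the "<tag>_" prefix, then keep sample IDs that occur more than once.
    sed 's/^[^_]*_//' "$1" | sort | uniq -d
}

# Demonstration on three of the IDs returned by the search above.
tmp=$(mktemp)
printf '%s\n' '36211_2086.1325901136' '31192_1448.HCO07' '28104_2086.1325901136' > "$tmp"
ambiguous_samples "$tmp"
rm -f "$tmp"
```

Here the only ID printed is 2086.1325901136, matching the key in the .ambiguities file.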

Fetching metadata

In addition to the sample data, we can also fetch the metadata for a set of samples:

$ redbiom fetch sample-metadata --from samples --context $CTX --output metadata
1 sample ambiguities observed. Writing ambiguity mappings to: metadata.ambiguities
$ cat metadata
#SampleID	body_habitat	body_product	body_site	description	dna_extracted	elevation	env_biome	env_feature	env_package	host_subject_id	host_taxid	latitude	longitude	physical_specimen_location	physical_specimen_remaining	qiita_study_id	sample_type	scientific_name
1924.Sadowsky.19.26743	UBERON:feces	UBERON:feces	UBERON:feces	Day 17 CD1	TRUE	443.82907	urban biome	human-associated habitat	human-gut	CD1	9606	46.72955	-94.6859	UCSD	TRUE	1924	Stool	human gut metagenome
10317.000005692.27052	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	229.4	dense settlement biome	human-associated habitat	human-gut	5a3021c536bd4597726ed37b71d8821ff6a389fa433510b51d2b03fa812fb0f41aac2eb345c7d63554842689d21efc440512f1841f1564543727fc20a6314aaf	9606	-36.8	144.3	UCSDMI	true	10317	Stool	human gut metagenome
10317.000014551.27507	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	78.6	dense settlement biome	human-associated habitat	human-gut	4a41054eaea250cae18ad8e1afbec0711398cfc74cd28af327b61e5aeeabd00ee7f67e08232b72becb27092e107f12da18ae9c9b373a9d001078643ef2ab8811	9606	41.2	-73.9	UCSDMI	true	10317	Stool	human gut metagenome
2086.1325901136.28104	UBERON:feces	UBERON:feces	UBERON:feces	No Additive___Day 0	True	304.32	urban biome	human-associated habitat	human-gut	1006 MMC	9606	44.02	-92.4699	Mayo Clinic	False	2086	stool	human gut metagenome
10317.000053457.29082	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	6.2	dense settlement biome	human-associated habitat	human-gut	66757f949a0e9def83ab0a1537113e585b6fced1aa84d2faf72565922d4fe1cf63dc7f2fcd49866a7c5ac036f37f5f35657439cad2342ed1ef51784171db2b6a	9606	50.7	-3.1	UCSDMI	true	10317	Stool	human gut metagenome
10317.000020624.30756	UBERON:feces	UBERON:feces	UBERON:feces	American Gut Project Stool Sample	true	35.1	dense settlement biome	human-associated habitat	human-gut	258176caaa5a7a3b549c6ef14d3b6cb1261eed0568bd51aad1bb504ec0e9b5e35427e00703f85f55599324f763fc4039e5cc199a48cac0f3c623b2e6d9fc5c25	9606	51.5	-0.1	UCSDMI	true	10317	Stool	human gut metagenome
10323.O41G42.F.0611A.30841	UBERON:feces	UBERON:feces	UBERON:feces	gazelle fecal O41G42-F	true	822.97	tropical grassland biome	animal-associated habitat	host-associated	O41G42-F	27591	-0.02356	37.906	U Georgia	false	10323	stool	gut metagenome
10323.P51Y69.F.0911C.30841	UBERON:feces	UBERON:feces	UBERON:feces	gazelle fecal P51Y69-F	true	822.97	tropical grassland biome	animal-associated habitat	host-associated	P51Y69-F	27591	-0.02356	37.906	U Georgia	false	10323	stool	gut metagenome
10508.BC114.6.fec.30968	UBERON:feces	UBERON:feces	UBERON:feces	BC114.6.fec	true	33	urban biome	animal-associated habitat	host-associated	BC114	10090	40.742	-73.97399999999998	NYUMC	true	10508	stool	mouse gut metagenome
1448.HCO07.31192	UBERON:feces	UBERON:feces	UBERON:feces	HCO07	true	5102.3	village biome	human-associated habitat	human-gut	HCO07	9606	-12.0	-76.0	OU Lewis lab	false	1448	stool	human gut metagenome
2086.1325901136.36211	UBERON:feces	UBERON:feces	UBERON:feces	No Additive___Day 0	True	304.32	urban biome	human-associated habitat	human-gut	1006 MMC	9606	44.02	-92.4699	Mayo Clinic	False	2086	stool	human gut metagenome
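
The metadata file is plain tab-separated text, so individual columns can be pulled out by name with standard tools. Below is a minimal sketch; tsv_column is a hypothetical helper name, and it assumes the first line of the file is the header row, as in the output above.

```shell
# Print the values of one named column from a tab-separated metadata file,
# skipping the header row. Prints nothing if the column is not present.
# (tsv_column is a hypothetical helper, not a redbiom command.)
tsv_column() {
    # $1 = metadata file, $2 = column name
    awk -F '\t' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c       { print $c }
    ' "$1"
}

# Demonstration on a cut-down version of the metadata shown above.
tmp=$(mktemp)
printf '#SampleID\thost_taxid\tsample_type\n' > "$tmp"
printf '1448.HCO07.31192\t9606\tstool\n' >> "$tmp"
printf '10508.BC114.6.fec.30968\t10090\tstool\n' >> "$tmp"
tsv_column "$tmp" host_taxid
rm -f "$tmp"
```

On the real file, tsv_column metadata host_taxid | sort | uniq -c would tally how many of the fetched samples belong to each host.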

Importing data from redbiom into QIIME 2

Finally, it is straightforward to import a downloaded biom file into QIIME 2 for subsequent analysis or integration:

$ qiime tools import --input-path data.biom --output-path data.qza --type FeatureTable[Frequency]
$ qiime tools peek data.qza
UUID:        bba4f7df-09a7-4955-8572-b20d5d28e175
Type:        FeatureTable[Frequency]
Data format: BIOMV210DirFmt

Searching for samples, and fetching data from a specific study

redbiom indexes the sample metadata on load, and uses natural language processing techniques to allow for arbitrary queries for samples. Perhaps the most useful type of search is to just obtain all the sample data in BIOM from an existing study. In this example, we're obtaining Qiita study 2136:

$ redbiom search metadata "where qiita_study_id == 2136" | redbiom fetch samples --context $CTX --output study.biom

We could also search for all samples in which the word "soil" is used somewhere in their metadata, and further refine the search to return only samples which have a "ph" column whose value is greater than 7:

$ redbiom search metadata "soil where ph > 7" | wc -l
826

We can of course also get the BIOM table data from this result. So let's do that, and in doing so, we're also going to highlight how redbiom commands can be "piped" together:

$ redbiom search metadata "soil where ph > 7" | redbiom fetch samples --context $CTX --output soil_data.biom
$ biom summarize-table -i soil_data.biom | head
Num samples: 300
Num observations: 175,652
Total count: 7,806,041
Table density (fraction of non-zero values): 0.016

Counts/sample summary:
 Min: 9.000
 Max: 63,526.000
 Median: 29,839.500
 Mean: 26,020.137

You may have noticed that the number of samples obtained is much smaller than our initial query. This is most likely because not all of the 826 samples were run on Illumina, not all of them were 16S v4, or not all of them were long enough to be included in a 150nt trim.

Extracting features for phylogenetic analyses

In the Qiita contexts, the features represented in the closed reference contexts map directly into the Greengenes 13_8 tree. In the Deblur contexts, the features contained correspond to the sequence variants themselves. Let's take the soil table we produced above, extract the features, and create a QIIME 2 artifact with a FeatureData[Sequence] semantic type. First, let's take a look at the first 3 features. Note that we're using the "table-ids" subcommand of biom, and we're specifying the --observations parameter to obtain the features:

$ biom table-ids -i soil_data.biom --observations | head -n 3
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTCTGGTGTTTAATCCTGGGGCTCAACTCCGGGTCGCACTGGAAACTGGTAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGT
GTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGCGCAGGCTGCGCGGACAGTCAAATGTGAAATTCAGGGGCTCAACCCCTGCATTGCGCTTGATACTTCCGCGCTCGAGCCTTGG
TACGTAGGGACCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGTGGTAAGTCACTTGTGAAATCTCTGAGCTTAACTCAGAACGGCCAAGTGATACTGCTGTGCTCGAGTGTGGAAGGGGCAATCGGAATTCTTGG

The trick here is that we need to create a FASTA file so the data can be easily understood by QIIME 2. One way we can do this is by writing a very small awk program where we use the feature itself as both the identifier and the sequence. Because the command is getting a little long, we're going to break it up over multiple lines using \:

$ biom table-ids -i soil_data.biom --observations | \
    head -n 3 | \
    awk '{ print ">" $1 "\n" $1 }' 
>TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTCTGGTGTTTAATCCTGGGGCTCAACTCCGGGTCGCACTGGAAACTGGTAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGT
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTCTGGTGTTTAATCCTGGGGCTCAACTCCGGGTCGCACTGGAAACTGGTAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGT
>GTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGCGCAGGCTGCGCGGACAGTCAAATGTGAAATTCAGGGGCTCAACCCCTGCATTGCGCTTGATACTTCCGCGCTCGAGCCTTGG
GTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGCGCAGGCTGCGCGGACAGTCAAATGTGAAATTCAGGGGCTCAACCCCTGCATTGCGCTTGATACTTCCGCGCTCGAGCCTTGG
>TACGTAGGGACCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGTGGTAAGTCACTTGTGAAATCTCTGAGCTTAACTCAGAACGGCCAAGTGATACTGCTGTGCTCGAGTGTGGAAGGGGCAATCGGAATTCTTGG
TACGTAGGGACCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGTGGTAAGTCACTTGTGAAATCTCTGAGCTTAACTCAGAACGGCCAAGTGATACTGCTGTGCTCGAGTGTGGAAGGGGCAATCGGAATTCTTGG

Finally, let's do the same but redirect to a file, and import it into QIIME 2. Note that the head command has been removed so that we can get all of the features:

$ biom table-ids -i soil_data.biom --observations | \
    awk '{ print ">" $1 "\n" $1 }' > soil_data.fa
$ qiime tools import --input-path soil_data.fa \
    --output-path soil_data_rep_seqs.qza \
    --type FeatureData[Sequence]
$ qiime tools peek soil_data_rep_seqs.qza
UUID:        c2f233eb-e777-4fd2-a636-49a6431b0aaa
Type:        FeatureData[Sequence]
Data format: DNASequencesDirectoryFormat
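
The reverse direction can be handy too: if your sequences of interest live in a FASTA file, a plain one-sequence-per-line list like the "features" file used earlier can be derived by dropping the header lines. This is a sketch that assumes each sequence sits on a single line (no line wrapping); fasta_to_features is a hypothetical helper name.

```shell
# Print the sequences from a FASTA file, one per line, dropping the ">" header
# lines and any blank lines. Assumes each sequence occupies a single line.
# (fasta_to_features is a hypothetical helper, not a redbiom or QIIME 2 command.)
fasta_to_features() {
    awk '!/^>/ && NF { print $1 }' "$1"
}

# Demonstration on a toy two-record FASTA file.
tmp=$(mktemp)
printf '>seq1\nACGT\n>seq2\nTTGA\n' > "$tmp"
fasta_to_features "$tmp"
rm -f "$tmp"
```

The demonstration prints ACGT and TTGA on separate lines; redirecting the output to a file gives a list in the same format as the "features" file.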

Conclusion

Thank you for taking the time to read through this community tutorial. It was put together by @BenKaehler and @wasade. We hope you find this tool useful, and please do not hesitate to report issues or feature requests so that we can continue to improve it!

Edit history

Oct. 28 2019 - added a mention of using conda for install
Aug. 24 2020 - updated the context used (thanks @hotblast!)

This tutorial is not working starting from this step.
(base) pn1933734:querying-redbiom garyxie$ redbiom summarize features --from features --context $CTX --category qiita_empo_3
Traceback (most recent call last):
File “/Users/garyxie/opt/anaconda3/bin/redbiom”, line 10, in
sys.exit(cli())
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py”, line 764, in call
return self.main(*args, **kwargs)
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py”, line 717, in main
rv = self.invoke(ctx)
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py”, line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py”, line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py”, line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/click/core.py”, line 555, in invoke
return callback(*args, **kwargs)
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/redbiom/commands/summarize.py”, line 158, in summarize_features
iterable, exact)
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/redbiom/summarize.py”, line 65, in category_from_features
redbiom._requests.valid(context)
File “/Users/garyxie/opt/anaconda3/lib/python3.7/site-packages/redbiom/_requests.py”, line 179, in valid
raise ValueError(“Unknown context: %s” % context)
ValueError: Unknown context: Deblur-NA-illumina-16S-v4-150nt-780653

Thanks, @hotblast! The context names have been revised since this tutorial was written. Deblur-Illumina-16S-V4-150nt-780653 should be the correct one now.

Thanks @wasade for your tutorial.
I am stuck at the following command:
redbiom summarize features --from features --context $CTX --category qiita_empo_3
I get a result of 0
Even when I define other features such as 1 single sequence, it is not found in the $CTX.
(I used export -p to verify that CTX is stored, and cat features to verify the features is stored).
Another question: How can I obtain the list of all the metadata-categories available?
Thanks again!

Thanks, @Microbio_Spelman! Could you send an example of a feature that is being searched?

Best,
Daniel

…to your second question, @Microbio_Spelman, I don’t think there is a means right now to dump all of the available categories, but I could be misremembering. Any chance you could open an issue on the redbiom GitHub tracker?

That said, you can search for metadata categories with redbiom search metadata --categories <query>. For example:

$ redbiom search metadata --categories age | head
host_age_units
solid_start_approx_age_months
age_binary
irrit_bowel_syndrome_age
nest_age_1
mother_age
child_1_age_units
age_at_death_units
host_age
age_in_years

Thanks!
Actually, I typed
redbiom summarize metadata
and after a few hours obtained a whole list of categories, with the counts. So that seemed to have worked.
I believe your strategy is likely more efficient. I will try that.

I used the list of features that you stated in the tutorial.
(the CTX is the one you defined through
export CTX=Deblur-Illumina-16S-V4-150nt-780653)

I also created (with the 1st sequence in your features list):
cat > features2 << END
TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGTAGGCGGATTGGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGG
END
and still obtained 0 counts.
In that case I used the code:
redbiom summarize features --from features2 --context $CTX --category "qiita_empo_3"

I tried with several other ones as well, without success.

Hi @Microbio_Spelman, that feature was only observed in a single sample, and that sample does not appear to have the qiita_empo_3 variable associated (although I do see empo_3):

$ f=TACGTAGGTGGCGAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGTAGGCGGATTGGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGG
$ ctx=Deblur-Illumina-16S-V4-150nt-780653
$ echo $f | redbiom search features --context $ctx
56964_1924.Sadowsky.19
$ echo $f | redbiom search features --context $ctx | redbiom fetch sample-metadata --output example.txt
$ cat example.txt
#SampleID	animations_gradient	animations_subject	anonymized_name	collection_timestamp	day_relative_to_fmt	description	disease_state	dna_extracted	elevation	empo_1	empo_2	empo_3	env_biome	env_feature	env_material	env_package	geo_loc_name	height_or_length	host_age	host_age_units	host_body_habitat	host_body_mass_index	host_body_product	host_body_site	host_common_name	host_height	host_height_units	host_scientific_name	host_subject_id	host_taxid	host_weight	host_weight_units	latitude	longitude	physical_specimen_location	physical_specimen_remaining	public	qiita_study_id	race	sample_location	sample_type	scientific_name	sex	taxon_id	title	tot_mass
1924.Sadowsky.19	17	CD1	Sadowsky.19	2011-08-12	17	Day 17 CD1	post-FMT	TRUE	440.82907	Host-associated	Animal	Animal distal gut	urban biome	human-associated habitat	feces	human-gut	USA:MN	1.651	39	years	UBERON:feces	29.3	UBERON:feces	UBERON:feces	human	1.651	m	Homo sapiens	CD1	9606	80.1	kg	46.72955	-94.6859	UCSD	TRUE	TRUE	1924	white	UCSD	Stool	human gut metagenome	female	408170	sadowsky_CDdiff_transplants	80.1

Hi @wasade,
I am currently trying to run this command:
redbiom summarize features --from ~/Downloads/fd959588-db42-4de4-899f-68fa0938a038/data/dna-sequences.fasta --context Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b --category qiita_empo_3

and I am getting this error: raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I am not exactly sure what I am doing wrong: I was able to successfully run the search features command and was able to get a list of samples
redbiom search features --from ~/Downloads/fd959588-db42-4de4-899f-68fa0938a038/data/dna-sequences.fasta --context Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b >samples

I tried to run the summarize samples command and got the same json error:

redbiom summarize samples --from samples --category qiita_empo_3

I was also able to run

redbiom fetch sample-metadata --from samples --output metadata

However it only had sample ids and qiita study id in the metadata.

Do you have any advice for fixing this error?
Thank you!

Hi @cherman2,

Any chance you could share the output of:

$ head ~/Downloads/fd959588-db42-4de4-899f-68fa0938a038/data/dna-sequences.fasta

I can then run some of the sequences and see about reproducing this locally

Best,
Daniel

Hey @wasade,
I had a co-worker run the command successfully on my sequences. I guess something in my conda packages is causing this error.
Thanks for the help!

Very weird, well I'm glad it was sorted out!

I am trying to build a human-stool.qza to create a weighted classifier in QIIME 2.

First I ran

redbiom search metadata "where host_taxid==9606 and (sample_type=='stool' or sample_type=='Stool')" > samples

Then I ran the next step and got this error:

redbiom fetch samples --from samples --context $CTX --output data.biom
Traceback (most recent call last):
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/bin/redbiom", line 8, in <module>
    sys.exit(cli())
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/redbiom/commands/fetch.py", line 225, in fetch_samples_from_samples
    table, ambig = redbiom.fetch.data_from_samples(context, iterable,
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/redbiom/fetch.py", line 306, in data_from_samples
    return _biom_from_samples(context, samples, skip_taxonomy=skip_taxonomy)
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/redbiom/fetch.py", line 399, in _biom_from_samples
    table.update_ids(rimap)
  File "/media/microviable/d/miniconda/qiime2-shotgun-2023.9/lib/python3.8/site-packages/biom/table.py", line 1406, in update_ids
    str_dtype = 'U%d' % max([len(v) for v in id_map.values()])
ValueError: max() arg is an empty sequence

Any idea? Thanks

Thanks, @imonteroo! Do you recall what context was being used for the fetch?

An important but difficult set of TODOs I have is to get back to redbiom and improve its error handling. My guess here is the context used doesn't contain any of the relevant samples.

Best,
Daniel
