Summary of changes to Metadata in QIIME 2 2018.2 release
There are some exciting changes to Metadata in QIIME 2 included in the 2018.2 release. This forum topic summarizes some of the more noticeable changes for both user and developer audiences. For a complete description of the new Metadata file format, along with example data and a tutorial, see the Metadata docs.
If you have any questions or feedback on these changes, please create new forum topic(s) and we'll get in touch. Thanks!
New Metadata file format
These changes affect all users and developers interacting with QIIME 2 Metadata files.
There is a new Metadata file format specification which builds upon the previous file format and irons out some undefined behavior and bugs that users have come across.
The new file format is mostly backwards-compatible with the previous format, and chances are that your existing Metadata files will continue to work with the new format. Also, the new format is designed to be backwards-compatible with QIIME 1 mapping files, Qiita sample/prep templates, and biom-format observation metadata files. To see if your existing Metadata files are supported with the new file format, simply try using them with QIIME 2, or validate them with Keemei -- an error will be raised by QIIME 2 if the file isn't supported (see the section Metadata validation below for details).
However, some important changes affect how your metadata are interpreted in QIIME 2. Please review them to make sure your metadata are interpreted the way you expect.
Here are the highlights of the new format. For a complete description of the format, check out the Metadata docs.
Required header
The previous format did not specify a required header (i.e. a row denoting column names in the file). The lack of a required header led to many cases of unintended behavior when the files were used in QIIME 2 (most of these cases were brought up by forum users). Without a required header, it was very easy to forget to include one in the file, which would cause the first sample ID (or feature ID) to be used as the header, effectively ignoring that sample or feature ID in analyses.
This issue was most pronounced when performing metadata-based filtering of feature tables, sequences, distance matrices, etc. (e.g. using qiime feature-table filter-samples, qiime feature-table filter-features, etc.). It is very easy to create a simple Metadata file to perform ID-based filtering and forget to include a header. The net effect was that filtering ignored the first ID in the file because that ID was interpreted as a header (and without a required header, QIIME 2 had no way of distinguishing between the presence or absence of a header). This led to incorrect filtering behavior from the user's perspective.
To alleviate these issues, the new format has a very minimal required header. Only the first column in the file (which contains the sample/feature IDs) is required to have a specific column name. The first column may have one of the following values:
Case-insensitive:
- id
- sampleid
- sample id
- sample-id
- featureid
- feature id
- feature-id
Case-sensitive (these are mostly for backwards-compatibility with QIIME 1, biom-format, and Qiita files):
- #SampleID
- #Sample ID
- #OTUID
- #OTU ID
- sample_name
If your metadata file's first column doesn't match one of the names listed above, an error will be raised when the file is loaded in QIIME 2.
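The header rule above can be sketched in a few lines of plain Python. This is only an illustration of the rule as described in this post, not QIIME 2's actual validation code: the function names `is_valid_id_header` and `check_id_header` are hypothetical, and the sketch assumes the header is the first line of the file.

```python
# Hypothetical helper names; the accepted header values are copied from the
# two lists above.
CASE_INSENSITIVE_ID_HEADERS = frozenset({
    "id", "sampleid", "sample id", "sample-id",
    "featureid", "feature id", "feature-id",
})
CASE_SENSITIVE_ID_HEADERS = frozenset({
    "#SampleID", "#Sample ID", "#OTUID", "#OTU ID", "sample_name",
})


def is_valid_id_header(name: str) -> bool:
    """Return True if `name` is an accepted name for the first column."""
    return (name in CASE_SENSITIVE_ID_HEADERS
            or name.lower() in CASE_INSENSITIVE_ID_HEADERS)


def check_id_header(tsv_text: str) -> str:
    """Return the ID header of a TSV string, or raise ValueError.

    Assumption of this sketch: the header is the first line of the text.
    """
    lines = tsv_text.splitlines()
    if not lines:
        raise ValueError("Metadata file is empty; a header row is required.")
    id_header = lines[0].split("\t")[0]
    if not is_valid_id_header(id_header):
        raise ValueError(
            f"First column is named {id_header!r}, which is not an accepted "
            "ID header. Was the header row left out?")
    return id_header
```

With a check like this, a file that starts directly with `sample-1` is rejected instead of silently losing its first ID to the header row.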
Missing data
The previous format didn't describe how to store missing data in Metadata files. See this forum topic for more details about how missing data were previously interpreted in QIIME 2, along with some discussion about missing data support in the new format.
Storing missing data in your metadata is simple: just use an empty cell! Values like NA, nan, etc. are no longer interpreted as missing data.
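As a rough illustration of the empty-cell rule, here is a standard-library sketch that reads a small TSV and represents missing values as `None`. The function name `read_rows`, the example data, and the choice of `None` as the in-memory marker are assumptions made for this example; QIIME 2's own loader does more than this.

```python
import csv
import io

# Made-up example data: sample-2 has no BodySite, sample-3 has no pH.
EXAMPLE_TSV = (
    "id\tBodySite\tpH\n"
    "sample-1\tgut\t6.5\n"
    "sample-2\t\t7.1\n"
    "sample-3\tNA\t\n"
)


def read_rows(tsv_text):
    """Parse a TSV string into a list of dicts, one per row.

    Only an empty cell is treated as missing (stored as None). Strings such
    as "NA" or "nan" are kept as ordinary values.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    rows = []
    for cells in reader:
        rows.append({name: (cell if cell != "" else None)
                     for name, cell in zip(header, cells)})
    return rows


rows = read_rows(EXAMPLE_TSV)
```

In this sketch the `NA` in the third row survives as the literal string `"NA"`, which mirrors the point above: it is a value, not a missing-data marker.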
Column typing
Metadata column types (i.e. numeric vs categorical data) were previously inferred by QIIME 2. For example, if a column consisted only of numbers, the column would be inferred to be numeric. Users had no way to override this inference to state that the column is actually categorical data. For example, a Subject column where subjects are labeled 1, 2, 3 should be treated categorically and not numerically, but there was no way to override that behavior with the previous format. The workaround was to create a new column containing non-numeric values denoting subjects, which would cause QIIME 2 to infer the column type as categorical data.
With the new format, QIIME 2 will continue to infer column types in the same way as before. However, the new format supports a special comment directive that allows users to specify a column's type, avoiding the inference described above.
The column typing comment directive is entirely optional, and if it is present, you don't have to specify a type for each column. This makes it easy to fill in column types as necessary; if a column's type isn't declared, the type will be inferred as usual.
The comment directive must appear directly below the header and the first cell must be labelled #q2:types. Subsequent cells may be labelled categorical, numeric, or left empty to have the type inferred.
Here is a simple example:
| #SampleID | Subject | BodySite | DaysSinceExperimentStart |
|---|---|---|---|
| #q2:types | categorical | categorical | numeric |
| sample-1 | 1 | gut | 20 |
| sample-2 | 1 | tongue | 25 |
| sample-3 | 2 | gut | 15 |
| sample-4 | 2 | tongue | 42 |
Here we have two categorical columns (Subject and BodySite) and one numeric column (DaysSinceExperimentStart). Using the column typing comment directive, we are able to override QIIME 2's inference for the Subject column by stating that the column is categorical data. If the Subject column wasn't labelled as categorical, it would be interpreted as numeric data, which is probably not what the user intended.
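To make the interplay between declared and inferred types concrete, here is a simplified standard-library sketch. It is not QIIME 2's parser: `resolve_column_types` is a hypothetical name, "numeric" is approximated by "every non-empty value parses with `float()`", rows are assumed to be rectangular, and treating an all-empty column as categorical is an arbitrary choice made for this example.

```python
import csv
import io


def _is_number(value):
    """Approximate numeric check used only for this sketch."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def resolve_column_types(tsv_text):
    """Map each metadata column name to 'categorical' or 'numeric'.

    Types declared in an optional #q2:types row (directly below the header)
    win; columns left empty in that row, or all columns when the row is
    absent, fall back to inference.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip blank lines
    columns = rows[0][1:]  # first header cell is the ID column
    data_rows = rows[1:]

    declared = {}
    if data_rows and data_rows[0][0] == "#q2:types":
        directive, data_rows = data_rows[0], data_rows[1:]
        for name, cell in zip(columns, directive[1:]):
            if cell == "":
                continue  # left empty: infer this column's type
            if cell not in ("categorical", "numeric"):
                raise ValueError(f"Unknown column type {cell!r} for {name!r}")
            declared[name] = cell

    types = {}
    for position, name in enumerate(columns, start=1):
        if name in declared:
            types[name] = declared[name]
            continue
        values = [row[position] for row in data_rows if row[position] != ""]
        is_numeric = bool(values) and all(_is_number(v) for v in values)
        types[name] = "numeric" if is_numeric else "categorical"
    return types
```

Applied to the table above with only Subject declared as categorical, the other two columns are inferred. Without the directive row, Subject falls back to inference and comes out as numeric.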
Name change: metadata column vs category
In previous versions of QIIME 2 and QIIME 1, metadata columns were often referred to as metadata categories. Now that we support metadata column typing, which allows you to say whether a column contains numeric or categorical data, we would end up using terms like categorical metadata category or numeric metadata category, which can be confusing. We now avoid using the term category unless it is used in the context of categorical metadata.
This name change may be most noticeable for CLI users. Previously, any QIIME 2 action accepting a metadata column as input would have an option name ending with -category. The new option names end with -column, so previous commands will not work with the new naming scheme. For example, if a command previously used an option called --m-metadata-category, the new option name will be --m-metadata-column. If the CLI detects usage of the older option names, it will error and describe how to update your command to use the new option names.
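If you have scripts with many old-style options, the rename is mechanical. The helper below is a hypothetical sketch for updating your own scripts; it is not part of q2cli, and it assumes the only change needed is swapping a trailing -category for -column.

```python
def suggest_new_option(option: str) -> str:
    """Return the 2018.2-style name for an old `...-category` CLI option.

    Options that do not end with `-category` are returned unchanged.
    """
    old_suffix = "-category"
    if option.startswith("--") and option.endswith(old_suffix):
        return option[:-len(old_suffix)] + "-column"
    return option


updated = suggest_new_option("--m-metadata-category")
```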
Metadata validation
Metadata validation works the same as before: sample and feature metadata files stored in Google Sheets can be validated using Keemei.
QIIME 2 will also automatically validate a metadata file anytime it is used by the software. However, using Keemei to validate your metadata is recommended because a report of all validation errors and warnings will be presented each time Keemei is run. Loading your metadata in QIIME 2 will typically present only a single error at a time, which can make identifying and resolving validation issues cumbersome, especially if there are many issues with the metadata.
Metadata API changes
These changes affect Artifact API users and plugin/interface developers interacting with QIIME 2 Metadata.
There are many backwards-incompatible changes to the Metadata API. The changes listed below are not exhaustive; only the more pronounced changes are noted. If you have questions about updating your code to use the new API, please create new forum topic(s) and we'll get in touch. Since there are so many API changes, it is likely that your code (plugins, interfaces, Artifact API) will need to be updated to continue functioning with the 2018.2 release.
Design overview
The qiime2.Metadata class continues to exist and has an updated API. The Metadata object is composed of zero or more qiime2.MetadataColumn objects.
The qiime2.MetadataCategory class has been renamed to qiime2.MetadataColumn, which is an abstract base class (ABC) and cannot be instantiated directly. There are two subclasses to represent categorical and numeric metadata columns: qiime2.CategoricalMetadataColumn and qiime2.NumericMetadataColumn, respectively. These concrete subclasses are the objects your code will interact with at runtime.
Metadata objects
When using qiime2.Metadata objects, here are some of the bigger changes to be aware of:
- Metadata.load() has an optional column_types parameter, which allows you to override the metadata column types that are declared or inferred from the file being loaded. This is useful if you wish to programmatically override column types at runtime without having to modify the metadata file.
- There is a new method, Metadata.save(), for saving a Metadata file to disk in TSV format. The Metadata file will be written with as much detail as possible to ensure that the file is roundtrippable. For example, a column types comment directive will always be written to indicate what column types were used at runtime, in order to make analyses reproducible without relying on column type inference.
- Metadata.from_artifact() has been removed in favor of Artifact.view(Metadata), which matches the usual way of obtaining a particular view type from an Artifact (Metadata is no longer a special case in that regard).
- The Metadata constructor continues to accept a pandas.DataFrame object, with the following requirements and considerations:
  - The dataframe's index name (df.index.name) must match one of the required headers listed in the file format section above. An index name of None (the pandas default) is no longer accepted. It's easy to set an index name when creating an index object on the dataframe, e.g. pd.Index([...], name='id').
  - If a column in the dataframe is dtype=object, it may contain strings or pandas missing values (e.g. np.nan, None). Columns matching this requirement are assumed to be categorical. Type casting/inference does not take place within the constructor currently.
  - If a column in the dataframe is dtype=float or dtype=int, it may contain floating point numbers or integers, as well as pandas missing values (e.g. np.nan). Columns matching this requirement are assumed to be numeric.
  - Regardless of column type (categorical vs numeric), the dataframe stored within the Metadata object will have any missing values normalized to use np.nan. Columns with dtype=int will be cast to dtype=float. To obtain a dataframe from Metadata containing these normalized data types and values, use Metadata.to_dataframe().
  - Since Metadata stores typed columns (dtype=object containing strings, or dtype=float containing numbers), there is no need to manually cast columns to the appropriate dtype anymore when using a dataframe obtained from Metadata.to_dataframe().
  - There are other restrictions on the values in the input dataframe, which rule out values that would prevent the file written by Metadata.save() from being roundtrippable. For example, IDs can't begin with a pound sign (#) because the written out file would have that row represented as a comment line instead of a row containing an ID.
- Metadata.columns is a read-only OrderedDict object mapping metadata column names to ColumnProperties namedtuples. The ColumnProperties namedtuple currently stores a single field called .type, which provides the column's type as a string (either 'categorical' or 'numeric'). Since Metadata.columns is an ordered mapping, iterating over the columns will maintain the order of columns in the Metadata object. See Metadata.get_column() for retrieving a MetadataColumn object by column name.
- Metadata.get_category() has been renamed to Metadata.get_column(). The method retrieves a MetadataColumn object (either CategoricalMetadataColumn or NumericMetadataColumn) based on the column name that's requested.
- The Metadata.ids() method has been renamed to Metadata.get_ids(). Its interface remains the same.
- The Metadata.ids property returns a tuple of IDs.
- The Metadata.id_count and Metadata.column_count properties provide access to the Metadata table dimensions.
- The Metadata.id_header property returns the name specified in the input dataframe's index (this corresponds to the first column's name in the metadata file if loading from disk).
- Metadata.filter() has been replaced by Metadata.filter_ids() and Metadata.filter_columns(). These methods perform ID-based and column-based filtering, respectively. Both methods return a filtered version of the Metadata object.
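Here is a short sketch that ties several of the points above together: building a Metadata object from a dataframe with a named index, inspecting column types, retrieving a column, and filtering by ID. The example data and the function names `example_columns` and `summarize_metadata` are made up for illustration. The QIIME 2 and pandas imports are placed inside the function so the definitions can be loaded anywhere, but calling `summarize_metadata()` needs both pandas and QIIME 2 2018.2 or later.

```python
def example_columns():
    """Return made-up sample IDs and column data as plain Python objects."""
    ids = ["sample-1", "sample-2", "sample-3", "sample-4"]
    columns = {
        # Strings give dtype=object, which Metadata treats as categorical.
        "Subject": ["1", "1", "2", "2"],
        "BodySite": ["gut", "tongue", "gut", "tongue"],
        # Integers give dtype=int, which Metadata treats as numeric.
        "DaysSinceExperimentStart": [20, 25, 15, 42],
    }
    return ids, columns


def summarize_metadata():
    """Build a Metadata object and collect a few of its properties."""
    # Imported here so this module loads without pandas/QIIME 2 installed.
    import pandas as pd
    import qiime2

    ids, columns = example_columns()
    # The index name must be one of the accepted ID headers, e.g. 'id'.
    df = pd.DataFrame(columns, index=pd.Index(ids, name="id"))
    md = qiime2.Metadata(df)

    days = md.get_column("DaysSinceExperimentStart")
    gut_only = md.filter_ids(["sample-1", "sample-3"])
    return {
        "id_header": md.id_header,
        "shape": (md.id_count, md.column_count),
        "column_types": {name: props.type
                         for name, props in md.columns.items()},
        "days_has_missing_values": days.has_missing_values(),
        "filtered_ids": gut_only.ids,
    }
```

Storing Subject as strings is what makes the constructor treat it as categorical here, since no type inference happens in the constructor.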
MetadataColumn objects
When using qiime2.MetadataColumn objects (i.e. CategoricalMetadataColumn and NumericMetadataColumn), here are some of the bigger changes to be aware of:
- MetadataColumn.load() has been removed in favor of using Metadata.load() to obtain a Metadata object, followed by calling Metadata.get_column() to retrieve a column object.
- MetadataColumn.from_artifact() has been removed in favor of using Artifact.view(Metadata) to obtain a Metadata object, followed by calling Metadata.get_column() to retrieve a column object.
- The CategoricalMetadataColumn and NumericMetadataColumn constructors continue to accept a pandas.Series object, with the same requirements and considerations listed above for the Metadata constructor. The series object must have its .name property set to a valid metadata column name (the pandas default of None is no longer accepted). The series index name must also be one of the required ID header values listed earlier in this post.
- There are several new methods and properties on the MetadataColumn object. Many of these APIs are similar to the methods/properties on the Metadata object described above.
  - New properties: name, type, ids, id_count, id_header
  - New methods: save(), to_dataframe(), get_ids(), filter_ids(), get_value(), has_missing_values(), drop_missing_values()
Plugin registration
In addition to the Metadata and MetadataColumn API changes described above, the way that plugins register MetadataColumn parameters is a bit different.
Note: If your plugin's action accepts a qiime2.Metadata object, register your function in the same way as before. For example, if you have a method called foo that accepts a qiime2.Metadata object as input:
# function declaration
import qiime2

def foo(metadata: qiime2.Metadata):
    ...

# plugin registration
import qiime2.plugin

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': qiime2.plugin.Metadata},
    ...
)
Now let's consider a method foo that accepts a MetadataColumn as input. Since metadata columns are typed, you'll have to decide whether your action accepts a CategoricalMetadataColumn, NumericMetadataColumn, or either type as input.
Let's have the foo method accept a NumericMetadataColumn as input. Here's what that might look like:
# function declaration
import qiime2

def foo(metadata: qiime2.NumericMetadataColumn):
    ...

# plugin registration
from qiime2.plugin import MetadataColumn, Numeric

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Numeric]},
    ...
)
Similarly, if the method accepts a CategoricalMetadataColumn:
# function declaration
import qiime2

def foo(metadata: qiime2.CategoricalMetadataColumn):
    ...

# plugin registration
from qiime2.plugin import MetadataColumn, Categorical

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Categorical]},
    ...
)
Finally, if the method accepts either a numeric or categorical column:
# function declaration
import qiime2

def foo(metadata: qiime2.MetadataColumn):
    # Note that the MetadataColumn ABC is used in the function annotation.
    # At runtime, the object received as input will be either
    # `NumericMetadataColumn` or `CategoricalMetadataColumn`.
    # You can differentiate between object types using `isinstance()`.
    # For example:
    if isinstance(metadata, qiime2.NumericMetadataColumn):
        ...
    elif isinstance(metadata, qiime2.CategoricalMetadataColumn):
        ...
    else:
        raise NotImplementedError()

# plugin registration
from qiime2.plugin import MetadataColumn, Numeric, Categorical

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Numeric | Categorical]},
    ...
)
Additional resources
All of the plugins and interfaces included in the QIIME 2 Core 2018.2 Distribution have been updated to use the new Metadata API. In addition to the documentation in this post, looking at plugin or interface source code will provide concrete examples of the new Metadata API in action! Take a look at any of the repositories under the qiime2 GitHub organization that utilize QIIME 2 Metadata. q2-diversity and q2-feature-table are good examples of plugins using Metadata, and q2cli is an example of a QIIME 2 interface that leverages Metadata.