Summary of changes to Metadata in QIIME 2 2018.2 release
There are some exciting changes to Metadata in QIIME 2 included in the 2018.2 release. This forum topic summarizes some of the more noticeable changes for both user and developer audiences. Click here to see a complete description of the new Metadata file format, along with example data and a tutorial.
If you have any questions or feedback on these changes, please create new forum topic(s) and we’ll get in touch. Thanks!
New Metadata file format
These changes affect all users and developers interacting with QIIME 2 Metadata files.
There is a new Metadata file format specification which builds upon the previous file format and irons out some undefined behavior and bugs that users have come across. Click here to see a complete description of the new Metadata file format, along with example data and a tutorial.
The new file format is mostly backwards-compatible with the previous format, and chances are that your existing Metadata files will continue to work with the new format. Also, the new format is designed to be backwards-compatible with QIIME 1 mapping files, Qiita sample/prep templates, and biom-format observation metadata files. To see if your existing Metadata files are supported with the new file format, simply try using them with QIIME 2, or validate them with Keemei – an error will be raised by QIIME 2 if the file isn’t supported (see the section Metadata validation below for details).
However, some important changes affect how your metadata is interpreted in QIIME 2. Please review the following changes to ensure that your metadata is interpreted in the way you’re expecting.
Here are the highlights of the new format. For a complete description of the format, check out the Metadata docs.
Required header
The previous format did not specify a required header (i.e. a row denoting column names in the file). The lack of a required header led to many cases of unintended behavior when the files were used in QIIME 2 (most of these cases were brought up by forum users). Without a required header, it is very easy to forget to include one, which would cause the first sample ID (or feature ID) to be used as the header, effectively ignoring that sample or feature ID in analyses.
This issue was most pronounced when performing metadata-based filtering of feature tables, sequences, distance matrices, etc. (e.g. using qiime feature-table filter-samples, qiime feature-table filter-features, etc.). It is very easy to create a simple Metadata file to perform ID-based filtering and forget to include a header. The net effect was that filtering ignored the first ID in the file because that ID was interpreted as a header (and without a required header, QIIME 2 has no way of distinguishing between the presence or absence of a header). This led to incorrect filtering behavior from the user’s perspective.
To alleviate these issues, the new format has a very minimal required header. Only the first column in the file (which contains the sample/feature IDs) is required to have a specific column name. The first column’s name must be one of the following values:
Case-insensitive:
id
sampleid
sample id
sample-id
featureid
feature id
feature-id
Case-sensitive (these are mostly for backwards-compatibility with QIIME 1, biom-format, and Qiita files):
#SampleID
#Sample ID
#OTUID
#OTU ID
sample_name
If your metadata file’s first column doesn’t match one of the names listed above, an error will be raised when the file is loaded in QIIME 2.
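The header rule above can be expressed as a small check. This is a minimal sketch of the rule as listed in this post, not QIIME 2’s own validation code; the function name is_valid_id_header is hypothetical, and details such as whitespace handling are not covered.

```python
# Illustrative only: mirrors the header lists in this post,
# not QIIME 2's internal validation code.
CASE_INSENSITIVE_HEADERS = {
    'id', 'sampleid', 'sample id', 'sample-id',
    'featureid', 'feature id', 'feature-id',
}
CASE_SENSITIVE_HEADERS = {
    '#SampleID', '#Sample ID', '#OTUID', '#OTU ID', 'sample_name',
}


def is_valid_id_header(name: str) -> bool:
    """Return True if `name` is an accepted name for the first (ID) column."""
    return (name in CASE_SENSITIVE_HEADERS
            or name.lower() in CASE_INSENSITIVE_HEADERS)


print(is_valid_id_header('Sample-ID'))  # True (case-insensitive match)
print(is_valid_id_header('#SampleID'))  # True (exact match)
print(is_valid_id_header('#sampleid'))  # False (this name is case-sensitive)
```

Running a check like this over the first cell of a file before handing it to QIIME 2 is one way to catch a forgotten header early.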
Missing data
The previous format didn’t describe how to store missing data in Metadata files. See this forum topic for more details about how missing data were previously interpreted in QIIME 2, along with some discussion about missing data support in the new format.
Storing missing data in your metadata is simple: just use an empty cell! Values like NA, nan, etc. are no longer interpreted as missing data.
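To make the empty-cell rule concrete, here is a minimal sketch that reads a tiny tab-separated table with Python’s csv module and treats only empty cells as missing. It illustrates the rule and is not how QIIME 2 parses files; the sample data and the read_cells helper are made up for this example.

```python
import csv
import io

# A tiny in-memory metadata file: sample-2 has an empty BodySite cell,
# while sample-3 contains the literal text "NA".
TSV = (
    "id\tBodySite\n"
    "sample-1\tgut\n"
    "sample-2\t\n"
    "sample-3\tNA\n"
)


def read_cells(text):
    """Map each ID to its BodySite value, using None for empty cells only."""
    rows = csv.reader(io.StringIO(text), delimiter='\t')
    next(rows)  # skip the header row
    return {row[0]: (row[1] if row[1] != '' else None) for row in rows}


cells = read_cells(TSV)
print(cells['sample-2'])  # None: the empty cell is missing data
print(cells['sample-3'])  # NA: kept as an ordinary value
```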
Column typing
Metadata column types (i.e. numeric vs categorical data) were previously inferred by QIIME 2. For example, if a column consisted only of numbers, the column would be inferred to be numeric. Users had no way to override this inference to state that the column is actually categorical data. For example, a Subject column where subjects are labeled 1, 2, 3 should be treated categorically and not numerically, but there was no way to override that behavior with the previous format. The workaround was to create a new column containing non-numeric values denoting subjects, which would cause QIIME 2 to infer the column type as categorical data.
With the new format, QIIME 2 will continue to infer column types in the same way as before. However, the new format supports a special comment directive that allows users to specify a column’s type, avoiding the inference described above.
The column typing comment directive is entirely optional, and if it is present, you don’t have to specify a type for each column. This makes it easy to fill in column types as necessary; if a column’s type isn’t declared, the type will be inferred as usual.
The comment directive must appear directly below the header and the first cell must be labelled #q2:types. Subsequent cells may be labelled categorical, numeric, or left empty to have the type inferred.
Here is a simple example:
#SampleID | Subject | BodySite | DaysSinceExperimentStart |
---|---|---|---|
#q2:types | categorical | categorical | numeric |
sample-1 | 1 | gut | 20 |
sample-2 | 1 | tongue | 25 |
sample-3 | 2 | gut | 15 |
sample-4 | 2 | tongue | 42 |
Here we have two categorical columns (Subject and BodySite) and one numeric column (DaysSinceExperimentStart). Using the column typing comment directive, we are able to override QIIME 2’s inference for the Subject column by stating that the column is categorical data. If the Subject column wasn’t labelled as categorical, it would be interpreted as numeric data, which is probably not what the user intended.
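The interplay between declared and inferred types can be sketched in a few lines of plain Python. This is a simplified illustration, not QIIME 2’s parser: the resolve_column_types and looks_numeric helpers are hypothetical, numeric inference is reduced to “every non-empty cell parses as a float”, and the table is a variation on the example above with BodySite’s type cell left empty so that it is inferred.

```python
import csv
import io

# Variation on the example table: BodySite's type cell is left empty,
# so its type has to be inferred.
TSV = (
    "#SampleID\tSubject\tBodySite\tDaysSinceExperimentStart\n"
    "#q2:types\tcategorical\t\tnumeric\n"
    "sample-1\t1\tgut\t20\n"
    "sample-2\t1\ttongue\t25\n"
    "sample-3\t2\tgut\t15\n"
    "sample-4\t2\ttongue\t42\n"
)


def looks_numeric(cell):
    """Simplified stand-in for inference: can the cell be parsed as a float?"""
    try:
        float(cell)
    except ValueError:
        return False
    return True


def resolve_column_types(text):
    """Return {column name: type}, preferring declared types over inferred."""
    rows = list(csv.reader(io.StringIO(text), delimiter='\t'))
    header, body = rows[0], rows[1:]
    declared = [''] * len(header)
    # The directive, if present, sits directly below the header.
    if body and body[0][0] == '#q2:types':
        declared, body = body[0], body[1:]
    types = {}
    for i, name in enumerate(header[1:], start=1):
        if declared[i]:
            types[name] = declared[i]
            continue
        values = [row[i] for row in body if row[i] != '']
        # Treating an all-empty column as categorical is an arbitrary
        # choice made by this sketch.
        is_numeric = bool(values) and all(looks_numeric(v) for v in values)
        types[name] = 'numeric' if is_numeric else 'categorical'
    return types


with_directive = resolve_column_types(TSV)
without_directive = resolve_column_types(
    ''.join(line for line in TSV.splitlines(keepends=True)
            if not line.startswith('#q2:types'))
)
print(with_directive['Subject'])     # categorical (declared)
print(without_directive['Subject'])  # numeric (inferred from 1, 1, 2, 2)
```

Dropping the directive row flips Subject from categorical to numeric, which is exactly the situation the directive is meant to prevent.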
Name change: metadata column vs category
In previous versions of QIIME 2 and QIIME 1, metadata columns were often referred to as metadata categories. Now that we support metadata column typing, which allows you to say whether a column contains numeric or categorical data, we would end up using terms like categorical metadata category or numeric metadata category, which can be confusing. We now avoid using the term category unless it is used in the context of categorical metadata.
This name change may be most noticeable for CLI users. Previously, any QIIME 2 action accepting a metadata column as input would have an option name ending with -category. The new option names end with -column, so previous commands will not work with the new naming scheme. For example, if a command previously used an option called --m-metadata-category, the new option name will be --m-metadata-column. If the CLI detects usage of the older option names, it will error and describe how to update your command to use the new option names.
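If you maintain scripts that build QIIME 2 commands as argument lists, the rename can be applied mechanically. The helper below is a hypothetical sketch and is not part of q2cli; it assumes the affected options start with --m- (as in the example option above), and the plugin and action names in the sample command are placeholders.

```python
def update_option_names(argv):
    """Rename '--m-...-category' options so they end in '-column'."""
    updated = []
    for arg in argv:
        # Support both "--opt value" and "--opt=value" spellings.
        name, sep, value = arg.partition('=')
        if name.startswith('--m-') and name.endswith('-category'):
            name = name[:-len('-category')] + '-column'
        updated.append(name + sep + value)
    return updated


# Placeholder plugin/action names; only the option names matter here.
old_command = [
    'qiime', 'example-plugin', 'example-action',
    '--m-metadata-file', 'metadata.tsv',
    '--m-metadata-category', 'BodySite',
]
new_command = update_option_names(old_command)
print(new_command[-2])  # --m-metadata-column
```

Check the rewritten command against the action’s help text before relying on it, since this sketch only pattern-matches option names.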
Metadata validation
Metadata validation works the same as before: sample and feature metadata files stored in Google Sheets can be validated using Keemei.
QIIME 2 will also automatically validate a metadata file anytime it is used by the software. However, using Keemei to validate your metadata is recommended because a report of all validation errors and warnings will be presented each time Keemei is run. Loading your metadata in QIIME 2 will typically present only a single error at a time, which can make identifying and resolving validation issues cumbersome, especially if there are many issues with the metadata.
Metadata API changes
These changes affect Artifact API users and plugin/interface developers interacting with QIIME 2 Metadata.
There are many backwards-incompatible changes to the Metadata API. The changes listed below are not exhaustive; only the more pronounced changes are noted. If you have questions about updating your code to use the new API, please create new forum topic(s) and we’ll get in touch. Since there are so many API changes, it is likely that your code (plugins, interfaces, Artifact API) will need to be updated to continue functioning with the 2018.2 release.
Design overview
The qiime2.Metadata class continues to exist and has an updated API. The Metadata object is composed of zero or more qiime2.MetadataColumn objects.
The qiime2.MetadataCategory class has been renamed to qiime2.MetadataColumn, which is an abstract base class (ABC) and cannot be instantiated directly. There are two subclasses to represent categorical and numeric metadata columns: qiime2.CategoricalMetadataColumn and qiime2.NumericMetadataColumn, respectively. These concrete subclasses are the objects your code will interact with at runtime.
Metadata objects
When using qiime2.Metadata objects, here are some of the bigger changes to be aware of:
- Metadata.load() has an optional column_types parameter, which allows you to override the metadata column types that are declared or inferred from the file being loaded. This is useful if you wish to programmatically override column types at runtime without having to modify the metadata file.
- There is a new method, Metadata.save(), for saving a Metadata file to disk in TSV format. The Metadata file will be written with as much detail as possible to ensure that the file is roundtrippable. For example, a column types comment directive will always be written to indicate what column types were used at runtime, in order to make analyses reproducible without relying on column type inference.
- Metadata.from_artifact() has been removed in favor of Artifact.view(Metadata), which matches the usual way of obtaining a particular view type from an Artifact (Metadata is no longer a special case in that regard).
- The Metadata constructor continues to accept a pandas.DataFrame object, with the following requirements and considerations:
  - The dataframe’s index name (df.index.name) must match one of the required headers listed in the file format section above. An index name of None (the pandas default) is no longer accepted. It’s easy to set an index name when creating an index object on the dataframe, e.g. pd.Index([...], name='id').
  - If a column in the dataframe is dtype=object, it may contain strings or pandas missing values (e.g. np.nan, None). Columns matching this requirement are assumed to be categorical. Type casting/inference does not take place within the constructor currently.
  - If a column in the dataframe is dtype=float or dtype=int, it may contain floating point numbers or integers, as well as pandas missing values (e.g. np.nan). Columns matching this requirement are assumed to be numeric.
  - Regardless of column type (categorical vs numeric), the dataframe stored within the Metadata object will have any missing values normalized to use np.nan. Columns with dtype=int will be cast to dtype=float. To obtain a dataframe from Metadata containing these normalized data types and values, use Metadata.to_dataframe().
  - Since Metadata stores typed columns (dtype=object containing strings, or dtype=float containing numbers), there is no need to manually cast columns to the appropriate dtype anymore when using a dataframe obtained from Metadata.to_dataframe().
  - There are other restrictions on the values in the input dataframe, which rule out values that would prevent the file written by Metadata.save() from being roundtrippable. For example, IDs can’t begin with a pound sign (#) because the written out file would have that row represented as a comment line instead of a row containing an ID.
- Metadata.columns is a read-only OrderedDict object mapping metadata column names to ColumnProperties namedtuples. The ColumnProperties namedtuple currently stores a single field called .type, which provides the column’s type as a string (either 'categorical' or 'numeric'). Since Metadata.columns is an ordered mapping, iterating over the columns will maintain the order of columns in the Metadata object. See Metadata.get_column() for retrieving a MetadataColumn object by column name.
- Metadata.get_category() has been renamed to Metadata.get_column(). The method retrieves a MetadataColumn object (either CategoricalMetadataColumn or NumericMetadataColumn) based on the column name that’s requested.
- The Metadata.ids() method has been renamed to Metadata.get_ids(). Its interface remains the same.
- The Metadata.ids property returns a tuple of IDs.
- The Metadata.id_count and Metadata.column_count properties provide access to the Metadata table dimensions.
- The Metadata.id_header property returns the name specified in the input dataframe’s index (this corresponds to the first column’s name in the metadata file if loading from disk).
- Metadata.filter() has been replaced by Metadata.filter_ids() and Metadata.filter_columns(). These methods perform ID-based and column-based filtering, respectively. Both methods return a filtered version of the Metadata object.
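To show the shape of Metadata.columns without requiring a QIIME 2 installation, here is a minimal stand-in built from the standard library. The ColumnProperties namedtuple and the columns mapping below are mock-ups that mimic the structure described in the list above (using the column names from the example table earlier in this post); they are not the objects provided by qiime2.

```python
from collections import OrderedDict, namedtuple

# Stand-in for the namedtuple described above; the real one comes from qiime2.
ColumnProperties = namedtuple('ColumnProperties', ['type'])

# Stand-in for `Metadata.columns`, using the example table's columns.
columns = OrderedDict([
    ('Subject', ColumnProperties(type='categorical')),
    ('BodySite', ColumnProperties(type='categorical')),
    ('DaysSinceExperimentStart', ColumnProperties(type='numeric')),
])

# Iteration preserves column order, so selections stay in table order.
numeric_names = [name for name, props in columns.items()
                 if props.type == 'numeric']
categorical_names = [name for name, props in columns.items()
                     if props.type == 'categorical']

print(numeric_names)      # ['DaysSinceExperimentStart']
print(categorical_names)  # ['Subject', 'BodySite']
```

With a real Metadata object, the same loop would run over its columns property, and each selected name could then be passed to Metadata.get_column() to obtain the corresponding column object.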
MetadataColumn objects
When using qiime2.MetadataColumn objects (i.e. CategoricalMetadataColumn and NumericMetadataColumn), here are some of the bigger changes to be aware of:
- MetadataColumn.load() has been removed in favor of using Metadata.load() to obtain a Metadata object, followed by calling Metadata.get_column() to retrieve a column object.
- MetadataColumn.from_artifact() has been removed in favor of using Artifact.view(Metadata) to obtain a Metadata object, followed by calling Metadata.get_column() to retrieve a column object.
- The CategoricalMetadataColumn and NumericMetadataColumn constructors continue to accept a pandas.Series object, with the same requirements and considerations listed above for the Metadata constructor. The series object must have its .name property set to a valid metadata column name (the pandas default of None is no longer accepted). The series index name must also be one of the required ID header values listed earlier in this post.
- There are several new methods and properties on the MetadataColumn object. Many of these APIs are similar to the methods/properties on the Metadata object described above.
  New properties: name, type, ids, id_count, id_header
  New methods: save(), to_dataframe(), get_ids(), filter_ids(), get_value(), has_missing_values(), drop_missing_values()
Plugin registration
In addition to the Metadata and MetadataColumn API changes described above, the way that plugins register MetadataColumn parameters is a bit different.
Note: If your plugin’s action accepts a qiime2.Metadata object, register your function in the same way as before. For example, if you have a method called foo that accepts a qiime2.Metadata object as input:
# function declaration
import qiime2

def foo(metadata: qiime2.Metadata):
    ...

# plugin registration
import qiime2.plugin

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': qiime2.plugin.Metadata},
    ...
)
Now let’s consider a method foo that accepts a MetadataColumn as input. Since metadata columns are typed, you’ll have to decide whether your action accepts a CategoricalMetadataColumn, NumericMetadataColumn, or either type as input.
Let’s have the foo method accept a NumericMetadataColumn as input. Here’s what that might look like:
# function declaration
import qiime2

def foo(metadata: qiime2.NumericMetadataColumn):
    ...

# plugin registration
from qiime2.plugin import MetadataColumn, Numeric

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Numeric]},
    ...
)
Similarly, if the method accepts a CategoricalMetadataColumn:
# function declaration
import qiime2

def foo(metadata: qiime2.CategoricalMetadataColumn):
    ...

# plugin registration
from qiime2.plugin import MetadataColumn, Categorical

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Categorical]},
    ...
)
Finally, if the method accepts either a numeric or categorical column:
# function declaration
import qiime2

def foo(metadata: qiime2.MetadataColumn):
    # Note that the MetadataColumn ABC is used in the function annotation.
    # At runtime, the object received as input will be either
    # `NumericMetadataColumn` or `CategoricalMetadataColumn`.
    # You can differentiate between object types using `isinstance()`.
    # For example:
    if isinstance(metadata, qiime2.NumericMetadataColumn):
        ...
    elif isinstance(metadata, qiime2.CategoricalMetadataColumn):
        ...
    else:
        raise NotImplementedError()

# plugin registration
from qiime2.plugin import MetadataColumn, Numeric, Categorical

plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Numeric | Categorical]},
    ...
)
Additional resources
All of the plugins and interfaces included in the QIIME 2 Core 2018.2 Distribution have been updated to use the new Metadata API. In addition to the documentation in this post, looking at plugin or interface source code will provide concrete examples of the new Metadata API in action! Take a look at any of the repositories under the qiime2 GitHub organization that utilize QIIME 2 Metadata. q2-diversity and q2-feature-table are good examples of plugins using Metadata, and q2cli is an example of a QIIME 2 interface that leverages Metadata.