Summary of changes to Metadata in QIIME 2 2018.2 release

There are some exciting changes to Metadata in QIIME 2 included in the 2018.2 release. This forum topic summarizes some of the more noticeable changes for both user and developer audiences. Click here to see a complete description of the new Metadata file format, along with example data and a tutorial.

If you have any questions or feedback on these changes, please create new forum topic(s) and we’ll get in touch. Thanks! :sunny:

New Metadata file format

These changes affect all users and developers interacting with QIIME 2 Metadata files.

There is a new Metadata file format specification which builds upon the previous file format and irons out some undefined behavior and bugs that users have come across. Click here to see a complete description of the new Metadata file format, along with example data and a tutorial.

The new file format is mostly backwards-compatible with the previous format, and chances are that your existing Metadata files will continue to work with the new format. Also, the new format is designed to be backwards-compatible with QIIME 1 mapping files, Qiita sample/prep templates, and biom-format observation metadata files. To see if your existing Metadata files are supported with the new file format, simply try using them with QIIME 2, or validate them with Keemei – an error will be raised by QIIME 2 if the file isn’t supported (see the section Metadata validation below for details).

However, some important changes affect how your metadata are interpreted in QIIME 2. Please review the changes below to make sure your metadata is interpreted the way you’re expecting.

Here are the highlights of the new format. For a complete description of the format, check out the Metadata docs.

Required header

The previous format did not specify a required header (i.e. a row denoting column names in the file). The lack of a required header led to many cases of unintended behavior when the files were used in QIIME 2 (most of these cases were brought up by forum users). Without a required header, it was very easy to forget to include one, which caused the first sample ID (or feature ID) to be used as the header, effectively ignoring that sample or feature ID in analyses.

This issue was most pronounced when performing metadata-based filtering of feature tables, sequences, distance matrices, etc. (e.g. using qiime feature-table filter-samples, qiime feature-table filter-features, etc.). It was very easy to create a simple Metadata file to perform ID-based filtering and forget to include a header. The net effect was that filtering ignored the first ID in the file because that ID was interpreted as a header (without a required header, QIIME 2 had no way of telling whether a header was present). This led to incorrect filtering behavior from the user’s perspective.
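
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not QIIME 2’s actual parser: read_ids_assuming_header is a hypothetical helper that mimics a reader which always treats the first row as a header. Given a headerless ID list, the first ID silently disappears.

```python
import csv
import io


def read_ids_assuming_header(tsv_text):
    """Return the IDs in the first column, treating row one as a header.

    Hypothetical helper for illustration only; it mimics a reader that
    cannot tell whether a header is present.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    # The first row is always consumed as the header, whatever it contains.
    return [row[0] for row in rows[1:] if row]


# A file the user intended as three sample IDs to keep, with no header.
headerless = "sample-1\nsample-2\nsample-3\n"
ids_to_keep = read_ids_assuming_header(headerless)
print(ids_to_keep)  # sample-1 is absent: it was consumed as the header
```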

To alleviate these issues, the new format has a very minimal required header. Only the first column in the file (which contains the sample/feature IDs) is required to have a specific column name. The first column may have one of the following values:

Case-insensitive:

  • id
  • sampleid
  • sample id
  • sample-id
  • featureid
  • feature id
  • feature-id

Case-sensitive (these are mostly for backwards-compatibility with QIIME 1, biom-format, and Qiita files):

  • #SampleID
  • #Sample ID
  • #OTUID
  • #OTU ID
  • sample_name

If your metadata file’s first column doesn’t match one of the names listed above, an error will be raised when the file is loaded in QIIME 2.
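
The rule above can be sketched as a small check in plain Python. This is only an illustration of the two lists, not the validation code QIIME 2 runs; the function and constant names are hypothetical, and details such as whitespace handling are ignored.

```python
# Accepted ID headers, copied from the lists above.
CASE_INSENSITIVE_HEADERS = {
    "id", "sampleid", "sample id", "sample-id",
    "featureid", "feature id", "feature-id",
}
CASE_SENSITIVE_HEADERS = {
    "#SampleID", "#Sample ID", "#OTUID", "#OTU ID", "sample_name",
}


def is_valid_id_header(name):
    """Return True if `name` is an accepted first-column header."""
    return (name.lower() in CASE_INSENSITIVE_HEADERS
            or name in CASE_SENSITIVE_HEADERS)
```

With this sketch, Sample-ID and #SampleID are accepted, while #sampleid is rejected because the QIIME 1 style names must match exactly.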

Missing data

The previous format didn’t describe how to store missing data in Metadata files. See this forum topic for more details about how missing data were previously interpreted in QIIME 2, along with some discussion about missing data support in the new format.

Storing missing data in your metadata is simple: just use an empty cell! Values like NA, nan, etc. are no longer interpreted as missing data.
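
As a rough illustration of that rule (a plain-Python sketch, not QIIME 2’s parser; read_cells and the example columns are made up for this post), an empty cell becomes a missing value while NA stays an ordinary string:

```python
import csv
import io


def read_cells(tsv_text):
    """Parse TSV text into rows, mapping empty cells to None (missing)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [[cell if cell != "" else None for cell in row] for row in reader]


example = (
    "id\tbody-site\tph\n"
    "sample-1\tgut\t6.5\n"
    "sample-2\t\tNA\n"
)
rows = read_cells(example)
# The empty body-site cell is missing; "NA" is kept as literal text.
print(rows[2])
```

Note that under this rule a literal NA in an otherwise numeric column is just text, so it is worth replacing such placeholders with empty cells.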

Column typing

Metadata column types (i.e. numeric vs categorical data) were previously inferred by QIIME 2. For example, if a column consisted only of numbers, the column would be inferred to be numeric. Users had no way to override this inference to state that the column is actually categorical data. For example, a Subject column where subjects are labeled 1, 2, 3 should be treated categorically and not numerically, but there was no way to override that behavior with the previous format. The workaround was to create a new column containing non-numeric values denoting subjects, which would cause QIIME 2 to infer the column type as categorical data.

With the new format, QIIME 2 will continue to infer column types in the same way as before. However, the new format supports a special comment directive that allows users to specify a column’s type, avoiding the inference described above.

The column typing comment directive is entirely optional, and if it is present, you don’t have to specify a type for each column. This makes it easy to fill in column types as necessary; if a column’s type isn’t declared, the type will be inferred as usual.

The comment directive must appear directly below the header and the first cell must be labelled #q2:types. Subsequent cells may be labelled categorical, numeric, or left empty to have the type inferred.

Here is a simple example:

#SampleID Subject BodySite DaysSinceExperimentStart
#q2:types categorical categorical numeric
sample-1 1 gut 20
sample-2 1 tongue 25
sample-3 2 gut 15
sample-4 2 tongue 42

Here we have two categorical columns (Subject and BodySite) and one numeric column (DaysSinceExperimentStart). Using the column typing comment directive, we are able to override QIIME 2’s inference for the Subject column by stating that the column is categorical data. If the Subject column wasn’t labelled as categorical, it would be interpreted as numeric data, which is probably not what the user intended.
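
The interplay between declared and inferred types can be sketched in plain Python. This is a simplified illustration rather than QIIME 2’s implementation: resolve_column_types is a hypothetical helper, its numeric check is just Python’s float(), and it assumes every row has the same number of cells.

```python
import csv
import io


def _is_number(text):
    """Simplified numeric check; QIIME 2's own rules may differ."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def resolve_column_types(tsv_text):
    """Return {column name: 'categorical' or 'numeric'} for a small TSV.

    Declared types from a '#q2:types' row win; any column left undeclared
    falls back to a simple "all values are numbers" check.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    declared = {}
    if body and body[0] and body[0][0] == "#q2:types":
        directive = body.pop(0)
        # Pair each declared type with its column; a blank cell means "infer".
        declared = {name: kind
                    for name, kind in zip(header[1:], directive[1:]) if kind}
    types = {}
    for position, name in enumerate(header[1:], start=1):
        if name in declared:
            types[name] = declared[name]
            continue
        values = [row[position] for row in body if row[position] != ""]
        all_numeric = bool(values) and all(_is_number(v) for v in values)
        types[name] = "numeric" if all_numeric else "categorical"
    return types


with_directive = (
    "#SampleID\tSubject\tBodySite\tDaysSinceExperimentStart\n"
    "#q2:types\tcategorical\tcategorical\tnumeric\n"
    "sample-1\t1\tgut\t20\n"
    "sample-2\t1\ttongue\t25\n"
)
without_directive = (
    "#SampleID\tSubject\tBodySite\tDaysSinceExperimentStart\n"
    "sample-1\t1\tgut\t20\n"
    "sample-2\t1\ttongue\t25\n"
)
declared_types = resolve_column_types(with_directive)
inferred_types = resolve_column_types(without_directive)
```

Here declared_types reports Subject as categorical, while inferred_types reports it as numeric, which is the mismatch the directive is meant to prevent.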

Name change: metadata column vs category

In previous versions of QIIME 2 and QIIME 1, metadata columns were often referred to as metadata categories. Now that we support metadata column typing, which allows you to say whether a column contains numeric or categorical data, we would end up using terms like categorical metadata category or numeric metadata category, which can be confusing. We now avoid using the term category unless it is used in the context of categorical metadata.

This name change may be most noticeable for CLI users. Previously, any QIIME 2 action accepting a metadata column as input would have an option name ending with -category. The new option names end with -column, so previous commands will not work with the new naming scheme. For example, if a command previously used an option called --m-metadata-category, the new option name will be --m-metadata-column. If the CLI detects usage of the older option names, it will raise an error and describe how to update your command to use the new option names.
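
If you have many saved commands to update, the renaming rule is mechanical enough to script. The helper below is a hypothetical sketch, not part of q2cli, and it assumes that every option ending in -category in your commands refers to a metadata column; check each rewritten command against the action’s --help output.

```python
def update_option_name(option):
    """Rename an old-style '...-category' option to '...-column'.

    Hypothetical helper for updating saved commands; options that do not
    end in '-category' are returned unchanged.
    """
    suffix = "-category"
    if option.startswith("--") and option.endswith(suffix):
        return option[: -len(suffix)] + "-column"
    return option


old_args = ["--m-metadata-file", "metadata.tsv", "--m-metadata-category", "BodySite"]
new_args = [update_option_name(arg) for arg in old_args]
print(" ".join(new_args))
```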

Metadata validation

Metadata validation works the same as before: sample and feature metadata files stored in Google Sheets can be validated using Keemei.

QIIME 2 will also automatically validate a metadata file anytime it is used by the software. However, using Keemei to validate your metadata is recommended because a report of all validation errors and warnings will be presented each time Keemei is run. Loading your metadata in QIIME 2 will typically present only a single error at a time, which can make identifying and resolving validation issues cumbersome, especially if there are many issues with the metadata.

Metadata API changes

These changes affect Artifact API users and plugin/interface developers interacting with QIIME 2 Metadata.

There are many backwards-incompatible changes to the Metadata API. The changes listed below are not exhaustive; only the more pronounced changes are noted. If you have questions about updating your code to use the new API, please create new forum topic(s) and we’ll get in touch. Since there are so many API changes, it is likely that your code (plugins, interfaces, Artifact API) will need to be updated to continue functioning with the 2018.2 release.

Design overview

The qiime2.Metadata class continues to exist and has an updated API. The Metadata object is composed of zero or more qiime2.MetadataColumn objects.

The qiime2.MetadataCategory class has been renamed to qiime2.MetadataColumn, which is an abstract base class (ABC) and cannot be instantiated directly. There are two subclasses to represent categorical and numeric metadata columns: qiime2.CategoricalMetadataColumn and qiime2.NumericMetadataColumn, respectively. These concrete subclasses are the objects your code will interact with at runtime.

Metadata objects

When using qiime2.Metadata objects, here are some of the bigger changes to be aware of:

  • Metadata.load() has an optional column_types parameter, which allows you to override the metadata column types that are declared or inferred from the file being loaded. This is useful if you wish to programmatically override column types at runtime without having to modify the metadata file.

  • There is a new method, Metadata.save(), for saving a Metadata file to disk in TSV format. The Metadata file will be written with as much detail as possible to ensure that the file is roundtrippable. For example, a column types comment directive will always be written to indicate what column types were used at runtime, in order to make analyses reproducible without relying on column type inference.

  • Metadata.from_artifact() has been removed in favor of Artifact.view(Metadata), which matches the usual way of obtaining a particular view type from an Artifact (Metadata is no longer a special case in that regard).

  • The Metadata constructor continues to accept a pandas.DataFrame object, with the following requirements and considerations:

    • The dataframe’s index name (df.index.name) must match one of the required headers listed in the file format section above. An index name of None (the pandas default) is no longer accepted. It’s easy to set an index name when creating an index object on the dataframe, e.g. pd.Index([...], name='id').

    • If a column in the dataframe is dtype=object, it may contain strings or pandas missing values (e.g. np.nan, None). Columns matching this requirement are assumed to be categorical. Type casting/inference does not take place within the constructor currently.

    • If a column in the dataframe is dtype=float or dtype=int, it may contain floating point numbers or integers, as well as pandas missing values (e.g. np.nan). Columns matching this requirement are assumed to be numeric.

    • Regardless of column type (categorical vs numeric), the dataframe stored within the Metadata object will have any missing values normalized to use np.nan. Columns with dtype=int will be cast to dtype=float. To obtain a dataframe from Metadata containing these normalized data types and values, use Metadata.to_dataframe().

    • Since Metadata stores typed columns (dtype=object containing strings, or dtype=float containing numbers), there is no need to manually cast columns to the appropriate dtype anymore when using a dataframe obtained from Metadata.to_dataframe().

    • There are other restrictions placed on the values in the input dataframe that would prevent the file written by Metadata.save() from being roundtrippable. For example, IDs can’t begin with a pound sign (#) because the written out file would have that row represented as a comment line instead of a row containing an ID.

  • Metadata.columns is a read-only OrderedDict object mapping metadata column names to ColumnProperties namedtuples. The ColumnProperties namedtuple currently stores a single field called .type, which provides the column’s type as a string (either 'categorical' or 'numeric'). Since Metadata.columns is an ordered mapping, iterating over the columns will maintain the order of columns in the Metadata object. See Metadata.get_column() for retrieving a MetadataColumn object by column name.

  • Metadata.get_category() has been renamed to Metadata.get_column(). The method retrieves a MetadataColumn object (either CategoricalMetadataColumn or NumericMetadataColumn) based on the column name that’s requested.

  • The Metadata.ids() method has been renamed to Metadata.get_ids(). Its interface remains the same.

  • The Metadata.ids property returns a tuple of IDs.

  • The Metadata.id_count and Metadata.column_count properties provide access to the Metadata table dimensions.

  • The Metadata.id_header property returns the name specified in the input dataframe’s index (this corresponds to the first column’s name in the metadata file if loading from disk).

  • Metadata.filter() has been replaced by Metadata.filter_ids() and Metadata.filter_columns(). These methods perform ID-based and column-based filtering, respectively. Both methods return a filtered version of the Metadata object.
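
To give a feel for the shape of Metadata.columns described in the list above, here is a plain-Python stand-in built from the standard library. These are not the objects defined by qiime2; the ColumnProperties namedtuple and the column names below are recreated purely for illustration.

```python
from collections import OrderedDict, namedtuple

# Stand-in mirroring the structure described above: an ordered mapping of
# column name -> namedtuple with a single `type` field.
ColumnProperties = namedtuple("ColumnProperties", ["type"])

columns = OrderedDict([
    ("Subject", ColumnProperties(type="categorical")),
    ("BodySite", ColumnProperties(type="categorical")),
    ("DaysSinceExperimentStart", ColumnProperties(type="numeric")),
])

# Iteration follows column order, so selecting by type preserves it too.
numeric_names = [name for name, props in columns.items()
                 if props.type == "numeric"]
categorical_names = [name for name, props in columns.items()
                     if props.type == "categorical"]
```

Code that previously inspected dataframe dtypes to decide how to treat a column can branch on the type string in the same way.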

MetadataColumn objects

When using qiime2.MetadataColumn objects (i.e. CategoricalMetadataColumn and NumericMetadataColumn), here are some of the bigger changes to be aware of:

  • MetadataColumn.load() has been removed in favor of using Metadata.load() to obtain a Metadata object, followed by calling Metadata.get_column() to retrieve a column object.

  • MetadataColumn.from_artifact() has been removed in favor of using Artifact.view(Metadata) to obtain a Metadata object, followed by calling Metadata.get_column() to retrieve a column object.

  • The CategoricalMetadataColumn and NumericMetadataColumn constructors continue to accept a pandas.Series object, with the same requirements and considerations listed above for the Metadata constructor. The series object must have its .name property set to a valid metadata column name (the pandas default of None is no longer accepted). The series index name must also be one of the required ID header values listed earlier in this post.

  • There are several new methods and properties on the MetadataColumn object. Many of these APIs are similar to the methods/properties on the Metadata object described above.

    New properties: name, type, ids, id_count, id_header

    New methods: save(), to_dataframe(), get_ids(), filter_ids(), get_value(), has_missing_values(), drop_missing_values()

Plugin registration

In addition to the Metadata and MetadataColumn API changes described above, the way that plugins register MetadataColumn parameters is a bit different.

Note: If your plugin’s action accepts a qiime2.Metadata object, register your function in the same way as before. For example, if you have a method called foo that accepts a qiime2.Metadata object as input:

# function declaration
import qiime2
def foo(metadata: qiime2.Metadata):
    ...

# plugin registration (`plugin` is your existing qiime2.plugin.Plugin instance)
import qiime2.plugin
plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': qiime2.plugin.Metadata},
    ...
)

Now let’s consider a method foo that accepts a MetadataColumn as input. Since metadata columns are typed, you’ll have to decide whether your action accepts a CategoricalMetadataColumn, NumericMetadataColumn, or either type as input.

Let’s have the foo method accept a NumericMetadataColumn as input. Here’s what that might look like:

# function declaration
import qiime2
def foo(metadata: qiime2.NumericMetadataColumn):
    ...

# plugin registration
from qiime2.plugin import MetadataColumn, Numeric
plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Numeric]},
    ...
)

Similarly, if the method accepts a CategoricalMetadataColumn:

# function declaration
import qiime2
def foo(metadata: qiime2.CategoricalMetadataColumn):
    ...

# plugin registration
from qiime2.plugin import MetadataColumn, Categorical
plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Categorical]},
    ...
)

Finally, if the method accepts either a numeric or categorical column:

# function declaration
import qiime2
def foo(metadata: qiime2.MetadataColumn):
    # Note that the MetadataColumn ABC is used in the function annotation.
    # At runtime, the object received as input will be either
    # `NumericMetadataColumn` or `CategoricalMetadataColumn`.
    # You can differentiate between object types using `isinstance()`.
    # For example:
    if isinstance(metadata, qiime2.NumericMetadataColumn):
        ...
    elif isinstance(metadata, qiime2.CategoricalMetadataColumn):
        ...
    else:
        raise NotImplementedError()

# plugin registration
from qiime2.plugin import MetadataColumn, Numeric, Categorical
plugin.methods.register_function(
    function=foo,
    inputs={},
    parameters={'metadata': MetadataColumn[Numeric | Categorical]},
    ...
)

Additional resources

All of the plugins and interfaces included in the QIIME 2 Core 2018.2 Distribution have been updated to use the new Metadata API. In addition to the documentation in this post, looking at plugin or interface source code will provide concrete examples of the new Metadata API in action! Take a look at any of the repositories under the qiime2 GitHub organization that utilize QIIME 2 Metadata. q2-diversity and q2-feature-table are good examples of plugins using Metadata, and q2cli is an example of a QIIME 2 interface that leverages Metadata.
