Implementation:Interpretml Interpret Powerlift TaskMeasures
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Statistics, Data_Analysis |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Collection of statistical measure functions that compute dataset characteristics (entropy, class statistics, regression statistics, and data statistics) for Powerlift benchmarking tasks.
Description
This module provides four functions used to compute and populate statistical metadata for benchmarking datasets:
- entropy() -- Computes the Shannon entropy of a label distribution. Supports configurable logarithmic base and normalized entropy (entropy divided by the maximum possible entropy for the number of classes). Returns 0 for distributions with 1 or fewer labels/classes.
- class_stats() -- Computes classification-specific statistics for a target series: number of classes, normalized entropy, minimum class count, and maximum class count. Results are written into a provided meta dictionary.
- regression_stats() -- Computes regression-specific statistics for a response series: number of classes (set to 0), minimum value, average value, and maximum value. Results are written into a provided meta dictionary.
- data_stats() -- Computes feature-level statistics for a DataFrame: number of samples, number of features, maximum unique continuous values, maximum categories per feature, total categories, percentage of categorical features, and percentage of special values (NaN, empty strings for categorical; NaN, zero for continuous). Results are written into a provided meta dictionary.
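The three stats functions (class_stats(), regression_stats(), and data_stats()) return their results by populating the provided meta dictionary, which is later stored in the Task model's meta field and individual Task columns. entropy() is the exception: it returns its value directly.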
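The entropy behavior described above (configurable base, optional normalization, and a return value of 0 for degenerate inputs) can be sketched in plain Python. This is an illustrative stand-in written from the description, not the module's actual implementation; the name shannon_entropy_sketch is hypothetical, and normalization is assumed to divide by the log of the number of observed classes.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def shannon_entropy_sketch(
    labels: Iterable, base: Optional[float] = None, normalized: bool = False
) -> float:
    """Hypothetical re-implementation of the described entropy behavior."""
    counts = Counter(labels)
    n_labels = sum(counts.values())
    n_classes = len(counts)

    # Degenerate case: one or fewer labels/classes carries no uncertainty.
    if n_labels <= 1 or n_classes <= 1:
        return 0.0

    # Natural log when no base is given, as in the I/O contract.
    log_base = math.e if base is None else base

    ent = 0.0
    for count in counts.values():
        p = count / n_labels
        ent -= p * math.log(p, log_base)

    if normalized:
        # Assumption: the maximum possible entropy is log(n_classes),
        # reached when all observed classes are equally frequent.
        ent /= math.log(n_classes, log_base)
    return ent


# Four equally frequent classes give about 2 bits in base 2,
# and a normalized entropy of about 1.
uniform_bits = shannon_entropy_sketch(["a", "b", "c", "d"], base=2)
uniform_norm = shannon_entropy_sketch(["a", "b", "c", "d"], normalized=True)
```

Normalizing makes the value comparable across datasets with different numbers of classes, which is presumably why class_stats() records the normalized form.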
Usage
Use these functions when registering new datasets with Powerlift. They are called during task creation to compute the statistical profile of each dataset, which is stored alongside the data for filtering and analysis of benchmark results.
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File: python/powerlift/powerlift/measures/task_measures.py
Signature
def entropy(
labels: Iterable, base: Optional[Number] = None, normalized: bool = False
) -> Number:
...
def class_stats(y: pd.Series, meta):
...
def regression_stats(y: pd.Series, meta):
...
def data_stats(X: pd.DataFrame, categorical_mask: Iterable[bool], meta):
...
Import
from powerlift.measures.task_measures import entropy, class_stats, regression_stats, data_stats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| labels | Iterable | Yes | Label array for entropy computation |
| base | Number | No | Logarithmic base for entropy (defaults to e for natural log) |
| normalized | bool | No | Whether to return normalized entropy (default: False) |
| y | pd.Series | Yes | Target/response series for class_stats or regression_stats |
| X | pd.DataFrame | Yes | Feature DataFrame for data_stats |
| categorical_mask | Iterable[bool] | Yes | Boolean mask indicating which columns are categorical (for data_stats) |
| meta | dict | Yes | Mutable dictionary to populate with computed statistics |
Outputs
| Name | Type | Description |
|---|---|---|
| entropy return | Number | Shannon entropy value (or normalized entropy) |
| meta["n_classes"] | int | Number of unique classes (set by class_stats; regression_stats sets it to 0) |
| meta["class_normalized_entropy"] | float | Normalized entropy of class distribution (set by class_stats) |
| meta["min_class_count"] | int | Minimum class count (set by class_stats) |
| meta["max_class_count"] | int | Maximum class count (set by class_stats) |
| meta["n_samples"] | int | Number of samples (set by data_stats) |
| meta["n_features"] | int | Number of features (set by data_stats) |
| meta["max_unique_continuous"] | int | Maximum unique values in any continuous feature (set by data_stats) |
| meta["max_categories"] | int | Maximum categories in any categorical feature (set by data_stats) |
| meta["total_categories"] | int | Total categories across all categorical features (set by data_stats) |
| meta["percent_categorical"] | float | Proportion of features that are categorical (set by data_stats) |
| meta["percent_special_values"] | float | Proportion of cells with special values (set by data_stats) |
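The special-value rule behind percent_special_values (NaN or empty string in categorical columns; NaN or zero in continuous columns) can be illustrated with a small pure-Python sketch. It is an assumption-laden illustration, not the library's code: the helper names are hypothetical, columns are plain lists rather than a DataFrame, None is treated as missing alongside NaN, and the proportion is assumed to be taken over all cells.

```python
import math
from typing import Dict, Iterable, List


def _is_missing(value) -> bool:
    # Assumption: None and float NaN both count as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def special_value_fraction_sketch(
    columns: Dict[str, List], categorical_mask: Iterable[bool]
) -> float:
    """Hypothetical helper: fraction of cells holding a special value."""
    special = 0
    total = 0
    for values, is_categorical in zip(columns.values(), categorical_mask):
        for value in values:
            total += 1
            if _is_missing(value):
                special += 1
            elif is_categorical and value == "":
                special += 1  # empty string in a categorical column
            elif not is_categorical and value == 0:
                special += 1  # zero in a continuous column
    return special / total if total else 0.0


# One zero and one NaN in "age", one empty string in "color":
# three special cells out of six.
example_columns = {
    "age": [25, 0, float("nan")],
    "color": ["red", "", "blue"],
}
example_fraction = special_value_fraction_sketch(example_columns, [False, True])
```

Note that the mask is positional: its entries are matched to columns in order, as with categorical_mask in data_stats().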
Usage Examples
import pandas as pd
from powerlift.measures.task_measures import entropy, class_stats, data_stats
# Compute entropy of a label distribution
labels = [0, 0, 1, 1, 2, 2, 2]
ent = entropy(labels)
norm_ent = entropy(labels, normalized=True)
# Compute class statistics
y = pd.Series([0, 0, 1, 1, 2, 2, 2])
meta = {}
class_stats(y, meta)
# meta == {"n_classes": 3, "class_normalized_entropy": ..., "min_class_count": 2, "max_class_count": 3}
# Compute data statistics
X = pd.DataFrame({"age": [25, 30, 35], "color": ["red", "blue", "red"]})
categorical_mask = [False, True]
data_stats(X, categorical_mask, meta)
# meta now also includes n_samples, n_features, max_unique_continuous, etc.
Related Pages
- Interpretml_Interpret_Powerlift_Schema -- Task model where computed statistics are stored
- Interpretml_Interpret_Powerlift_RunTrials -- Trial runner that accesses task metadata during execution