Implementation:EvolvingLMMs Lab Lmms eval GroupConfig

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Configuration, Task Management
Last Updated	2026-02-14 00:00 GMT

Overview

GroupConfig and ConfigurableGroup provide the configuration and management infrastructure for task groups in the evaluation framework. Task groups allow related tasks to be evaluated together with aggregated metrics computed across the group's constituent tasks.

Description

The module defines three components: AggMetricConfig, a dataclass that specifies how a metric should be aggregated across subtasks; GroupConfig, a dataclass that holds the full group definition including its name, member tasks, and aggregation rules; and ConfigurableGroup, an abstract base class that wraps a GroupConfig and exposes its fields as properties. Both config dataclasses inherit from dict for backward compatibility with code that expects dictionary-like access.

Usage

Group configurations are typically defined in YAML files alongside task definitions. The framework loads these into GroupConfig instances, which are then wrapped by ConfigurableGroup subclasses during evaluation. Aggregated metrics are computed according to the aggregate_metric_list specification.
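As a rough illustration of what a mean over subtasks involves, with and without size weighting, the sketch below aggregates per-subtask scores both ways. The function name aggregate_mean and the sample scores and sizes are hypothetical and are not taken from lmms-eval; the framework's own aggregation routine may differ in detail.

```python
from typing import Sequence


def aggregate_mean(
    scores: Sequence[float], sizes: Sequence[int], weight_by_size: bool = False
) -> float:
    """Average subtask scores, optionally weighting each by its sample count."""
    if len(scores) != len(sizes) or not scores:
        raise ValueError("scores and sizes must be non-empty and the same length")
    if not weight_by_size:
        # Plain mean: every subtask counts equally.
        return sum(scores) / len(scores)
    # Size-weighted mean: larger subtasks contribute proportionally more.
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Made-up scores for two subtasks with 100 and 300 samples.
scores, sizes = [0.5, 0.75], [100, 300]
unweighted = aggregate_mean(scores, sizes)                     # 0.625
weighted = aggregate_mean(scores, sizes, weight_by_size=True)  # 0.6875
```

With weighting enabled, the larger subtask pulls the group score toward its own value, which is why the two results differ.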

Code Reference

Source Location

Repository: EvolvingLMMs-Lab/lmms-eval
File: lmms_eval/api/group.py
Lines: 1--105

Key Components

AggMetricConfig

@dataclass
class AggMetricConfig(dict):
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    filter_list: Optional[Union[str, list]] = "none"

    def __post_init__(self):
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(
                f"Currently, 'mean' is the only pre-defined aggregation "
                f"across groups' subtasks. Got '{self.aggregation}'."
            )
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]

Purpose: Configure how a metric should be aggregated across a group's subtasks.

Parameters:

metric -- Name of the metric to aggregate (e.g., "accuracy", "f1_score")
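aggregation -- Aggregation method; must be the string "mean" or a callable, otherwise __post_init__ raises ValueError
weight_by_size -- Whether to weight each subtask by its dataset size (default: False); annotated as Optional[str] in the listing above although the default is a boolean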
filter_list -- Filter names to incorporate (default: "none"); normalized to a list in __post_init__
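The validation and normalization described above can be tried out with a stand-in copy of the dataclass. The sketch below mirrors the fields and __post_init__ from the listing (with a shortened error message) so that it runs without lmms-eval installed; using max as the custom aggregation is only a placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AggMetricConfig(dict):
    # Stand-in copy of the fields listed above (annotations kept as-is).
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    filter_list: Optional[Union[str, list]] = "none"

    def __post_init__(self):
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(f"Unsupported aggregation: {self.aggregation!r}")
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]


default_cfg = AggMetricConfig(metric="acc")
normalized = default_cfg.filter_list  # the bare string is wrapped in a list

# A callable passes validation (max is a placeholder, not a recommendation).
custom_cfg = AggMetricConfig(metric="acc", aggregation=max)

# Any string other than "mean" is rejected.
try:
    AggMetricConfig(metric="acc", aggregation="median")
    rejected = False
except ValueError:
    rejected = True
```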

GroupConfig

@dataclass
class GroupConfig(dict):
    group: Optional[str] = None
    group_alias: Optional[str] = None
    task: Optional[Union[str, list]] = None
    aggregate_metric_list: Optional[
        Union[List[AggMetricConfig], AggMetricConfig, dict]
    ] = None
    metadata: Optional[dict] = None

Purpose: Configuration for a task group, including which tasks belong to it and how to aggregate their metrics.

Parameters:

group -- Group identifier/name
group_alias -- Alternative display name for the group
task -- Single task name or list of task names in this group
aggregate_metric_list -- Metrics to aggregate across tasks; accepts a single dict, a list of dicts, or AggMetricConfig instances
metadata -- Arbitrary user-defined metadata (not used by the framework)
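The listing above does not show how the different input shapes for aggregate_metric_list are reconciled. One plausible approach, sketched here with a hypothetical helper named normalize_metric_list, is to coerce every accepted shape into a list before use. This is an assumption for illustration, not the framework's code.

```python
from typing import Any, List


def normalize_metric_list(value: Any) -> List[Any]:
    """Hypothetical helper: coerce the accepted input shapes into one list."""
    if value is None:
        return []
    if isinstance(value, list):
        # Copy so the caller's list is not mutated later.
        return list(value)
    # A single dict or a single config object becomes a one-element list.
    return [value]


single = normalize_metric_list({"metric": "acc"})
several = normalize_metric_list([{"metric": "acc"}, {"metric": "f1"}])
missing = normalize_metric_list(None)
```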

GroupConfig.to_dict

def to_dict(self, keep_callable: bool = False) -> dict:
    cfg_dict = asdict(self)
    for k, v in list(cfg_dict.items()):
        if callable(v):
            cfg_dict[k] = self.serialize_function(v, keep_callable=keep_callable)
    return cfg_dict

Purpose: Convert the configuration to a dictionary suitable for logging or results output.

Parameters:

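keep_callable -- If True, callable values are returned unchanged; if False (the default), each is converted to its source code string via inspect.getsource, falling back to str() when the source cannot be retrieved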

Returns: Dictionary representation of the config with callable values serialized.

GroupConfig.serialize_function

def serialize_function(
    self, value: Union[Callable, str], keep_callable=False
) -> Union[Callable, str]:
    if keep_callable:
        return value
    else:
        try:
            return getsource(value)
        except (TypeError, OSError):
            return str(value)

Purpose: Serialize a callable configuration value to its source code string, falling back to str().
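To see both branches and the fallback in action, the method can be copied into a free function. The sketch below does that and passes it the built-in len, for which inspect.getsource raises TypeError, so the str() fallback applies.

```python
from inspect import getsource
from typing import Callable, Union


def serialize_function(
    value: Union[Callable, str], keep_callable: bool = False
) -> Union[Callable, str]:
    """Free-function copy of the method above, for experimentation."""
    if keep_callable:
        return value
    try:
        return getsource(value)
    except (TypeError, OSError):
        # Built-ins and objects without retrievable source end up here.
        return str(value)


kept = serialize_function(len, keep_callable=True)  # the function object itself
as_text = serialize_function(len)                   # "<built-in function len>"
```

For functions defined in a regular source file, the result is the function's source text instead.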

ConfigurableGroup

class ConfigurableGroup(abc.ABC):
    def __init__(self, config: Optional[dict] = None) -> None:
        self._config = GroupConfig(**config)

    @property
    def group(self):
        return self._config.group

    @property
    def group_alias(self):
        return self._config.group_alias

    @property
    def version(self):
        return self._config.version

    @property
    def config(self):
        return self._config.to_dict()

    @property
    def group_name(self) -> Any:
        return self._config.group

    def __repr__(self):
        return (
            f"ConfigurableGroup(group={self.group},"
            f"group_alias={self.group_alias})"
        )

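Purpose: Abstract base class for group implementations. It wraps a GroupConfig and exposes selected fields as read-only properties. Note that the version property reads self._config.version, but GroupConfig as listed above declares no version field, so accessing it raises AttributeError unless that attribute is set elsewhere.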
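A minimal sketch of how a subclass might wrap a configuration dictionary follows. GroupConfig is reduced to the fields the example touches, and DemoGroup is a hypothetical subclass invented for illustration; neither is the framework's actual code.

```python
import abc
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class GroupConfig(dict):
    # Reduced stand-in with only the fields this example touches.
    group: Optional[str] = None
    group_alias: Optional[str] = None
    task: Optional[Union[str, list]] = None

    def to_dict(self) -> dict:
        return asdict(self)


class ConfigurableGroup(abc.ABC):
    def __init__(self, config: Optional[dict] = None) -> None:
        self._config = GroupConfig(**config)

    @property
    def group(self):
        return self._config.group

    @property
    def group_alias(self):
        return self._config.group_alias

    @property
    def config(self):
        return self._config.to_dict()


class DemoGroup(ConfigurableGroup):
    """Hypothetical subclass; adds nothing beyond the base behaviour."""


demo = DemoGroup({"group": "mmlu", "task": ["mmlu_stem", "mmlu_humanities"]})
```

Because the constructor unpacks config with **, calling it without a dictionary raises TypeError even though the parameter is annotated Optional with a default of None.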

I/O Contract

Input	Type	Description
config	`dict`	Dictionary of group configuration fields; unpacked as GroupConfig(**config), so its keys must match GroupConfig field names

Output	Type	Description
GroupConfig	`GroupConfig`	Dataclass holding group name, tasks, and aggregation rules
to_dict()	`dict`	Serialized configuration dictionary for logging

Integration with Framework

YAML Configuration

group: mmlu
group_alias: "Massive Multitask Language Understanding"
task:
  - mmlu_humanities
  - mmlu_stem
  - mmlu_social_sciences
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
metadata:
  paper: "Measuring Massive Multitask Language Understanding"
  year: 2020

Results Output

{
  "results": {
    "mmlu": {
      "acc": 0.75,
      "config": {
        "group": "mmlu",
        "group_alias": "Massive Multitask Language Understanding",
        "task": ["mmlu_humanities", "mmlu_stem", "mmlu_social_sciences"],
        "aggregate_metric_list": [{"metric": "acc", "aggregation": "mean", "weight_by_size": true}]
      }
    }
  }
}

Design Decisions

Dict inheritance -- Both config classes inherit from dict for backward compatibility with code expecting dictionary-like objects.
Flexible input -- aggregate_metric_list accepts a single dict, a single AggMetricConfig, or a list of either, for convenience.
Metadata field -- Extension point for custom use cases without modifying the framework.
Callable serialization -- Uses inspect.getsource() to preserve function definitions in logs and output.
Mean-only aggregation -- Restricts to mean aggregation with the option to extend via custom callables.
Property access -- ConfigurableGroup exposes config fields as properties for a cleaner API.
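One subtlety of the dict-inheritance decision is that a dataclass stores its fields as attributes, so the underlying dict stays empty unless item access is routed to those attributes. The sketch below shows one way such a bridge could look. DemoConfig and its __getitem__/__setitem__ methods are assumptions for illustration and do not appear in the listings on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoConfig(dict):
    # Hypothetical config class used only for this illustration.
    group: Optional[str] = None

    def __getitem__(self, item):
        # Assumed bridge: route cfg["group"] to the dataclass attribute.
        return getattr(self, item)

    def __setitem__(self, item, value):
        setattr(self, item, value)


cfg = DemoConfig(group="mmlu")
cfg["group_alias"] = "MMLU"  # stored as an attribute, not as a dict entry

is_dict = isinstance(cfg, dict)  # passes isinstance checks in older code
stored_items = len(cfg)          # the underlying dict itself holds nothing
```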

Related Pages

Principle: EvolvingLMMs_Lab_Lmms_eval_Task_Directory_Structure
See Also: EvolvingLMMs_Lab_Lmms_eval_YAML_Task_Configuration, EvolvingLMMs_Lab_Lmms_eval_Post_Processing_and_Metrics
