Implementation:EvolvingLMMs Lab Lmms eval GroupConfig

From Leeroopedia
Knowledge Sources
Domains: Configuration, Task Management
Last Updated: 2026-02-14 00:00 GMT

Overview

GroupConfig and ConfigurableGroup provide the configuration and management infrastructure for task groups in the evaluation framework. Task groups allow related tasks to be evaluated together with aggregated metrics computed across the group's constituent tasks.

Description

The module defines three components: AggMetricConfig, a dataclass that specifies how a metric should be aggregated across subtasks; GroupConfig, a dataclass that holds the full group definition including its name, member tasks, and aggregation rules; and ConfigurableGroup, an abstract base class that wraps a GroupConfig and exposes its fields as properties. Both config dataclasses inherit from dict for backward compatibility with code that expects dictionary-like access.

Usage

Group configurations are typically defined in YAML files alongside task definitions. The framework loads these into GroupConfig instances, which are then wrapped by ConfigurableGroup subclasses during evaluation. Aggregated metrics are computed according to the aggregate_metric_list specification.

Code Reference

Source Location

  • Repository: EvolvingLMMs-Lab/lmms-eval
  • File: lmms_eval/api/group.py
  • Lines: 1--105

Key Components

AggMetricConfig

@dataclass
class AggMetricConfig(dict):
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    filter_list: Optional[Union[str, list]] = "none"

    def __post_init__(self):
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(
                f"Currently, 'mean' is the only pre-defined aggregation "
                f"across groups' subtasks. Got '{self.aggregation}'."
            )
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]

Purpose: Configure how a metric should be aggregated across a group's subtasks.

Parameters:

  • metric -- Name of the metric to aggregate (e.g., "accuracy", "f1_score")
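  • aggregation -- Aggregation method; must be the string "mean" or a callable, otherwise __post_init__ raises ValueError
  • weight_by_size -- Whether to weight by dataset size (default: False); the source annotates this field as Optional[str] even though the default is a boolean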
  • filter_list -- Filter names to incorporate (default: "none"); normalized to a list in __post_init__
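
The validation and normalization described above can be tried out with a standalone copy of the dataclass. This is a minimal sketch that reproduces the class so it runs without lmms_eval installed; the shortened error message and the bool annotation on weight_by_size are simplifications made for this example, not the source's wording.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AggMetricConfig(dict):
    # Standalone copy of the class shown above, for illustration only.
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[bool] = False  # annotated Optional[str] in the source
    filter_list: Optional[Union[str, list]] = "none"

    def __post_init__(self):
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(f"Unsupported aggregation: {self.aggregation!r}")
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]


# A string filter_list is wrapped in a single-element list.
cfg = AggMetricConfig(metric="acc", weight_by_size=True)
normalized_filters = cfg.filter_list

# Any string other than "mean" is rejected ...
try:
    AggMetricConfig(metric="acc", aggregation="median")
    rejected = False
except ValueError:
    rejected = True

# ... while a callable passes the check unchanged.
custom = AggMetricConfig(metric="acc", aggregation=max)
```

Note that the check only verifies that a non-"mean" value is callable; it does not inspect the callable's signature.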

GroupConfig

@dataclass
class GroupConfig(dict):
    group: Optional[str] = None
    group_alias: Optional[str] = None
    task: Optional[Union[str, list]] = None
    aggregate_metric_list: Optional[
        Union[List[AggMetricConfig], AggMetricConfig, dict]
    ] = None
    metadata: Optional[dict] = None

Purpose: Configuration for a task group, including which tasks belong to it and how to aggregate their metrics.

Parameters:

  • group -- Group identifier/name
  • group_alias -- Alternative display name for the group
  • task -- Single task name or list of task names in this group
  • aggregate_metric_list -- Metrics to aggregate across tasks; accepts a single dict, a list of dicts, or AggMetricConfig instances
  • metadata -- Arbitrary user-defined metadata (not used by the framework)
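
Because aggregate_metric_list accepts several input shapes, some normalization step is needed before the values can be used uniformly. The sketch below shows one way this could be done. The helper normalize_agg_list is hypothetical and is not taken from the source, and the AggMetricConfig here is a trimmed stand-in without validation.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class AggMetricConfig(dict):
    # Trimmed stand-in for the class documented above (validation omitted).
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[bool] = False


def normalize_agg_list(
    value: Union[None, dict, List[Union[dict, AggMetricConfig]]],
) -> Optional[List[AggMetricConfig]]:
    """Hypothetical helper: coerce any accepted input shape to a list of configs."""
    if value is None:
        return None
    # A single item becomes a one-element list. AggMetricConfig subclasses
    # dict, so this check also matches a single config instance.
    if isinstance(value, dict):
        value = [value]
    # Test for AggMetricConfig first: an instance is an *empty* dict, so
    # unpacking it with ** would silently discard its field values.
    return [
        item if isinstance(item, AggMetricConfig) else AggMetricConfig(**item)
        for item in value
    ]


single = normalize_agg_list({"metric": "acc"})
mixed = normalize_agg_list(
    [{"metric": "acc"}, AggMetricConfig(metric="f1", weight_by_size=True)]
)
```

The order of the isinstance checks matters here precisely because the config classes inherit from dict without storing their fields as dictionary items.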

GroupConfig.to_dict

def to_dict(self, keep_callable: bool = False) -> dict:
    cfg_dict = asdict(self)
    for k, v in list(cfg_dict.items()):
        if callable(v):
            cfg_dict[k] = self.serialize_function(v, keep_callable=keep_callable)
    return cfg_dict

Purpose: Convert the configuration to a dictionary suitable for logging or results output.

Parameters:

  • keep_callable -- If False (the default), each top-level callable value is replaced by its source code string via inspect.getsource; if True, callables are returned unchanged

Returns: Dictionary representation of the config with callable values serialized.

GroupConfig.serialize_function

def serialize_function(
    self, value: Union[Callable, str], keep_callable=False
) -> Union[Callable, str]:
    if keep_callable:
        return value
    else:
        try:
            return getsource(value)
        except (TypeError, OSError):
            return str(value)

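Purpose: Serialize a callable configuration value to its source code string, falling back to str() when inspect.getsource raises TypeError or OSError (for example, for built-in functions that have no Python source).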
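
The fallback path is easiest to see with a built-in function. The snippet below is a module-level copy of the method (without self), written only for illustration.

```python
from inspect import getsource
from typing import Callable, Union


def serialize_function(
    value: Union[Callable, str], keep_callable: bool = False
) -> Union[Callable, str]:
    # Module-level copy of the method shown above, without `self`.
    if keep_callable:
        return value
    try:
        return getsource(value)
    except (TypeError, OSError):
        return str(value)


# Built-ins have no Python source, so getsource raises TypeError
# and the helper falls back to str().
as_text = serialize_function(len)

# With keep_callable=True the original object is passed through untouched.
kept = serialize_function(len, keep_callable=True)
```

Here as_text is the string "<built-in function len>", while kept is the len function itself.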

ConfigurableGroup

class ConfigurableGroup(abc.ABC):
    def __init__(self, config: Optional[dict] = None) -> None:
        self._config = GroupConfig(**config)

    @property
    def group(self):
        return self._config.group

    @property
    def group_alias(self):
        return self._config.group_alias

    @property
    def version(self):
        return self._config.version

    @property
    def config(self):
        return self._config.to_dict()

    @property
    def group_name(self) -> Any:
        return self._config.group

    def __repr__(self):
        return (
            f"ConfigurableGroup(group={self.group},"
            f"group_alias={self.group_alias})"
        )

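Purpose: Abstract base class for group implementations that wraps a GroupConfig and exposes its fields as properties. Two details of the code as shown are worth noting: the version property reads self._config.version, which is not among the GroupConfig fields listed above, and the constructor unpacks config with **, so the default value of None cannot actually be used.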
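
The following sketch wraps a plain dict and reads the exposed properties. It uses reduced copies of both classes so it runs standalone, and DemoGroup is a hypothetical subclass introduced only for this example. With GroupConfig limited to the fields listed on this page, accessing version raises AttributeError; the real module may behave differently if it defines that attribute elsewhere.

```python
import abc
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class GroupConfig(dict):
    # Trimmed stand-in with fields from the listing above (no `version`).
    group: Optional[str] = None
    group_alias: Optional[str] = None
    task: Optional[Union[str, list]] = None
    metadata: Optional[dict] = None


class ConfigurableGroup(abc.ABC):
    # Reduced copy of the wrapper shown above.
    def __init__(self, config: Optional[dict] = None) -> None:
        self._config = GroupConfig(**config)

    @property
    def group(self):
        return self._config.group

    @property
    def group_alias(self):
        return self._config.group_alias

    @property
    def version(self):
        return self._config.version


class DemoGroup(ConfigurableGroup):
    """Hypothetical concrete subclass used only for this example."""


demo = DemoGroup({"group": "mmlu", "task": ["mmlu_stem", "mmlu_humanities"]})
name, alias = demo.group, demo.group_alias

# The stand-in GroupConfig has no `version` attribute, so the property fails.
try:
    demo.version
    version_ok = True
except AttributeError:
    version_ok = False
```

In this sketch name is "mmlu", alias is None because no group_alias was supplied, and version_ok ends up False.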

I/O Contract

Input        Type         Description
config       dict         Dictionary of group configuration fields

Output       Type         Description
GroupConfig  GroupConfig  Dataclass holding group name, tasks, and aggregation rules
to_dict()    dict         Serialized configuration dictionary for logging

Integration with Framework

YAML Configuration

group: mmlu
group_alias: "Massive Multitask Language Understanding"
task:
  - mmlu_humanities
  - mmlu_stem
  - mmlu_social_sciences
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
metadata:
  paper: "Measuring Massive Multitask Language Understanding"
  year: 2020
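
As a rough illustration of what weight_by_size: true might imply, the sketch below compares a plain mean with a size-weighted mean over three subtasks. The function aggregate_mean, the scores, and the example counts are all hypothetical; the framework's actual aggregation code is not shown on this page and may differ.

```python
from typing import Sequence


def aggregate_mean(
    scores: Sequence[float], sizes: Sequence[int], weight_by_size: bool = False
) -> float:
    """Hypothetical helper: mean of subtask scores, optionally weighted by size."""
    if len(scores) != len(sizes) or not scores:
        raise ValueError("scores and sizes must be non-empty and equally long")
    if not weight_by_size:
        return sum(scores) / len(scores)
    # Each subtask contributes in proportion to its number of examples.
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Made-up subtask accuracies and example counts, for illustration only.
scores = [0.5, 0.75, 1.0]
sizes = [300, 100, 100]

plain = aggregate_mean(scores, sizes)
weighted = aggregate_mean(scores, sizes, weight_by_size=True)
```

With these made-up numbers the plain mean is 0.75, while the size-weighted mean is 0.65 because the largest subtask has the lowest score.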

Results Output

{
  "results": {
    "mmlu": {
      "acc": 0.75,
      "config": {
        "group": "mmlu",
        "group_alias": "Massive Multitask Language Understanding",
        "task": ["mmlu_humanities", "mmlu_stem", "mmlu_social_sciences"],
        "aggregate_metric_list": [{"metric": "acc", "aggregation": "mean", "weight_by_size": true}]
      }
    }
  }
}

Design Decisions

  • Dict inheritance -- Both config classes inherit from dict for backward compatibility with code expecting dictionary-like objects.
  • Flexible input -- aggregate_metric_list accepts either a single item or a list of items, where each item is a dict or an AggMetricConfig instance.
  • Metadata field -- Extension point for custom use cases without modifying the framework.
  • Callable serialization -- Uses inspect.getsource() to preserve function definitions in logs and output.
  • Mean-only aggregation -- Restricts to mean aggregation with the option to extend via custom callables.
  • Property access -- ConfigurableGroup exposes config fields as properties for a cleaner API.
