Implementation:EvolvingLMMs Lab Lmms eval GroupConfig
| Knowledge Sources | |
|---|---|
| Domains | Configuration, Task Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
GroupConfig and ConfigurableGroup provide the configuration and management infrastructure for task groups in the evaluation framework. Task groups allow related tasks to be evaluated together with aggregated metrics computed across the group's constituent tasks.
Description
The module defines three components: AggMetricConfig, a dataclass that specifies how a metric should be aggregated across subtasks; GroupConfig, a dataclass that holds the full group definition including its name, member tasks, and aggregation rules; and ConfigurableGroup, an abstract base class that wraps a GroupConfig and exposes its fields as properties. Both config dataclasses inherit from dict for backward compatibility with code that expects dictionary-like access.
Usage
Group configurations are typically defined in YAML files alongside task definitions. The framework loads these into GroupConfig instances, which are then wrapped by ConfigurableGroup subclasses during evaluation. Aggregated metrics are computed according to the aggregate_metric_list specification.
Code Reference
Source Location
- Repository: EvolvingLMMs-Lab/lmms-eval
- File: lmms_eval/api/group.py
- Lines: 1-105
Key Components
AggMetricConfig
@dataclass
class AggMetricConfig(dict):
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    filter_list: Optional[Union[str, list]] = "none"

    def __post_init__(self):
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(
                f"Currently, 'mean' is the only pre-defined aggregation "
                f"across groups' subtasks. Got '{self.aggregation}'."
            )
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]
Purpose: Configure how a metric should be aggregated across a group's subtasks.
Parameters:
- metric -- Name of the metric to aggregate (e.g., "accuracy", "f1_score")
- aggregation -- Aggregation method; currently only "mean" or a custom callable
- weight_by_size -- Whether to weight by dataset size (default: False)
- filter_list -- Filter names to incorporate (default: "none"); normalized to a list in __post_init__
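The validation and normalization in __post_init__ can be exercised in isolation. The sketch below re-declares the dataclass locally so that it runs without lmms_eval installed; the type hints are tightened for illustration, and the variable names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class AggMetricConfig(dict):
    # Local copy of the dataclass listed above; type hints are tightened
    # here for illustration and are not taken from the source file.
    metric: Optional[str] = None
    aggregation: Optional[Union[str, Callable]] = "mean"
    weight_by_size: bool = False
    filter_list: Optional[Union[str, List[str]]] = "none"

    def __post_init__(self):
        if self.aggregation != "mean" and not callable(self.aggregation):
            raise ValueError(
                "Currently, 'mean' is the only pre-defined aggregation "
                f"across groups' subtasks. Got '{self.aggregation}'."
            )
        if isinstance(self.filter_list, str):
            self.filter_list = [self.filter_list]


# Defaults: mean aggregation, and the "none" filter wrapped in a list.
default_cfg = AggMetricConfig(metric="acc")

# A custom callable passes validation as well.
max_cfg = AggMetricConfig(metric="acc", aggregation=max)

# Any other string is rejected in __post_init__.
try:
    AggMetricConfig(metric="acc", aggregation="median")
    rejected = False
except ValueError:
    rejected = True
```

Because the check runs at construction time, an unsupported aggregation name fails as soon as the config is built rather than later during evaluation.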
GroupConfig
@dataclass
class GroupConfig(dict):
    group: Optional[str] = None
    group_alias: Optional[str] = None
    task: Optional[Union[str, list]] = None
    aggregate_metric_list: Optional[
        Union[List[AggMetricConfig], AggMetricConfig, dict]
    ] = None
    metadata: Optional[dict] = None
Purpose: Configuration for a task group, including which tasks belong to it and how to aggregate their metrics.
Parameters:
- group -- Group identifier/name
- group_alias -- Alternative display name for the group
- task -- Single task name or list of task names in this group
- aggregate_metric_list -- Metrics to aggregate across tasks; accepts a single dict, a list of dicts, or AggMetricConfig instances
- metadata -- Arbitrary user-defined metadata (not used by the framework)
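The listing above does not show how the different input shapes of aggregate_metric_list are reconciled. One plausible approach, sketched here under that assumption, is to coerce every accepted shape to a list of AggMetricConfig instances. The helper normalize_agg_metrics is hypothetical, and the stand-in class deliberately does not inherit from dict so that the isinstance checks stay unambiguous.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class AggMetricConfig:
    # Trimmed stand-in for the class documented above (validation omitted).
    metric: Optional[str] = None
    aggregation: str = "mean"
    weight_by_size: bool = False


def normalize_agg_metrics(
    value: Union[None, dict, AggMetricConfig, List[Union[dict, AggMetricConfig]]],
) -> Optional[List[AggMetricConfig]]:
    """Hypothetical helper: coerce any accepted input shape to a list."""
    if value is None:
        return None
    # A single dict or instance becomes a one-element list.
    if isinstance(value, (dict, AggMetricConfig)):
        value = [value]
    # Dict entries are expanded into AggMetricConfig; instances pass through.
    return [
        item if isinstance(item, AggMetricConfig) else AggMetricConfig(**item)
        for item in value
    ]


single = normalize_agg_metrics({"metric": "acc"})
mixed = normalize_agg_metrics(
    [{"metric": "acc", "weight_by_size": True}, AggMetricConfig(metric="f1")]
)
```

After normalization, downstream code only has to handle one shape (a list of config objects, or None when no aggregation is requested).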
GroupConfig.to_dict
def to_dict(self, keep_callable: bool = False) -> dict:
    cfg_dict = asdict(self)
    for k, v in list(cfg_dict.items()):
        if callable(v):
            cfg_dict[k] = self.serialize_function(v, keep_callable=keep_callable)
    return cfg_dict
Purpose: Convert the configuration to a dictionary suitable for logging or results output.
Parameters:
- keep_callable -- If False (the default), callables are converted to source code strings via inspect.getsource; if True, they are returned unchanged
Returns: Dictionary representation of the config with callable values serialized.
GroupConfig.serialize_function
def serialize_function(
    self, value: Union[Callable, str], keep_callable=False
) -> Union[Callable, str]:
    if keep_callable:
        return value
    else:
        try:
            return getsource(value)
        except (TypeError, OSError):
            return str(value)
Purpose: Serialize a callable configuration value to its source code string, falling back to str().
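The interplay between to_dict and serialize_function can be tried out with plain functions. The sketch below is a standalone rewrite, not the class methods themselves: serialize_values is a hypothetical name for the loop in to_dict, and the config dictionary is made up. The built-in len is used because it has no Python source, which exercises the str() fallback.

```python
from inspect import getsource
from typing import Any, Callable, Union


def serialize_function(value: Union[Callable, str], keep_callable: bool = False) -> Any:
    # Standalone version of GroupConfig.serialize_function shown above.
    if keep_callable:
        return value
    try:
        return getsource(value)
    except (TypeError, OSError):
        # Built-ins and objects without retrievable source land here.
        return str(value)


def serialize_values(cfg: dict, keep_callable: bool = False) -> dict:
    # Mirrors the loop in to_dict: only callable values are touched.
    return {
        k: serialize_function(v, keep_callable) if callable(v) else v
        for k, v in cfg.items()
    }


# Hypothetical config with one callable value.
cfg = {"group": "demo_group", "aggregation": len}

as_logged = serialize_values(cfg)                    # callable replaced by a string
as_kept = serialize_values(cfg, keep_callable=True)  # callable left untouched
```

With keep_callable left at False the result contains only plain values, which is what makes it suitable for JSON logging.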
ConfigurableGroup
class ConfigurableGroup(abc.ABC):
    def __init__(self, config: Optional[dict] = None) -> None:
        self._config = GroupConfig(**config)

    @property
    def group(self):
        return self._config.group

    @property
    def group_alias(self):
        return self._config.group_alias

    @property
    def version(self):
        return self._config.version

    @property
    def config(self):
        return self._config.to_dict()

    @property
    def group_name(self) -> Any:
        return self._config.group

    def __repr__(self):
        return (
            f"ConfigurableGroup(group={self.group},"
            f"group_alias={self.group_alias})"
        )
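Purpose: Abstract base class for group implementations that wraps a GroupConfig and exposes its fields as properties. Note that the version property reads self._config.version, which is not among the GroupConfig fields listed above; as shown, accessing it would raise AttributeError unless that attribute is defined elsewhere in the source.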
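A subclass can build on the wrapped config without touching the base class. The following is a minimal sketch using reduced local copies of GroupConfig and ConfigurableGroup so that it runs on its own. DemoGroup and task_names are hypothetical, the config property uses asdict directly instead of to_dict, and the `config or {}` guard is an addition for the example, not part of the source.

```python
import abc
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class GroupConfig(dict):
    # Reduced copy of the dataclass above: fields only.
    group: Optional[str] = None
    group_alias: Optional[str] = None
    task: Optional[Union[str, list]] = None
    metadata: Optional[dict] = None


class ConfigurableGroup(abc.ABC):
    # Reduced copy of the base class above.
    def __init__(self, config: Optional[dict] = None) -> None:
        # The `or {}` guard is added here so that config=None does not fail.
        self._config = GroupConfig(**(config or {}))

    @property
    def group(self):
        return self._config.group

    @property
    def group_alias(self):
        return self._config.group_alias

    @property
    def config(self):
        return asdict(self._config)


class DemoGroup(ConfigurableGroup):
    """Hypothetical subclass that adds a helper for listing member tasks."""

    def task_names(self) -> List[str]:
        task = self._config.task
        if task is None:
            return []
        # A single task name is wrapped so callers always get a list.
        return [task] if isinstance(task, str) else list(task)


demo = DemoGroup({"group": "mmlu", "task": ["mmlu_stem", "mmlu_humanities"]})
```

Here demo.group and demo.config read through to the wrapped dataclass, while task_names hides the "string or list" ambiguity of the task field from callers.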
I/O Contract
| Input | Type | Description |
|---|---|---|
| config | dict | Dictionary of group configuration fields |

| Output | Type | Description |
|---|---|---|
| GroupConfig | GroupConfig | Dataclass holding group name, tasks, and aggregation rules |
| to_dict() | dict | Serialized configuration dictionary for logging |
Integration with Framework
YAML Configuration
group: mmlu
group_alias: "Massive Multitask Language Understanding"
task:
  - mmlu_humanities
  - mmlu_stem
  - mmlu_social_sciences
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
metadata:
  paper: "Measuring Massive Multitask Language Understanding"
  year: 2020
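The effect of weight_by_size in a configuration like the one above can be illustrated with a small calculation. This page does not show how the framework computes the aggregate, so the sketch assumes a pooled mean in which each subtask score is weighted by its number of samples. The function aggregate_mean and all scores and sizes are made up for illustration.

```python
from typing import Sequence


def aggregate_mean(
    scores: Sequence[float], sizes: Sequence[int], weight_by_size: bool = False
) -> float:
    """Illustrative mean over subtask scores, optionally weighted by size."""
    if len(scores) != len(sizes) or not scores:
        raise ValueError("scores and sizes must be non-empty and equal length")
    if not weight_by_size:
        # Unweighted: every subtask counts equally.
        return sum(scores) / len(scores)
    # Weighted: every sample counts equally, so larger subtasks dominate.
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Made-up per-subtask accuracies and sample counts.
scores = [0.5, 1.0, 0.75]
sizes = [100, 300, 100]

unweighted = aggregate_mean(scores, sizes)
weighted = aggregate_mean(scores, sizes, weight_by_size=True)
```

With these numbers the unweighted mean is 0.75 and the size-weighted mean is 0.85, because the largest subtask also has the highest score.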
Results Output
{
  "results": {
    "mmlu": {
      "acc": 0.75,
      "config": {
        "group": "mmlu",
        "group_alias": "MMLU Benchmark",
        "task": ["mmlu_humanities", "mmlu_stem", "mmlu_social_sciences"],
        "aggregate_metric_list": [{"metric": "acc", "aggregation": "mean"}]
      }
    }
  }
}
Design Decisions
- Dict inheritance -- Both config classes inherit from dict for backward compatibility with code expecting dictionary-like objects.
- Flexible input -- aggregate_metric_list accepts a single item or a list of items (dicts or AggMetricConfig instances) for convenience.
- Metadata field -- Extension point for custom use cases without modifying the framework.
- Callable serialization -- Uses inspect.getsource() to preserve function definitions in logs and output, falling back to str() when the source is unavailable.
- Mean-only aggregation -- "mean" is the only predefined aggregation; other behavior can be supplied as a custom callable.
- Property access -- ConfigurableGroup exposes config fields as properties for a cleaner API.