Principle:EvolvingLMMs Lab Lmms eval Dataset Preparation

Knowledge Sources	lmms-eval
Domains	Data_Processing, Evaluation
Last Updated	2026-02-14 00:00 GMT

Overview

Evaluation datasets must be retrieved from a repository, loaded into memory, and optionally preprocessed before any benchmark task can run.

Description

Dataset preparation is the foundational step in any evaluation pipeline. Before a model can be scored on a benchmark, the raw evaluation data must be acquired, parsed into a structured format, and made available to downstream components that construct prompts and compute metrics.

In the context of multimodal evaluation, dataset preparation carries additional complexity because datasets frequently contain heterogeneous column types: text fields, categorical labels, PIL images, video file paths, and audio streams. A robust preparation pipeline must handle all of these modalities while providing a uniform interface to the rest of the system.

The general workflow for dataset preparation follows three stages:

1. Acquisition: The dataset is downloaded from a remote hub (such as HuggingFace Hub) or loaded from a local directory. The acquisition step is controlled by a dataset identifier (a path or repository name) and an optional subset name. Caching and retry logic ensure that transient network failures do not prevent evaluation.

2. Schema inspection: Once loaded, the dataset's column schema is inspected. Columns whose feature type is a media type (images, image sequences, audio) are identified so that a lightweight "no-image" copy of the dataset, with those columns removed, can be maintained for operations that do not require heavy media objects, such as logging and serialization.

3. Preprocessing (optional): A user-supplied function (often called process_docs) can be applied to each split of the dataset. This function receives an entire HuggingFace Dataset split and returns a transformed version. Common uses include filtering rows, renaming columns, augmenting documents with derived fields, or converting answer formats.
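
The retry behavior mentioned in the acquisition stage can be sketched as follows. This is a minimal illustration under stated assumptions, not lmms-eval's actual implementation: load_with_retry, its parameters, and the flaky_loader stand-in are hypothetical names, and the loader is injected as a callable so the sketch runs without network access. In practice the callable would wrap a real loader such as datasets.load_dataset.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_retry(loader: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Hypothetical helper: call `loader` until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return loader()
        except (ConnectionError, TimeoutError) as err:
            # Only transient, network-style failures are retried.
            last_error = err
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"dataset load failed after {max_attempts} attempts") from last_error


# Hypothetical stand-in for a network-backed loader: fails twice, then succeeds.
calls = {"count": 0}


def flaky_loader():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network failure")
    return [{"question": "Is there a dog?", "answer": "yes"}]


rows = load_with_retry(flaky_loader, max_attempts=3, base_delay=0.0)
```

Injecting the loader keeps the retry policy separate from the data source, so the same wrapper could serve both remote and local loading.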

The separation between acquisition and preprocessing is important: it allows the same raw dataset to be reused across multiple tasks that differ only in how they interpret or filter the data.
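
As an illustration of this reuse, two hypothetical tasks could share one raw split and differ only in their preprocessing function. The sketch assumes the split object exposes filter and map the way a HuggingFace Dataset does; the function names and the category and answer columns are invented for the example.

```python
def process_docs_perception(dataset):
    """Hypothetical task A: keep only rows tagged as perception questions."""
    return dataset.filter(lambda doc: doc["category"] == "perception")


def process_docs_normalized(dataset):
    """Hypothetical task B: keep every row but normalize the answer format."""
    # Returning a dict from the mapped function overwrites the "answer" column.
    return dataset.map(lambda doc: {"answer": doc["answer"].strip().lower()})
```

Each function receives a whole split and returns a new one rather than mutating its input, so the same raw split can be passed to both.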

Usage

Use dataset preparation whenever you are creating a new evaluation task. You must specify at minimum a dataset_path pointing to a HuggingFace dataset repository (e.g., "lmms-lab/MME") and a split to evaluate on (e.g., test_split: test). If the dataset contains multiple configurations or subsets, provide dataset_name as well. If the raw data requires transformation before evaluation, implement a process_docs function in your task's utils.py and reference it with the !function YAML directive.
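
The minimum requirements above can be expressed as a small check. This is a hypothetical sketch, not part of lmms-eval: validate_task_config is an invented helper, the configuration is modeled as a plain Python dict mirroring the YAML keys named above, and validation_split is assumed to be an accepted evaluation split alongside test_split.

```python
from typing import List

# Assumption: an evaluation split is configured under one of these keys.
EVAL_SPLIT_KEYS = ("test_split", "validation_split")


def validate_task_config(config: dict) -> List[str]:
    """Return a list of problems; an empty list means the minimum fields are present."""
    problems = []
    if not config.get("dataset_path"):
        problems.append("dataset_path is required")
    if not any(config.get(key) for key in EVAL_SPLIT_KEYS):
        problems.append("an evaluation split is required, e.g. test_split")
    process_docs = config.get("process_docs")
    if process_docs is not None and not callable(process_docs):
        problems.append("process_docs must be callable")
    return problems


minimal = {"dataset_path": "lmms-lab/MME", "test_split": "test"}
incomplete = {"dataset_name": "default"}

minimal_problems = validate_task_config(minimal)
incomplete_problems = validate_task_config(incomplete)
```

Here dataset_name and process_docs are treated as optional, matching the description above: they are needed only for multi-configuration datasets and for data that must be transformed before evaluation.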

Theoretical Basis

Dataset preparation can be modeled as a simple pipeline:

RawSource(path, name) --> Load(split) --> D_raw
D_raw --> Inspect(features) --> D_raw, D_no_media
D_raw --> PreProcess(f) --> D_processed   (optional)

Where:

path is the HuggingFace repository identifier or local path
name is the optional dataset configuration name
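split is the role of the split being loaded (training, validation, test, or fewshot); each role maps to a split name in the dataset, e.g., test_split: test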
f is an optional preprocessing callable: f: Dataset -> Dataset
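
The pipeline above can be read as plain function composition. The following is a toy model under stated assumptions: a split is represented as a list of dicts rather than a HuggingFace Dataset, and prepare, toy_load, toy_inspect, and drop_unanswered are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Doc = Dict[str, object]
Split = List[Doc]


def prepare(
    load: Callable[[str], Split],
    inspect_fn: Callable[[Split], Tuple[Split, Split]],
    split: str,
    f: Optional[Callable[[Split], Split]] = None,
) -> Tuple[Split, Split, Split]:
    d_raw = load(split)                      # RawSource(path, name) --> Load(split) --> D_raw
    d_raw, d_no_media = inspect_fn(d_raw)    # D_raw --> Inspect(features) --> D_raw, D_no_media
    # PreProcess(f) is optional: without f, the processed split is the raw split.
    d_processed = f(d_raw) if f is not None else d_raw
    return d_raw, d_no_media, d_processed


def toy_load(split: str) -> Split:
    # Stand-in for a real loader; path and name are fixed inside this toy source.
    data = {
        "test": [
            {"question": "2+2?", "answer": "4", "image": b"\x89PNG"},
            {"question": "Capital of France?", "answer": "", "image": b"\x89PNG"},
        ]
    }
    return data[split]


def toy_inspect(d_raw: Split) -> Tuple[Split, Split]:
    # Stand-in for schema inspection: the media column is hard-coded here.
    d_no_media = [{k: v for k, v in doc.items() if k != "image"} for doc in d_raw]
    return d_raw, d_no_media


def drop_unanswered(d_raw: Split) -> Split:
    return [doc for doc in d_raw if doc["answer"]]


d_raw, d_no_media, d_processed = prepare(toy_load, toy_inspect, "test", f=drop_unanswered)
_, _, d_unprocessed = prepare(toy_load, toy_inspect, "test")
```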

The key invariant is that each row of the resulting dataset is a document dict -- a dictionary whose keys correspond to dataset column names, and whose values are the column entries for that row. All downstream components (prompt construction, visual extraction, metric computation) operate on these document dicts.
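
To make the invariant concrete, the sketch below shows two downstream steps that depend only on the document dict. The function names doc_to_text and exact_match, and the question and answer columns, are illustrative assumptions rather than a description of any specific task.

```python
from typing import Dict, List

Doc = Dict[str, object]


def doc_to_text(doc: Doc) -> str:
    """Prompt construction: reads one column of the document dict."""
    return f"Question: {doc['question']}\nAnswer:"


def exact_match(doc: Doc, prediction: str) -> float:
    """Metric computation: compares a prediction to the reference column."""
    reference = str(doc["answer"]).strip().lower()
    return 1.0 if prediction.strip().lower() == reference else 0.0


docs: List[Doc] = [
    {"question": "Is there a cat in the image?", "answer": "yes"},
    {"question": "Is the sky green?", "answer": "no"},
]

prompts = [doc_to_text(doc) for doc in docs]
scores = [exact_match(doc, pred) for doc, pred in zip(docs, ["Yes", "yes"])]
```

Because both functions take a single dict keyed by column name, they work unchanged on any split that provides the columns they read.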

For multimodal datasets, the schema inspection step builds the lightweight copy by removing every column whose feature type is a media type:

D_no_media = D_raw.remove_columns({c in columns(D_raw) | type(c) in {Image, Sequence[Image], Audio}})

This lightweight copy is used for serialization and logging where media objects would be unnecessarily expensive.
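
The column selection can be sketched in pure Python. This is a simplified model, not the library code: the schema is represented as a dict from column name to a type-name string, and media_columns and remove_columns are hypothetical helpers. With a real HuggingFace Dataset, the schema would come from its features attribute and the removal from its remove_columns method.

```python
from typing import Dict, Iterable, List

# Type names taken from the set in the formula above.
MEDIA_TYPES = {"Image", "Sequence[Image]", "Audio"}


def media_columns(features: Dict[str, str]) -> List[str]:
    """Return the columns whose declared type is a media type."""
    return [name for name, type_name in features.items() if type_name in MEDIA_TYPES]


def remove_columns(rows: List[dict], columns: Iterable[str]) -> List[dict]:
    """Return copies of the rows without the given columns."""
    drop = set(columns)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


features = {"question": "string", "answer": "string", "image": "Image", "clip": "Audio"}
d_raw = [{"question": "What is shown?", "answer": "a dog", "image": b"<png>", "clip": b"<wav>"}]

to_drop = media_columns(features)
d_no_media = remove_columns(d_raw, to_drop)
```

The raw rows are left untouched, so components that need the media objects can still read them from the original split.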
