Implementation: EvolvingLMMs-Lab lmms-eval Baseline Loader

From Leeroopedia
**File**: `/tmp/kapso_repo_sslb_59s/lmms_eval/baselines/loader.py`

1. Overview

The Baseline Loader provides utilities for loading baseline results from multiple sources (registry presets, local files, HuggingFace Hub) and extracting scores for comparison with current evaluation results. It supports flexible input formats and handles various result file structures.

2. Key Components

2.1. Main Loading Function

2.1.1. load_baseline

```python
def load_baseline(baseline_arg: str, task_name: str) -> Tuple[Dict[int, Any], Optional[Dict[str, Any]]]:
```

**Purpose**: Main entry point for loading baseline results from any source.

**Parameters**:

- `baseline_arg` (str): One of:
  - Model preset: `"qwen25vl"` (auto-match task from `BASELINE_REGISTRY`)
  - Model:task preset: `"qwen25vl:mmbench"` (explicit task)
  - Local path: `"/path/to/results.jsonl"`
  - HF URL: `"hf://user/repo/file.jsonl"`
- `task_name` (str): Current task name for auto-matching presets

**Returns**: Tuple of:

- `doc_id_to_scores` (Dict[int, Any]): Mapping from document ID to score
- `aggregated_results` (Optional[Dict[str, Any]]): Optional aggregated metrics
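The shape of the returned tuple can be sketched as follows (the values are illustrative only, not a real baseline):

```python
from typing import Any, Dict, Optional, Tuple

# Illustrative return values: per-document scores plus optional aggregates.
doc_id_to_scores: Dict[int, Any] = {0: 1.0, 1: 0.0, 2: 1.0}
aggregated_results: Optional[Dict[str, Any]] = {
    "results": {"videomme": {"acc": 0.6833}}
}
baseline: Tuple[Dict[int, Any], Optional[Dict[str, Any]]] = (
    doc_id_to_scores,
    aggregated_results,
)
```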

**Implementation Logic**:

```python
# 1. Check for explicit model:task format
if ":" in baseline_arg and not baseline_arg.startswith("hf://"):
    model_name, explicit_task = baseline_arg.split(":", 1)
    if model_name in BASELINE_REGISTRY:
        return _load_from_registry(model_name, explicit_task, baseline_arg)

# 2. Check for model preset (auto-match task)
if baseline_arg in BASELINE_REGISTRY:
    return _load_from_registry(baseline_arg, task_name, baseline_arg)

# 3. Direct HF URL
if baseline_arg.startswith("hf://") or "huggingface.co" in baseline_arg:
    return _load_baseline_from_hf(baseline_arg, task_name)

# 4. Local path
if os.path.exists(baseline_arg):
    return _load_baseline_from_local(baseline_arg, task_name)

raise ValueError(
    f"Cannot load baseline '{baseline_arg}'. "
    f"Available presets: {list(BASELINE_REGISTRY.keys())}"
)
```

2.2. Registry Loading

2.2.1. _load_from_registry

```python
def _load_from_registry(model_name: str, task_name: str, baseline_arg: str) -> Tuple[Dict[int, Any], Optional[Dict[str, Any]]]:
```

**Purpose**: Load baseline from registry by model and task name.

**Implementation**:

```python
model_entry = BASELINE_REGISTRY[model_name]

# Check if task exists for this model
if task_name not in model_entry:
    available_tasks = [k for k in model_entry.keys() if not k.startswith("_")]
    raise ValueError(
        f"No baseline for model '{model_name}' on task '{task_name}'. "
        f"Available tasks: {available_tasks}"
    )

task_entry = model_entry[task_name]
eval_logger.info(f"[Baseline] Using preset '{model_name}' for task '{task_name}'")

if "hf_url" in task_entry:
    return _load_baseline_from_hf(task_entry["hf_url"], task_name)
elif "path" in task_entry:
    return _load_baseline_from_local(task_entry["path"], task_name)
else:
    raise ValueError(f"Preset '{baseline_arg}' has no 'hf_url' or 'path'")
```

**Key Features**:

- Validates task availability for the model
- Lists available tasks on error
- Delegates to HF or local loading based on the entry
- Logs preset usage
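The registry structure implied by this lookup logic can be sketched standalone. The entries below are hypothetical, not the real contents of `BASELINE_REGISTRY`:

```python
# Hypothetical registry shape inferred from the lookup logic:
# model name -> task name -> entry with either "hf_url" or "path".
BASELINE_REGISTRY = {
    "qwen25vl": {
        "_meta": "keys starting with '_' are skipped when listing tasks",
        "videomme": {"hf_url": "hf://user/repo/qwen25vl_samples_videomme.jsonl"},
        "mmbench": {"path": "/path/to/qwen25vl_samples_mmbench.jsonl"},
    },
}

# Same filtering the error path uses to list available tasks.
available_tasks = [k for k in BASELINE_REGISTRY["qwen25vl"] if not k.startswith("_")]
```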

2.3. Local File Loading

2.3.1. _load_baseline_from_local

```python
def _load_baseline_from_local(path: str, task_name: str) -> Tuple[Dict[int, Any], Optional[Dict[str, Any]]]:
```

**Purpose**: Load baseline from a local JSONL file.

**Implementation**:

```python
eval_logger.info(f"[Baseline] Loading from: {path}")
doc_id_to_scores = {}
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        sample = json.loads(line)
        doc_id = sample.get("doc_id")
        if doc_id is None:
            continue
        score = _extract_score_from_sample(sample, task_name)
        if score is not None:
            doc_id_to_scores[doc_id] = score

eval_logger.info(f"[Baseline] Loaded {len(doc_id_to_scores)} samples")
```

**Key Features**:

- Line-by-line parsing for memory efficiency
- Skips empty lines and samples without `doc_id`
- Extracts scores using flexible extraction logic
- Logs the number of samples loaded

**Aggregated Results Loading**:

```python
# Try to load aggregated results
agg_results = None
dir_path = os.path.dirname(path)
base_name = os.path.basename(path)
parts = base_name.split("_samples_")
if len(parts) == 2:
    results_path = os.path.join(dir_path, parts[0] + "_results.json")
    if os.path.exists(results_path):
        with open(results_path, "r") as f:
            agg_results = json.load(f)

return doc_id_to_scores, agg_results
```

**Convention**: Looks for a companion `_results.json` file based on the `_samples_` naming pattern.
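The naming convention can be exercised on its own. `companion_results_path` is a hypothetical helper name; it mirrors the path derivation shown above:

```python
import os
from typing import Optional

def companion_results_path(samples_path: str) -> Optional[str]:
    """Derive the aggregated-results path from a `_samples_` JSONL filename.

    Hypothetical helper mirroring the convention: everything before
    `_samples_` plus `_results.json`, in the same directory.
    """
    dir_path = os.path.dirname(samples_path)
    base_name = os.path.basename(samples_path)
    parts = base_name.split("_samples_")
    if len(parts) != 2:
        return None
    return os.path.join(dir_path, parts[0] + "_results.json")
```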
2.4. HuggingFace Hub Loading

2.4.1. _load_baseline_from_hf

```python
def _load_baseline_from_hf(hf_path: str, task_name: str) -> Tuple[Dict[int, Any], Optional[Dict[str, Any]]]:
```

**Purpose**: Load baseline from a HuggingFace Hub dataset.

**Path Formats**:

- `hf://user/repo/file.jsonl`: Download specific file
- `hf://user/repo`: Download first JSONL file found
- `huggingface.co/datasets/user/repo`: Web URL format
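The three accepted path shapes can be checked with a small parsing sketch. `parse_hf_path` is a hypothetical helper that mirrors the parsing logic shown below:

```python
from typing import Optional, Tuple

def parse_hf_path(hf_path: str) -> Tuple[str, Optional[str]]:
    """Split an HF baseline path into (repo_id, optional filename)."""
    if hf_path.startswith("hf://"):
        parts = hf_path[5:].split("/")
        if len(parts) >= 3:
            # hf://user/repo/file.jsonl -> explicit file inside the repo
            return "/".join(parts[:2]), "/".join(parts[2:])
        # hf://user/repo -> whole repo, first JSONL is picked later
        return "/".join(parts), None
    # Web URL form: strip everything up to the dataset repo id
    return hf_path.split("huggingface.co/datasets/")[-1].rstrip("/"), None
```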

**Implementation**:

```python
from huggingface_hub import hf_hub_download, list_repo_files

# Parse HF path: hf://user/repo/file.jsonl or hf://user/repo
if hf_path.startswith("hf://"):
    path_parts = hf_path[5:].split("/")
    if len(path_parts) >= 3:
        # hf://user/repo/file.jsonl -> download specific file
        repo_id = "/".join(path_parts[:2])
        filename = "/".join(path_parts[2:])
        eval_logger.info(f"[Baseline] Loading from HF: {repo_id}/{filename}")
        local_path = hf_hub_download(repo_id, filename, repo_type="dataset")
        return _load_baseline_from_local(local_path, task_name)
    else:
        repo_id = "/".join(path_parts)
else:
    repo_id = hf_path.split("huggingface.co/datasets/")[-1].rstrip("/")

eval_logger.info(f"[Baseline] Loading from HF: {repo_id}")
files = list_repo_files(repo_id, repo_type="dataset")
jsonl_files = [f for f in files if f.endswith(".jsonl")]
if not jsonl_files:
    raise ValueError(f"No JSONL files in HF repo: {repo_id}")

local_path = hf_hub_download(repo_id, jsonl_files[0], repo_type="dataset")
return _load_baseline_from_local(local_path, task_name)
```

**Key Features**:

- Downloads to the local cache automatically
- Supports an explicit file or the first JSONL in the repo
- Handles both `hf://` and web URL formats
- Delegates to local loading after download

2.5. Score Extraction

2.5.1. _extract_score_from_sample

```python
def _extract_score_from_sample(sample: Dict[str, Any], task_name: str) -> Optional[float]:
```

**Purpose**: Extract a score from a sample dict using flexible strategies.

**Strategy 1: Find Score Field**

```python
# Try task-specific score key
for key in sample:
    if "score" in key.lower():
        val = sample[key]
        if isinstance(val, (int, float)):
            return float(val)
```

Looks for any field whose name contains "score" and holds a numeric value.

**Strategy 2: Extract from Dict Value**

```python
# Continues the isinstance check inside the loop from Strategy 1
elif isinstance(val, dict):
    pred = val.get("pred_answer") or val.get("pred")
    ans = val.get("answer") or val.get("target")
    if pred and ans:
        return 1.0 if str(pred).strip().upper() == str(ans).strip().upper() else 0.0
```

If the score field is a dict, extracts the prediction and answer and computes an exact match.

**Strategy 3: Fallback Computation**

```python
# Fallback: compute from target and filtered_resps
target = sample.get("target")
filtered_resps = sample.get("filtered_resps")
if target and filtered_resps:
    pred = filtered_resps[0] if isinstance(filtered_resps, list) else filtered_resps
    if isinstance(pred, list):
        pred = pred[0] if pred else ""
    return 1.0 if str(pred).strip().upper() == str(target).strip().upper() else 0.0

return None
```

Computes the score from the `target` and `filtered_resps` fields.

**Key Features**:

- Multiple fallback strategies
- Case-insensitive comparison
- Handles various field names
- Returns None if extraction fails
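Combining the three strategies, a self-contained sketch behaves as follows. `extract_score` is a hypothetical consolidated name (the real function also takes a `task_name` argument), using the field names documented above:

```python
from typing import Any, Dict, Optional

def extract_score(sample: Dict[str, Any]) -> Optional[float]:
    """Sketch of the fallback chain: numeric score field, nested
    score dict, then target vs. filtered_resps exact match."""
    # Strategies 1-2: any field whose name contains "score"
    for key, val in sample.items():
        if "score" in key.lower():
            if isinstance(val, (int, float)):
                return float(val)
            elif isinstance(val, dict):
                pred = val.get("pred_answer") or val.get("pred")
                ans = val.get("answer") or val.get("target")
                if pred and ans:
                    return 1.0 if str(pred).strip().upper() == str(ans).strip().upper() else 0.0
    # Strategy 3: exact match on target vs. first filtered response
    target = sample.get("target")
    filtered_resps = sample.get("filtered_resps")
    if target and filtered_resps:
        pred = filtered_resps[0] if isinstance(filtered_resps, list) else filtered_resps
        if isinstance(pred, list):
            pred = pred[0] if pred else ""
        return 1.0 if str(pred).strip().upper() == str(target).strip().upper() else 0.0
    return None
```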

3. Dependencies

- `json`: JSON parsing
- `os`: File path operations
- `typing`: Type annotations
- `lmms_eval.baselines.registry`: BASELINE_REGISTRY
- `huggingface_hub`: HF Hub download (optional import)
- `loguru` or `logging`: Logging
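Because `huggingface_hub` is only needed for HF sources, an optional-import pattern keeps local and registry loading working without it. This is a generic sketch of the pattern, not necessarily the file's exact wording:

```python
import importlib
from types import ModuleType
from typing import Optional

def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the module if installed, else None (soft dependency)."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

# None when huggingface_hub is not installed; HF loading then raises,
# while local and registry paths remain usable.
hf_hub = optional_import("huggingface_hub")
```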

4. Usage Examples

4.1. Load from Registry Preset

```python
from lmms_eval.baselines.loader import load_baseline

# Auto-match task
doc_scores, agg_results = load_baseline("qwen25vl", task_name="videomme")

# Explicit task
doc_scores, agg_results = load_baseline("qwen25vl:videomme", task_name="videomme")
```

4.2. Load from Local File

```python
doc_scores, agg_results = load_baseline(
    "/path/to/baseline_samples_videomme.jsonl",
    task_name="videomme",
)

# Check if aggregated results were found
if agg_results:
    print(f"Baseline accuracy: {agg_results['results']['videomme']['acc']}")
```

4.3. Load from HuggingFace Hub

```python
# Specific file
doc_scores, agg_results = load_baseline(
    "hf://user/repo/results.jsonl",
    task_name="videomme",
)

# First JSONL in repo
doc_scores, agg_results = load_baseline(
    "hf://user/repo",
    task_name="videomme",
)

# Web URL format
doc_scores, agg_results = load_baseline(
    "https://huggingface.co/datasets/user/repo",
    task_name="videomme",
)
```

4.4. Use in Evaluation

```python
# Load baseline
baseline_scores, _ = load_baseline("qwen25vl", "videomme")

# Compare with current results
for doc_id, current_result in evaluation_results.items():
    baseline_score = baseline_scores.get(doc_id)
    if baseline_score is not None:
        current_score = current_result["score"]
        diff = current_score - baseline_score
        print(f"Doc {doc_id}: {current_score:.3f} vs {baseline_score:.3f} (diff: {diff:+.3f})")
```

5. Error Handling

5.1. Missing Baseline

```python
# Raises ValueError with available presets
load_baseline("nonexistent_model", "task")
# ValueError: Cannot load baseline 'nonexistent_model'. Available presets: ['qwen25vl', ...]
```

5.2. Missing Task

```python
# Raises ValueError with available tasks for the model
load_baseline("qwen25vl:nonexistent_task", "videomme")
# ValueError: No baseline for model 'qwen25vl' on task 'nonexistent_task'. Available tasks: ['videomme', 'mmbench']
```

5.3. Missing File

```python
# Raises FileNotFoundError
load_baseline("/nonexistent/path.jsonl", "task")
```

5.4. Malformed Data

- Skips samples without `doc_id` or an extractable score
- Logs a warning and continues

6. Design Decisions

1. **Flexible Input**: Single function handles all input types for simplicity

2. **Tuple Return**: Returns both per-sample and aggregated results together

3. **Optional Aggregation**: Aggregated results optional, attempts conventional filename

4. **JSONL Format**: Line-by-line parsing for memory efficiency with large files

5. **Score Extraction Strategies**: Multiple fallbacks handle various result formats

6. **HuggingFace Integration**: Uses Hub caching for efficient repeated access

7. **Registry Separation**: Registry in separate file for easy maintenance

8. **Logging**: Info-level logs for transparency without verbosity

7. File Format Expectations

7.1. Sample JSONL Format

```jsonl
{"doc_id": 0, "score": 0.85, "pred_answer": "A", "answer": "A"}
{"doc_id": 1, "score": 0.0, "pred_answer": "B", "answer": "C"}
```

Or with a nested score dict:

```jsonl
{"doc_id": 0, "score": {"pred_answer": "A", "answer": "A"}}
{"doc_id": 1, "score": {"pred": "B", "target": "C"}}
```

Or with the framework format:

```jsonl
{"doc_id": 0, "target": "A", "filtered_resps": ["A"]}
{"doc_id": 1, "target": "C", "filtered_resps": ["B"]}
```

7.2. Aggregated Results Format

```json
{
  "results": {
    "videomme": {
      "acc": 0.6833,
      "acc_stderr": 0.0123
    }
  }
}
```

8. Related Components

- Baseline_Registry: Registry of available baselines
- Baseline_Comparison: Principle this implements
- Results_Output: Uses baseline data for comparison output
- Post_Processing_and_Metrics: Metrics compared against baseline

9. Best Practices

1. Use registry presets for reproducibility
2. Include `doc_id` in all baseline samples
3. Follow the naming convention for aggregated results
4. Store baselines on HuggingFace for sharing
5. Document baseline conditions in the registry
6. Test score extraction with sample data
7. Handle missing baselines gracefully
8. Log the baseline source for transparency
