
Principle: EvolvingLMMs-Lab lmms-eval Baseline Comparison

From Leeroopedia
1. Overview

Baseline Comparison enables the evaluation framework to load and compare model performance against established baseline results from other models. This principle defines how baseline results are stored, retrieved, and integrated into the evaluation process to provide performance context and facilitate model comparison.

2. Core Concepts

2.1. Baseline Sources

Baseline results can come from multiple sources:

- **Registry Presets**: Predefined model-task combinations with known locations
- **Local Files**: JSONL files stored on the local filesystem
- **HuggingFace Hub**: Datasets hosted on HuggingFace containing baseline results
- **Model-Task Matrix**: Two-dimensional organization (model × task)

2.2. Result Storage Format

Baseline results are stored as JSONL (JSON Lines) files:

- Each line represents one evaluation sample
- Contains a `doc_id` for matching against the current evaluation
- Includes scores/metrics for comparison
- May include predictions and targets
- Aggregated results are stored separately in JSON files

2.3. Registry Structure

A central registry maps model names to their baseline results:

- Model-level entries with metadata
- Task-specific entries under each model
- HuggingFace URLs or local paths
- Descriptions for documentation

2.4. Loading Strategy

Loading is flexible based on the argument format:

- Auto-match: `--baseline qwen25vl` (matches the current task)
- Explicit: `--baseline qwen25vl:mmbench` (specific task)
- Local: `--baseline /path/to/results.jsonl`
- HuggingFace: `--baseline hf://user/repo/file.jsonl`
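The dispatch among these argument formats can be sketched as a small resolver. This is an illustrative sketch only, not the actual lmms-eval implementation: `resolve_baseline_source` is a hypothetical name, the stub `BASELINE_REGISTRY` stands in for the real registry, and a real loader would go on to download `hf://` URLs rather than merely classify them.

```python
# Hypothetical dispatcher for --baseline argument formats (illustrative;
# names and the stub registry are assumptions, not lmms-eval internals).
BASELINE_REGISTRY = {
    "qwen25vl": {"videomme": {"hf_url": "hf://user/repo/results.jsonl"}},
}

def resolve_baseline_source(arg, current_task):
    """Map a --baseline argument to a (kind, location) pair."""
    if arg.startswith("hf://"):
        return ("huggingface", arg)        # hf://user/repo/file.jsonl
    if "/" in arg or arg.endswith(".jsonl"):
        return ("local", arg)              # /path/to/results.jsonl
    if ":" in arg:                         # explicit model:task preset
        model, task = arg.split(":", 1)
    else:                                  # preset name: auto-match current task
        model, task = arg, current_task
    entry = BASELINE_REGISTRY[model][task]
    return ("huggingface", entry["hf_url"])
```

The key design point is order of the checks: the `hf://` prefix must be tested before the path heuristic, since HuggingFace URLs also contain slashes.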

3. Key Components

3.1. Registry Management

- **BASELINE_REGISTRY**: Central dictionary mapping models to tasks
- **Model Metadata**: Model name, HuggingFace repo, version info
- **Task Entries**: Per-task baseline locations and descriptions
- **Extensibility**: Easy to add new models and tasks

3.2. Result Loading

- **load_baseline()**: Main entry point for loading baselines
- **Format Detection**: Automatically determines the source type
- **Sample Mapping**: Creates a doc_id → score/result mapping
- **Aggregation Loading**: Loads summary statistics if available

3.3. Score Extraction

- **Flexible Parsing**: Handles various result formats
- **Score Detection**: Finds score fields by naming convention
- **Prediction Matching**: Computes scores from pred/target pairs
- **Fallback Logic**: Multiple strategies for extracting scores

4. Related Principles

- Post_Processing_and_Metrics: baseline scores are compared to computed metrics
- Results_Output: output includes baseline comparisons
- Task_Directory_Structure: task names are used for matching baselines

5. Implementations

- Baseline_Loader: loading logic for the various baseline sources
- Baseline_Registry: central registry of available baselines

6. Design Considerations

6.1. Flexibility

- Support multiple input formats (preset, path, URL)
- Handle various result file structures
- Graceful fallbacks for score extraction
- Extensible registry structure

6.2. Performance

- Lazy loading (baselines are loaded only when needed)
- Efficient JSONL parsing (line by line)
- Caching of downloaded files (HuggingFace Hub)
- Minimal memory footprint

6.3. Maintainability

- Centralized registry for easy updates
- Clear separation of loading strategies
- Consistent error messages
- Documentation embedded in the registry structure

6.4. Reliability

- Validate baseline availability before evaluation
- Clear error messages for missing baselines
- Fallback strategies for score extraction
- Handle missing or malformed data

7. Usage Example

```python
# Load baseline by preset name (auto-match task)
doc_scores, agg_results = load_baseline("qwen25vl", task_name="videomme")

# Load baseline with explicit task
doc_scores, agg_results = load_baseline("qwen25vl:videomme", task_name="videomme")

# Load from local file
doc_scores, agg_results = load_baseline("/path/to/baseline.jsonl", task_name="videomme")

# Load from HuggingFace
doc_scores, agg_results = load_baseline("hf://user/repo/file.jsonl", task_name="videomme")

# Use in evaluation
for doc_id, result in evaluation_results.items():
    baseline_score = doc_scores.get(doc_id)
    if baseline_score is not None:
        comparison = result["score"] - baseline_score
        print(f"Doc {doc_id}: Current={result['score']:.3f}, Baseline={baseline_score:.3f}, Diff={comparison:+.3f}")
```

8. Registry Configuration

```python
BASELINE_REGISTRY = {
    "qwen25vl": {
        "_meta": {
            "model": "Qwen2.5-VL-7B-Instruct",
            "hf_repo": "mwxely/lmms-eval-test",
        },
        "videomme": {
            "hf_url": "hf://mwxely/lmms-eval-test/results.jsonl",
            "description": "VideoMME w/ subtitle",
        },
        "mmbench": {
            "hf_url": "hf://user/repo/mmbench_results.jsonl",
            "description": "MMBench evaluation",
        },
    },
    "gpt4v": {
        "_meta": {
            "model": "GPT-4V",
        },
        "videomme": {
            "path": "/shared/baselines/gpt4v_videomme.jsonl",
            "description": "GPT-4V on VideoMME",
        },
    },
}
```

9. File Format Specifications

9.1. JSONL Sample File Format

```jsonl
{"doc_id": 0, "score": 0.85, "pred_answer": "A", "answer": "A"}
{"doc_id": 1, "score": 0.0, "pred_answer": "B", "answer": "C"}
{"doc_id": 2, "score": 1.0, "pred_answer": "D", "answer": "D"}
```
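Turning a sample file in this format into the doc_id → score mapping used for comparison is a straightforward line-by-line parse. The sketch below is illustrative (the function name `load_sample_scores` is an assumption, not the actual lmms-eval helper):

```python
# Minimal sketch: parse a JSONL sample file into a doc_id -> score map.
# Function name is hypothetical; real loading also handles aggregated files.
import json

def load_sample_scores(path):
    scores = {}
    with open(path) as f:
        for line in f:            # one JSON object per line
            line = line.strip()
            if not line:          # tolerate blank lines
                continue
            sample = json.loads(line)
            scores[sample["doc_id"]] = sample.get("score")
    return scores
```

Parsing line by line keeps memory use proportional to one sample at a time, which matters for large baseline files.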

9.2. Aggregated Results Format

```json
{
  "results": {
    "videomme": {
      "acc": 0.6833,
      "acc_stderr": 0.0123,
      "num_samples": 900
    }
  },
  "config": {
    "model": "qwen25vl",
    "timestamp": "2024-11-11T20:21:27"
  }
}
```

10. Common Patterns

10.1. Adding a New Baseline

1. Add an entry to the registry:

```python
BASELINE_REGISTRY["new_model"] = {
    "_meta": {
        "model": "New Model Name",
        "hf_repo": "org/baseline-repo",
    },
    "task_name": {
        "hf_url": "hf://org/baseline-repo/results.jsonl",
        "description": "Description of baseline",
    },
}
```

2. Use it in evaluation:

```bash
python -m lmms_eval --model new_model --tasks task_name --baseline new_model
```

10.2. Custom Score Extraction

```python
def _extract_score_from_sample(sample, task_name):
    # Task-specific extraction logic
    if task_name == "custom_task":
        return sample.get("custom_score_field")
    # Standard extraction: first numeric field whose name contains "score"
    for key in sample:
        if "score" in key.lower():
            val = sample[key]
            if isinstance(val, (int, float)):
                return float(val)
    # Fallback: compute from prediction/target pair
    if "pred" in sample and "target" in sample:
        return 1.0 if sample["pred"] == sample["target"] else 0.0
    return None
```

11. Integration Points

11.1. Command-Line Interface

```bash
# Auto-match task from registry
python -m lmms_eval --model qwen25vl --tasks videomme --baseline qwen25vl

# Explicit task specification
python -m lmms_eval --model new_model --tasks mmbench --baseline qwen25vl:mmbench

# Local file
python -m lmms_eval --model new_model --tasks videomme --baseline /path/to/baseline.jsonl

# HuggingFace URL
python -m lmms_eval --model new_model --tasks videomme --baseline hf://user/repo/file.jsonl
```

11.2. Results Output

Baseline comparisons are included in the output:

```json
{
  "results": {
    "videomme": {
      "acc": 0.7012,
      "baseline_acc": 0.6833,
      "improvement": 0.0179,
      "per_sample_comparison": {
        "better": 45,
        "same": 820,
        "worse": 35
      }
    }
  }
}
```
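The `per_sample_comparison` tallies can be derived directly from the two per-sample score maps. The following is a hedged sketch under the assumption that both current and baseline scores are keyed by `doc_id`; the function name and tolerance parameter are illustrative, not lmms-eval internals:

```python
# Illustrative sketch: tally better/same/worse versus a baseline.
# Samples with no baseline score are skipped, mirroring the doc_id
# matching described above. `tol` (an assumption) absorbs float noise.
def compare_per_sample(current, baseline, tol=1e-9):
    counts = {"better": 0, "same": 0, "worse": 0}
    for doc_id, score in current.items():
        base = baseline.get(doc_id)
        if base is None:
            continue                  # no baseline for this sample
        diff = score - base
        if diff > tol:
            counts["better"] += 1
        elif diff < -tol:
            counts["worse"] += 1
        else:
            counts["same"] += 1
    return counts
```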

12. Best Practices

1. Use descriptive model and task names in the registry
2. Include metadata (paper references, dates, versions)
3. Store baselines in stable locations (HuggingFace preferred)
4. Include both per-sample and aggregated results
5. Document baseline conditions (prompts, settings)
6. Use consistent doc_id mapping across evaluations
7. Validate baseline availability before long evaluations
8. Version baseline results with timestamps or commits
9. Provide descriptions for each baseline entry
10. Test baseline loading before committing to the registry

13. Error Handling

- Missing baseline: clear error listing the available options
- Missing task: list the available tasks for the model
- Format errors: graceful fallback with warnings
- Network errors: retry with informative messages
- Score extraction failures: log and skip samples
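The first two behaviors (clear errors that list available options) can be sketched as a registry lookup with informative `KeyError`s. This is an assumption-laden illustration: `lookup_baseline`, the stub registry, and the exact messages are hypothetical, not the framework's actual code:

```python
# Hypothetical sketch of "missing baseline / missing task" error handling.
# Stub registry; real entries carry hf_url/path fields as shown earlier.
BASELINE_REGISTRY = {"qwen25vl": {"_meta": {}, "videomme": {}, "mmbench": {}}}

def lookup_baseline(model, task):
    if model not in BASELINE_REGISTRY:
        raise KeyError(
            f"Unknown baseline '{model}'. Available: "
            + ", ".join(sorted(BASELINE_REGISTRY))
        )
    tasks = [t for t in BASELINE_REGISTRY[model] if t != "_meta"]
    if task not in tasks:
        raise KeyError(
            f"No '{task}' baseline for '{model}'. Available tasks: "
            + ", ".join(sorted(tasks))
        )
    return BASELINE_REGISTRY[model][task]
```

Listing the valid alternatives in the error message turns a dead-end failure into an actionable one, which is the point of the "clear error with available options" guideline.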
