Implementation:EvolvingLMMs Lab Lmms eval IFEval Utils

Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/ifeval/utils.py

Principle: Task_Utility_Functions

Purpose

Evaluation utilities for IFEval (Instruction Following Evaluation) with strict and loose scoring modes.

@dataclasses.dataclass

Fields:

@dataclasses.dataclass

Fields:

def test_instruction_following_strict(inp, response)

Strict evaluation mode:

def test_instruction_following_loose(inp, response)

Loose evaluation mode providing upper bound:

Creates 8 response variations: the original response, first line removed, last line removed, and both removed, each with and without asterisks stripped
Tests each instruction against all response variations
Marks instruction as followed if ANY variation passes
More lenient scoring to account for formatting variations

Response variations:
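A minimal sketch of how the eight variations could be built and used, assuming lines are split on "\n" and each trimmed variant is stripped. The helper names `build_response_variations` and `follows_loosely` are hypothetical and are not taken from utils.py; `check` stands in for whatever per-instruction checker is applied.

```python
from typing import Callable, List


def build_response_variations(response: str) -> List[str]:
    """Return 8 candidate strings: 4 line-trimmed forms, each with and without asterisks."""
    lines = response.split("\n")
    remove_first = "\n".join(lines[1:]).strip()
    remove_last = "\n".join(lines[:-1]).strip()
    remove_both = "\n".join(lines[1:-1]).strip()
    base = [response, remove_first, remove_last, remove_both]
    # Second half: the same four strings with markdown asterisks removed.
    no_asterisks = [variant.replace("*", "") for variant in base]
    return base + no_asterisks


def follows_loosely(response: str, check: Callable[[str], bool]) -> bool:
    """An instruction counts as followed if ANY non-empty variation passes the check."""
    return any(v.strip() and check(v) for v in build_response_variations(response))


variations = build_response_variations("Sure!\n**answer**\nHope this helps")
passed = follows_loosely("Sure!\nanswer\nbye", lambda r: r == "answer")
```

Here the exact-match check fails on the original response but passes once the first and last lines are dropped, which is why loose scoring acts as an upper bound on strict scoring.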

def process_results(doc, results)

Main result processor:

Creates InputExample from document
Runs both strict and loose evaluation
Returns 4 metrics:
- prompt_level_strict_acc - all instructions followed (strict)
- inst_level_strict_acc - per-instruction results (strict)
- prompt_level_loose_acc - all instructions followed (loose)
- inst_level_loose_acc - per-instruction results (loose)
Logs warning about chat-finetuned model requirement
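The relationship between the four metrics can be sketched as follows. This is an illustration under assumptions, not the code in utils.py: `summarize_results` is a hypothetical helper, and the per-instruction pass/fail lists are taken as given rather than computed.

```python
from typing import Dict, List, Union


def summarize_results(
    strict_flags: List[bool], loose_flags: List[bool]
) -> Dict[str, Union[bool, List[bool]]]:
    """Build the four metric entries from per-instruction pass/fail flags."""
    return {
        # Prompt level: True only when every instruction in the prompt passed.
        "prompt_level_strict_acc": all(strict_flags),
        # Instruction level: the raw per-instruction list, aggregated later.
        "inst_level_strict_acc": strict_flags,
        "prompt_level_loose_acc": all(loose_flags),
        "inst_level_loose_acc": loose_flags,
    }


metrics = summarize_results([True, False], [True, True])
```

In this example the prompt fails under strict scoring because one instruction failed, while it passes under loose scoring.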

def agg_inst_level_acc(items)

Aggregates instruction-level accuracy by flattening nested lists and computing average.
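A short sketch of the flatten-and-average step, assuming each item is the list of per-instruction booleans for one prompt. The function name `agg_inst_level_acc_sketch` is hypothetical, and the sketch assumes at least one instruction is present.

```python
from typing import List


def agg_inst_level_acc_sketch(items: List[List[bool]]) -> float:
    """Flatten per-prompt lists of booleans and return the fraction that are True."""
    flat = [flag for per_prompt in items for flag in per_prompt]
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(flat) / len(flat)


accuracy = agg_inst_level_acc_sketch([[True, False], [True], [True, True, False]])
```

Because the lists are flattened first, a prompt with many instructions contributes more to the average than a prompt with one.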

Filters None kwargs to avoid unexpected keyword argument errors
Handles special "prompt" argument for certain instruction types
Requires the response to be non-empty after strip() before an instruction is checked
Two-level evaluation: prompt-level (all pass) and instruction-level (individual)
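The None-filtering and strip() notes above can be illustrated with a small sketch. Everything here is an assumption for illustration: `MinWordsInstruction`, `drop_none_kwargs`, and `is_followed` are hypothetical names, and the `build_description` / `check_following` methods are an assumed checker interface rather than a quotation of the task's code.

```python
from typing import Any, Dict


class MinWordsInstruction:
    """Hypothetical checker: the response must contain at least num_words words."""

    def build_description(self, num_words: int = 100) -> None:
        self.num_words = num_words

    def check_following(self, response: str) -> bool:
        return len(response.split()) >= self.num_words


def drop_none_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Remove None-valued entries so the checker only receives arguments it accepts."""
    return {k: v for k, v in kwargs.items() if v is not None}


def is_followed(instruction: MinWordsInstruction, kwargs: Dict[str, Any], response: str) -> bool:
    instruction.build_description(**drop_none_kwargs(kwargs))
    # A whitespace-only response never counts as following an instruction.
    return bool(response.strip()) and instruction.check_following(response)


# A record may carry keys for every instruction type, mostly set to None.
raw_kwargs = {"num_words": 3, "keyword": None}
ok = is_followed(MinWordsInstruction(), raw_kwargs, "one two three")
blank = is_followed(MinWordsInstruction(), raw_kwargs, "   ")
```

Passing `raw_kwargs` straight to `build_description` would raise a TypeError for the unexpected `keyword` argument, which is the failure the filtering step avoids.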
