Implementation:EvolvingLMMs Lab Lmms eval IFEval Utils

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_IFEval_Utils.md)

Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/ifeval/utils.py

Principle: Task_Utility_Functions

Purpose

Evaluation utilities for IFEval (Instruction Following Evaluation) with strict and loose scoring modes.

Data Classes

InputExample

@dataclasses.dataclass

Fields:

  • key: int - unique identifier
  • instruction_id_list: list[str] - list of instruction IDs to check
  • prompt: str - original prompt
  • kwargs: list[Dict[str, Optional[Union[str, int]]]] - instruction parameters

OutputExample

@dataclasses.dataclass

Fields:

  • instruction_id_list: list[str]
  • prompt: str
  • response: str
  • follow_all_instructions: bool - all instructions followed
  • follow_instruction_list: list[bool] - per-instruction results
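
The two records above can be sketched as plain dataclasses. The field names and types follow the lists on this page; the example values at the bottom (key, instruction ID, prompt) are illustrative assumptions, not data taken from the benchmark.

```python
import dataclasses
from typing import Dict, Optional, Union


@dataclasses.dataclass
class InputExample:
    key: int                          # unique identifier
    instruction_id_list: list[str]    # instruction IDs to check
    prompt: str                       # original prompt
    # One parameter dict per instruction, aligned by index.
    kwargs: list[Dict[str, Optional[Union[str, int]]]]


@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list[str]
    prompt: str
    response: str
    follow_all_instructions: bool       # True only if every instruction passed
    follow_instruction_list: list[bool] # one result per instruction


# Illustrative values only (assumed for this sketch).
example = InputExample(
    key=1000,
    instruction_id_list=["punctuation:no_comma"],
    prompt="Write a short note without using any commas.",
    kwargs=[{}],
)
```

Note that kwargs is positional: the dict at index i belongs to the instruction at index i of instruction_id_list.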

Key Functions

test_instruction_following_strict

def test_instruction_following_strict(inp, response)

Strict evaluation mode:

  • Tests each instruction in instruction_id_list
  • Retrieves checker class from instructions_registry
  • Builds instruction description with kwargs (filtering None values)
  • Checks if response follows instruction
  • Returns OutputExample with boolean results
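
The strict steps above can be sketched with a toy registry. Everything named here (TOY_REGISTRY, the two checker classes, strict_sketch, the "toy:" instruction IDs) is hypothetical and stands in for instructions_registry and its checker classes; the sketch returns a plain dict rather than an OutputExample to stay self-contained, and it assumes a whitespace-only response counts as not following any instruction.

```python
from types import SimpleNamespace


class NoCommaChecker:
    """Hypothetical checker: takes no parameters."""

    def build_description(self):
        return "Do not use any commas."

    def check_following(self, response):
        return "," not in response


class MinWordsChecker:
    """Hypothetical checker: takes one parameter, min_words."""

    def build_description(self, min_words=0):
        self.min_words = min_words
        return f"Answer with at least {min_words} words."

    def check_following(self, response):
        return len(response.split()) >= self.min_words


# Stand-in for the real registry: instruction ID -> checker class.
TOY_REGISTRY = {"toy:no_comma": NoCommaChecker, "toy:min_words": MinWordsChecker}


def strict_sketch(inp, response):
    results = []
    for index, instruction_id in enumerate(inp.instruction_id_list):
        checker = TOY_REGISTRY[instruction_id]()
        # Drop None-valued parameters so build_description never receives
        # a keyword argument it does not declare.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v is not None}
        checker.build_description(**kwargs)
        # Assumption: an empty or whitespace-only response never passes.
        results.append(bool(response.strip()) and checker.check_following(response))
    return {
        "follow_all_instructions": all(results),
        "follow_instruction_list": results,
    }


inp = SimpleNamespace(
    instruction_id_list=["toy:no_comma", "toy:min_words"],
    prompt="Say hello in at least three words and without commas.",
    kwargs=[{"min_words": None}, {"min_words": 3}],
)
good = strict_sketch(inp, "Hello there my friend")
bad = strict_sketch(inp, "Hi, there")
```

Without the None filter, the first instruction would be called as build_description(min_words=None) and raise a TypeError, which is the failure the filtering step is meant to prevent.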

test_instruction_following_loose

def test_instruction_following_loose(inp, response)

Loose evaluation mode, which gives an upper bound on the strict score:

  • Creates 8 response variations: 4 line-trimmed forms (original, first line removed, last line removed, both removed), each with and without asterisks
  • Tests each instruction against all response variations
  • Marks instruction as followed if ANY variation passes
  • More lenient scoring to account for formatting variations

Response variations:

  • Original response
  • Revised (asterisks removed)
  • Remove first line
  • Remove last line
  • Remove both first and last lines
  • Combinations of above with asterisk removal
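
A minimal sketch of how the eight variations listed above could be built and used. The helper names loose_variants and follows_loosely are hypothetical, the ordering of the variants is an assumption, and the check argument stands in for a checker's check_following method.

```python
def loose_variants(response):
    """Return 8 candidate strings: 4 line-trimmed forms, each with and without asterisks."""
    lines = response.split("\n")
    remove_first = "\n".join(lines[1:]).strip()
    remove_last = "\n".join(lines[:-1]).strip()
    remove_both = "\n".join(lines[1:-1]).strip()
    trimmed = [response, remove_first, remove_last, remove_both]
    # Second half: the same four strings with every asterisk removed.
    return trimmed + [variant.replace("*", "") for variant in trimmed]


def follows_loosely(check, response):
    """An instruction counts as followed if ANY non-empty variant passes."""
    return any(bool(v.strip()) and check(v) for v in loose_variants(response))


def no_markup(text):
    # Hypothetical instruction: no commas and no asterisks in the answer.
    return "," not in text and "*" not in text


response = "Sure, here it is:\n**Answer**\nHope this helps!"
variants = loose_variants(response)
```

Here the raw response fails no_markup because of the comma in the first line and the emphasis markers, but the variant with both outer lines and the asterisks removed is just "Answer", so the loose mode marks the instruction as followed.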

process_results

def process_results(doc, results)

Main result processor:

  • Creates InputExample from document
  • Runs both strict and loose evaluation
  • Returns 4 metrics:
    • prompt_level_strict_acc - all instructions followed (strict)
    • inst_level_strict_acc - per-instruction results (strict)
    • prompt_level_loose_acc - all instructions followed (loose)
    • inst_level_loose_acc - per-instruction results (loose)
  • Logs warning about chat-finetuned model requirement
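
To make the difference between the prompt-level and instruction-level metrics concrete, here is a small sketch of the returned mapping. The helper build_metrics is hypothetical and takes already-computed per-instruction results; only the four metric names come from this page.

```python
def build_metrics(strict_list, loose_list):
    """Assemble the four metrics from per-instruction booleans (hypothetical helper)."""
    return {
        # Prompt level: a single boolean, True only if every instruction passed.
        "prompt_level_strict_acc": all(strict_list),
        # Instruction level: the raw per-instruction list, aggregated later.
        "inst_level_strict_acc": strict_list,
        "prompt_level_loose_acc": all(loose_list),
        "inst_level_loose_acc": loose_list,
    }


# Assumed outcome for one prompt with two instructions.
metrics = build_metrics(strict_list=[True, False], loose_list=[True, True])
```

In this assumed case the prompt fails under strict scoring because one instruction failed, but passes under loose scoring.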

agg_inst_level_acc

def agg_inst_level_acc(items)

Aggregates instruction-level accuracy by flattening nested lists and computing average.
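
A minimal sketch of that aggregation, assuming items is a non-empty list of per-prompt boolean lists. The function name carries a _sketch suffix to mark it as an illustration rather than the source implementation.

```python
def agg_inst_level_acc_sketch(items):
    """Flatten per-prompt result lists and return the mean pass rate."""
    flat = [passed for per_prompt in items for passed in per_prompt]
    # Booleans sum as 0/1; an empty input would divide by zero.
    return sum(flat) / len(flat)


# Three prompts with 2, 1 and 3 instructions respectively (assumed data).
items = [[True, False], [True], [True, True, False]]
accuracy = agg_inst_level_acc_sketch(items)
```

Because the lists are flattened first, prompts with more instructions carry more weight: here 4 of 6 instructions pass, even though only 1 of the 3 prompts passes every instruction.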

Implementation Details

  • Filters None kwargs to avoid unexpected keyword argument errors
  • Handles special "prompt" argument for certain instruction types
  • Counts a response as following an instruction only if it is non-empty after strip()
  • Two-level evaluation: prompt-level (all pass) and instruction-level (individual)
