Implementation:EvolvingLMMs Lab Lmms eval IFEval Utils
Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/ifeval/utils.py
Principle: Task_Utility_Functions
Purpose
Evaluation utilities for IFEval (Instruction Following Evaluation) with strict and loose scoring modes.
Data Classes
InputExample
@dataclasses.dataclass
Fields:
- key: int - unique identifier
- instruction_id_list: list[str] - list of instruction IDs to check
- prompt: str - original prompt
- kwargs: list[Dict[str, Optional[Union[str, int]]]] - instruction parameters
OutputExample
@dataclasses.dataclass
Fields:
- instruction_id_list: list[str]
- prompt: str
- response: str
- follow_all_instructions: bool - all instructions followed
- follow_instruction_list: list[bool] - per-instruction results
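The two records can be sketched as plain dataclasses built from the field lists above. This is a minimal reconstruction, not a copy of utils.py; the example values (key, prompt text, kwargs contents) are illustrative assumptions.

```python
import dataclasses
from typing import Dict, List, Optional, Union


@dataclasses.dataclass
class InputExample:
    key: int
    instruction_id_list: List[str]
    prompt: str
    kwargs: List[Dict[str, Optional[Union[str, int]]]]


@dataclasses.dataclass
class OutputExample:
    instruction_id_list: List[str]
    prompt: str
    response: str
    follow_all_instructions: bool
    follow_instruction_list: List[bool]


# Illustrative values only: one kwargs dict per instruction ID, in the same order.
inp = InputExample(
    key=1000,
    instruction_id_list=["punctuation:no_comma", "length_constraints:number_words"],
    prompt="Write a short note without commas, using at least 5 words.",
    kwargs=[{}, {"relation": "at least", "num_words": 5}],
)

# A result in which the first instruction passed and the second failed.
per_instruction = [True, False]
out = OutputExample(
    instruction_id_list=inp.instruction_id_list,
    prompt=inp.prompt,
    response="Hi there",
    follow_all_instructions=all(per_instruction),
    follow_instruction_list=per_instruction,
)
```

Note that kwargs is positional: entry i holds the parameters for instruction_id_list[i].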
Key Functions
test_instruction_following_strict
def test_instruction_following_strict(inp, response)
Strict evaluation mode:
- Tests each instruction in instruction_id_list
- Retrieves checker class from instructions_registry
- Builds instruction description with kwargs (filtering None values)
- Checks if response follows instruction
- Returns OutputExample with boolean results
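The strict loop can be sketched as follows. The two checker classes and TOY_REGISTRY are hypothetical stand-ins for instructions_registry, and check_strict is an illustrative helper rather than the function in utils.py. The method names build_description, get_instruction_args, and check_following are assumed to match the checker interface; the None filtering and the empty-response guard follow the behaviour described on this page.

```python
from typing import Dict, List, Optional


class NoCommaChecker:
    """Toy checker (hypothetical): passes when the response has no commas."""

    def __init__(self, instruction_id: str):
        self.id = instruction_id

    def build_description(self, **kwargs) -> str:
        return "Do not use any commas."

    def get_instruction_args(self) -> Optional[dict]:
        return None

    def check_following(self, response: str) -> bool:
        return "," not in response


class MinWordsChecker:
    """Toy checker (hypothetical): passes when the response has enough words."""

    def __init__(self, instruction_id: str):
        self.id = instruction_id
        self.num_words = 1

    def build_description(self, *, num_words: int = 1) -> str:
        self.num_words = num_words
        return f"Answer with at least {num_words} words."

    def get_instruction_args(self) -> Optional[dict]:
        return {"num_words": self.num_words}

    def check_following(self, response: str) -> bool:
        return len(response.split()) >= self.num_words


# Hypothetical registry mapping instruction IDs to checker classes.
TOY_REGISTRY = {"toy:no_comma": NoCommaChecker, "toy:min_words": MinWordsChecker}


def check_strict(
    instruction_id_list: List[str],
    kwargs_list: List[Dict[str, object]],
    prompt: str,
    response: str,
) -> List[bool]:
    results = []
    for index, instruction_id in enumerate(instruction_id_list):
        checker = TOY_REGISTRY[instruction_id](instruction_id)
        # Drop None-valued kwargs so build_description never receives
        # a keyword it does not accept.
        kwargs = {k: v for k, v in kwargs_list[index].items() if v is not None}
        checker.build_description(**kwargs)
        # Assumption: checkers that list "prompt" among their args are
        # rebuilt with the original prompt.
        args = checker.get_instruction_args()
        if args and "prompt" in args:
            checker.build_description(prompt=prompt)
        # An empty or whitespace-only response never counts as followed.
        results.append(bool(response.strip()) and checker.check_following(response))
    return results
```

In this sketch, follow_instruction_list would be the returned list and follow_all_instructions would be all() over it.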
test_instruction_following_loose
def test_instruction_following_loose(inp, response)
Loose evaluation mode, which gives an upper bound on the strict score:
- Creates 8 response variations (original, remove first/last/both lines, remove asterisks)
- Tests each instruction against all response variations
- Marks instruction as followed if ANY variation passes
- More lenient scoring to account for formatting variations
Response variations:
- Original response
- Revised (asterisks removed)
- Remove first line
- Remove last line
- Remove both first and last lines
- Each of the three line-removal variants with asterisks also removed (3 more, 8 in total)
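The eight variations can be generated as in the sketch below. The helper names response_variations and follows_loosely are hypothetical, and the ordering of the list is an arbitrary choice here, since loose mode only asks whether any variation passes.

```python
from typing import Callable, List


def response_variations(response: str) -> List[str]:
    lines = response.split("\n")
    remove_first = "\n".join(lines[1:]).strip()
    remove_last = "\n".join(lines[:-1]).strip()
    remove_both = "\n".join(lines[1:-1]).strip()
    bases = [response, remove_first, remove_last, remove_both]
    # Four line-level variants, then the same four with asterisks stripped.
    return bases + [text.replace("*", "") for text in bases]


def follows_loosely(response: str, check: Callable[[str], bool]) -> bool:
    # Followed if ANY non-empty variation satisfies the checker.
    return any(bool(v.strip()) and check(v) for v in response_variations(response))


reply = "Sure, here it is:\n**Answer**\nHope this helps!"
variants = response_variations(reply)
```

For this reply, a checker that demands the exact text "Answer" fails on the original response but passes on the variant with the first line, last line, and asterisks removed, which is the kind of formatting noise loose mode is meant to forgive.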
process_results
def process_results(doc, results)
Main result processor:
- Creates InputExample from document
- Runs both strict and loose evaluation
- Returns 4 metrics:
  - prompt_level_strict_acc - all instructions followed (strict)
  - inst_level_strict_acc - per-instruction results (strict)
  - prompt_level_loose_acc - all instructions followed (loose)
  - inst_level_loose_acc - per-instruction results (loose)
- Logs warning about chat-finetuned model requirement
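The shape of the returned metrics can be illustrated with a small helper. build_metrics is a hypothetical name; it takes the per-instruction boolean lists that the strict and loose evaluations would produce and assumes the prompt-level value is simply all() over that list.

```python
from typing import Dict, List, Union


def build_metrics(
    strict_follow_list: List[bool], loose_follow_list: List[bool]
) -> Dict[str, Union[bool, List[bool]]]:
    return {
        # Prompt level: a single boolean per document.
        "prompt_level_strict_acc": all(strict_follow_list),
        # Instruction level: the raw per-instruction list, aggregated later.
        "inst_level_strict_acc": strict_follow_list,
        "prompt_level_loose_acc": all(loose_follow_list),
        "inst_level_loose_acc": loose_follow_list,
    }


metrics = build_metrics([True, False], [True, True])
```

Here the document fails at prompt level under strict scoring but passes under loose scoring, while the instruction-level entries stay as lists for later aggregation.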
agg_inst_level_acc
def agg_inst_level_acc(items)
Aggregates instruction-level accuracy by flattening nested lists and computing average.
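A minimal sketch of this aggregation, with a worked example contrasting it with prompt-level accuracy. The body is a reconstruction from the description above, and averaging the prompt-level booleans with a plain mean is an assumption about how that metric is aggregated.

```python
from typing import List


def agg_inst_level_acc(items: List[List[bool]]) -> float:
    # Flatten the per-document lists, then average over all instructions.
    flat = [flag for per_doc in items for flag in per_doc]
    return sum(flat) / len(flat)


# Two documents with two instructions each; one instruction fails.
per_doc_results = [[True, True], [True, False]]

inst_acc = agg_inst_level_acc(per_doc_results)  # 3 of 4 instructions pass
prompt_acc = sum(all(flags) for flags in per_doc_results) / len(per_doc_results)
```

With these inputs the instruction-level accuracy is 0.75 and the prompt-level accuracy is 0.5, because a single failed instruction fails its whole prompt.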
Implementation Details
- Filters None kwargs to avoid unexpected keyword argument errors
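- For instruction types whose arguments include "prompt", rebuilds the instruction description with the original prompt
- Treats an empty or whitespace-only response (one where response.strip() is empty) as not following the instruction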
- Two-level evaluation: prompt-level (all pass) and instruction-level (individual)