Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA CoT Step Evaluation

Source File: `lmms_eval/tasks/videomathqa/cot_step_evaluation.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VideoMathQA CoT Step Evaluation module provides fine-grained evaluation of chain-of-thought reasoning for math problem solving. It uses a language model (Qwen3) to assess how well model-generated reasoning steps match the ground truth solution steps, assigning a score from 0 to 10 according to a detailed rubric. This allows evaluation that goes beyond final-answer correctness.

Key Functions

Data Preparation

prepare_input(sample, matched)

Prepares input data for evaluation

Extracts question metadata from ground truth
Extracts prediction from result data
Creates unified input dictionary with:
- Question ID, category, question text
- Multiple-choice options
- Ground truth answer and solution steps
- Model prediction
Returns prepared input dictionary
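
The merge described above can be sketched as follows. This is a minimal illustration, not the module's code: the input keys (`question_id`, `answer`, `steps`, `prediction`) are hypothetical, and only the output keys `qid`, `category`, and `question` are named elsewhere on this page.

```python
def prepare_input(sample: dict, matched: dict) -> dict:
    """Merge one ground-truth row and its matching result into one record."""
    return {
        "qid": sample["question_id"],                 # hypothetical input key
        "category": sample.get("category"),
        "question": sample["question"],
        "options": sample.get("options", []),
        "gt_answer": sample["answer"],                # hypothetical input key
        "gt_steps": sample.get("steps", []),          # hypothetical input key
        "prediction": matched.get("prediction", ""),  # hypothetical input key
    }
```

Keeping every field the prompt needs in one flat dictionary means the later batching code never has to look back at the two source files.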

prepare_batch_prompts(batch)

Prepares prompts for batch processing

Iterates through batch of samples
Generates user prompt for each using get_user_prompt
Returns list of formatted prompts

get_user_prompt(question, options, gt_steps, gt_answer, prediction)

Constructs evaluation prompt

Includes system prompt with scoring rubric
Provides question, options, and ground truth
Includes model prediction to evaluate
Specifies required output format
Returns formatted prompt string
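
A prompt of this shape could be assembled roughly as below. The rubric text, section labels, and record keys are placeholders chosen for illustration; the actual wording in `cot_step_evaluation.py` may differ.

```python
# Placeholder standing in for the real rubric text (an assumption).
SYSTEM_PROMPT = "You are grading chain-of-thought math solutions on a 0-10 scale."


def get_user_prompt(question, options, gt_steps, gt_answer, prediction):
    """Build one evaluation prompt: rubric, problem, reference, prediction, format."""
    options_text = "\n".join(options)
    steps_text = "\n".join(f"{i}. {step}" for i, step in enumerate(gt_steps, 1))
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Question:\n{question}\n\n"
        f"Options:\n{options_text}\n\n"
        f"Ground truth steps:\n{steps_text}\n\n"
        f"Ground truth answer: {gt_answer}\n\n"
        f"Model prediction:\n{prediction}\n\n"
        "Reply with SCORE_CARD: followed by a JSON object."
    )


def prepare_batch_prompts(batch):
    """Build one prompt per prepared sample (keys here are hypothetical)."""
    return [
        get_user_prompt(s["question"], s["options"], s["gt_steps"],
                        s["gt_answer"], s["prediction"])
        for s in batch
    ]
```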

Response Parsing

safe_parse_response(reply)

Parses model response into structured format

First attempts JSON parsing
Falls back to ast.literal_eval if JSON fails
Returns the parsed dictionary, or None if both attempts fail, instead of raising
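
The two-stage parse can be sketched as below. It follows the order described above (JSON first, then `ast.literal_eval`); the final dictionary check is an added assumption so that a bare number or list is not mistaken for a score card.

```python
import ast
import json


def safe_parse_response(reply: str):
    """Parse a reply as JSON, then as a Python literal; return a dict or None."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        try:
            # Covers replies that use single quotes or Python-style literals.
            parsed = ast.literal_eval(reply)
        except (ValueError, SyntaxError):
            return None
    return parsed if isinstance(parsed, dict) else None
```

The fallback matters because language models often emit Python-style dictionaries with single quotes, which are not valid JSON but are valid Python literals.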

Scoring Pipeline

compute_score(gt_data, res_data, res_file, tokenizer, llm, sampling_params, bs=64)

Main scoring computation function

Processes samples in batches of bs (the signature default is 64; main() passes 8)
For each sample:
- Finds matching result by question ID
- Prepares input data
- Accumulates into batch
When batch full:
- Generates batch prompts
- Formats as chat messages with thinking enabled
- Applies chat templates
- Generates responses using vLLM
- Parses responses to extract score dictionaries
- On error, substitutes the default fallback score card (score 0)
Collects scored samples with metadata
Saves detailed results to JSONL file
Computes mean score across all samples
Returns aggregated score
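
The control flow above can be sketched without vLLM by passing the generation and parsing steps in as callables. Everything here is illustrative: `score_in_batches`, `chunked`, and the `score_card` key are hypothetical names, and the real function also applies chat templates and writes the JSONL file.

```python
FALLBACK = {"matched_steps": "0/0", "final_answer_correct": 0,
            "critique": "Error", "score": 0}


def chunked(items, bs):
    """Yield consecutive slices of at most bs items."""
    for start in range(0, len(items), bs):
        yield items[start:start + bs]


def score_in_batches(samples, generate_fn, parse_fn, bs=64):
    """Score samples batch by batch; return (mean score, scored samples)."""
    scored = []
    for batch in chunked(samples, bs):
        replies = generate_fn(batch)  # stands in for the batched LLM call
        for sample, reply in zip(batch, replies):
            # Any unparseable reply receives the fallback card (score 0).
            card = parse_fn(reply) or dict(FALLBACK)
            scored.append({**sample, "score_card": card})
    if not scored:
        return 0.0, scored
    total = sum(item["score_card"].get("score", 0) for item in scored)
    return total / len(scored), scored
```

Because fallback cards count as zeros in the mean, a high parse-failure rate lowers the aggregate score; the saved per-sample records make it possible to check for that.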

Main Entry Point

main()

Command-line interface for step evaluation

Parses arguments:
- --model_path: Evaluator model (default: "Qwen/Qwen3-4B")
- --gt_file: Ground truth Parquet file
- --res_file: Results JSONL file
Loads tokenizer and LLM model
Configures sampling parameters:
- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Min-p: 0
- Max tokens: 32768
Loads ground truth from Parquet
Loads results from JSONL
Runs scoring computation with batch size 8
Outputs final step evaluation score
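
The three flags listed above map onto a small argparse parser. A minimal sketch: whether `--gt_file` and `--res_file` are marked required, and the help strings, are assumptions; model loading and scoring are left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three flags documented for main()."""
    parser = argparse.ArgumentParser(description="CoT step evaluation")
    parser.add_argument("--model_path", default="Qwen/Qwen3-4B",
                        help="Evaluator model")
    parser.add_argument("--gt_file", required=True,
                        help="Ground truth Parquet file")
    parser.add_argument("--res_file", required=True,
                        help="Results JSONL file")
    return parser
```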

Scoring System

System Prompt

The system prompt defines a detailed rubric with four main criteria:

1. Relative Step Matching (Main Criterion)

Count total ground truth steps: N
Evaluate how many predicted steps align with ground truth
Score = (matching steps / N) × 10, rounded
Steps match if they serve same mathematical purpose
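
The arithmetic of this criterion is simple enough to state in code. Note that the module itself does not compute this; the evaluator model is instructed to apply it. The rubric says only "rounded", so round-half-up is an assumption here, as is the helper name.

```python
import math


def relative_step_score(matched: int, total: int) -> int:
    """Return (matched / total) * 10, rounded half up, clamped to 0..10."""
    if total <= 0:
        return 0
    matched = max(0, min(matched, total))
    # floor(x + 0.5) rounds halves up; Python's round() would round 2.5 to 2.
    return math.floor(10 * matched / total + 0.5)
```

For example, 3 matched steps out of 4 gives 7.5, which rounds to 8, while 2 out of 5 gives exactly 4.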

2. Correct Final Answer via Different Reasoning

If final answer correct and reasoning valid: full score of 10
Ignore step matching if alternative reasoning is sound
Reduce score proportionally for flawed observations
Reward partially correct reasoning on valid paths

3. Implicit or Inferred Steps

Don't penalize skipped early steps if later logic depends on them
Credit steps that were likely understood implicitly
Check for implied steps before reducing score

4. Ignore Superficial Differences

Don't deduct for formatting or notation differences
Focus on underlying mathematical meaning
Don't require literal step-by-step matching

Output Format

SCORE_CARD: {
    "matched_steps": "X/N",
    "final_answer_correct": 0 or 1,
    "critique": "<2-3 sentence summary>",
    "score": <0-10>
}
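
With thinking enabled, a reply typically contains reasoning text before the score card, so the JSON object has to be located first. One way to do that, sketched under the assumption that the card follows the last `SCORE_CARD:` marker; `extract_score_card` is a hypothetical helper, not a function from the module.

```python
import json


def extract_score_card(reply: str):
    """Return the dict following the last SCORE_CARD: marker, or None."""
    marker = "SCORE_CARD:"
    idx = reply.rfind(marker)
    if idx == -1:
        return None
    tail = reply[idx + len(marker):]
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        card = json.loads(tail[start:end + 1])
    except json.JSONDecodeError:
        return None
    return card if isinstance(card, dict) else None
```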

Default Fallback Score

{
    "matched_steps": "0/0",
    "final_answer_correct": 0,
    "critique": "Error",
    "score": 0
}

Sampling Configuration

SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0,
    max_tokens=32768
)

max_tokens is set high (32768) so the evaluator has room for extended reasoning before it emits the score card.

Output Files

Scored Samples File

Generated filename format:

{original_filename}_step_scored_samples_qwen3_batch_think.jsonl

Each line contains:

Original sample data (qid, category, question, etc.)
Score dictionary with matched steps, correctness, critique, score
Raw model reply (if parsing failed)
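
Deriving the output path and writing one JSON object per line can be sketched as follows. Whether the original extension is stripped before the suffix is appended is an assumption, and both helper names are hypothetical.

```python
import json
import os

SUFFIX = "_step_scored_samples_qwen3_batch_think.jsonl"


def scored_samples_path(res_file: str) -> str:
    """Append the scored-samples suffix to the results file name (extension dropped)."""
    stem, _ext = os.path.splitext(res_file)
    return stem + SUFFIX


def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```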

Design Characteristics

Fine-Grained Evaluation: Scores reasoning quality on 0-10 scale
Nuanced Rubric: Rewards valid alternative reasoning and implicit steps
Batch Processing: Efficient evaluation using batched inference
Thinking Mode: Uses model's thinking capability for better evaluation
Error Handling: Graceful fallback when parsing fails
Detailed Output: Saves comprehensive evaluation metadata
Flexible Rubric: Balances strictness with recognition of valid alternatives

Dependencies

argparse - Command-line argument parsing
ast - ast.literal_eval fallback for replies that are Python literals rather than valid JSON
json - JSON operations
os - File system operations
pandas - Loading Parquet ground truth data
tqdm - Progress tracking
transformers.AutoTokenizer - Tokenizer loading
vllm - LLM inference (LLM, SamplingParams)

Usage Context

This tool provides detailed evaluation of mathematical reasoning in VideoMathQA. Rather than just checking if the final answer is correct, it assesses the quality of the reasoning process, awarding partial credit for correct steps and valid alternative approaches. This is particularly valuable for understanding model capabilities in multi-step mathematical problem solving.

Example Usage

python cot_step_evaluation.py \
    --model_path Qwen/Qwen3-4B \
    --gt_file videomathqa_val.parquet \
    --res_file results_cot.jsonl
