Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA CoT Step Evaluation

From Leeroopedia

Source File: `lmms_eval/tasks/videomathqa/cot_step_evaluation.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VideoMathQA CoT Step Evaluation module provides fine-grained evaluation of chain-of-thought reasoning for math problem solving. It uses a language model (Qwen3; the default evaluator is Qwen/Qwen3-4B) to assess how well model-generated reasoning steps match the ground truth solution steps, assigning a score from 0 to 10 according to a detailed rubric. This gives a more nuanced signal than final-answer correctness alone.

Key Functions

Data Preparation

prepare_input(sample, matched)
Prepares input data for evaluation
  • Extracts question metadata from ground truth
  • Extracts prediction from result data
  • Creates unified input dictionary with:
    • Question ID, category, question text
    • Multiple-choice options
    • Ground truth answer and solution steps
    • Model prediction
  • Returns prepared input dictionary
prepare_batch_prompts(batch)
Prepares prompts for batch processing
  • Iterates through batch of samples
  • Generates user prompt for each using get_user_prompt
  • Returns list of formatted prompts
get_user_prompt(question, options, gt_steps, gt_answer, prediction)
Constructs evaluation prompt
  • Includes system prompt with scoring rubric
  • Provides question, options, and ground truth
  • Includes model prediction to evaluate
  • Specifies required output format
  • Returns formatted prompt string
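
The data preparation flow described above can be sketched as follows. This is a minimal illustration, not the module's code: the dictionary keys (question_id, steps, prediction, and so on) and the SYSTEM_PROMPT placeholder are assumptions, and the real Parquet and JSONL field names may differ.

```python
# Hypothetical field names and prompt text; only the function names and
# argument lists come from the page above.
SYSTEM_PROMPT = "You are grading chain-of-thought steps. <rubric goes here>"


def prepare_input(sample, matched):
    """Merge ground-truth fields and the model prediction into one record."""
    return {
        "qid": sample["question_id"],
        "category": sample["category"],
        "question": sample["question"],
        "options": sample["options"],
        "gt_answer": sample["answer"],
        "gt_steps": sample["steps"],
        "prediction": matched["prediction"],
    }


def get_user_prompt(question, options, gt_steps, gt_answer, prediction):
    """Build one evaluation prompt: rubric, problem, reference, prediction."""
    option_text = "\n".join(options)
    step_text = "\n".join(f"{i}. {s}" for i, s in enumerate(gt_steps, 1))
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Question: {question}\n"
        f"Options:\n{option_text}\n"
        f"Ground truth steps:\n{step_text}\n"
        f"Ground truth answer: {gt_answer}\n"
        f"Model prediction: {prediction}\n\n"
        "Respond with SCORE_CARD: {...}"
    )


def prepare_batch_prompts(batch):
    """Return one prompt string per prepared input in the batch."""
    return [
        get_user_prompt(
            item["question"],
            item["options"],
            item["gt_steps"],
            item["gt_answer"],
            item["prediction"],
        )
        for item in batch
    ]


# Small demonstration with made-up data.
demo_sample = {
    "question_id": 1,
    "category": "geometry",
    "question": "What is the area of the triangle?",
    "options": ["A. 6", "B. 12"],
    "answer": "A",
    "steps": ["Read base 3 and height 4.", "Compute 0.5 * 3 * 4 = 6."],
}
demo_prompts = prepare_batch_prompts(
    [prepare_input(demo_sample, {"prediction": "Area is 6, so A."})]
)
```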

Response Parsing

safe_parse_response(reply)
Parses model response into structured format
  • First attempts JSON parsing
  • Falls back to ast.literal_eval if JSON fails
  • Returns the parsed dictionary, or None if both parsers fail
  • Catches parsing exceptions instead of propagating them to the caller
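
A minimal sketch of the two-stage parse described above, assuming the reply string passed in already contains only the dictionary text. The exception types and the dictionary check are choices made for this illustration; the source file may handle them differently.

```python
import ast
import json


def safe_parse_response(reply):
    """Try JSON first, then a Python literal; return a dict or None."""
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(reply)
        except (ValueError, SyntaxError, TypeError):
            # json.JSONDecodeError is a ValueError subclass.
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

The ast.literal_eval fallback matters because language models often emit Python-style dictionaries with single quotes, which are not valid JSON.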

Scoring Pipeline

compute_score(gt_data, res_data, res_file, tokenizer, llm, sampling_params, bs=64)
Main scoring computation function
  • Processes samples in batches of size bs (the signature default is 64; main() passes 8)
  • For each sample:
    • Finds matching result by question ID
    • Prepares input data
    • Accumulates into batch
  • When batch full:
    • Generates batch prompts
    • Formats as chat messages with thinking enabled
    • Applies chat templates
    • Generates responses using vLLM
    • Parses responses to extract score dictionaries
    • On a parsing failure, substitutes the default fallback score dictionary (score 0)
  • Collects scored samples with metadata
  • Saves detailed results to JSONL file
  • Computes mean score across all samples
  • Returns aggregated score
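
The batching and aggregation logic above can be sketched without vLLM by passing the generation step in as a callable. Everything here is illustrative: score_all, iter_batches, parse_or_none, and the score_dict and raw_reply keys are hypothetical names, and the stub generator stands in for the chat-template and vLLM calls.

```python
import json

FALLBACK = {
    "matched_steps": "0/0",
    "final_answer_correct": 0,
    "critique": "Error",
    "score": 0,
}


def iter_batches(items, bs):
    """Yield consecutive slices of at most bs items (last one may be short)."""
    for start in range(0, len(items), bs):
        yield items[start:start + bs]


def parse_or_none(reply):
    """Simplified parser for this sketch: JSON dict or None."""
    try:
        parsed = json.loads(reply)
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def score_all(prepared, generate, bs=8):
    """Score prepared inputs batch by batch and return (mean, records)."""
    scored = []
    for batch in iter_batches(prepared, bs):
        replies = generate(batch)  # stands in for the vLLM generation call
        for item, reply in zip(batch, replies):
            parsed = parse_or_none(reply)
            record = dict(item)
            if parsed is None:
                record["score_dict"] = dict(FALLBACK)
                record["raw_reply"] = reply  # kept for debugging
            else:
                record["score_dict"] = parsed
            scored.append(record)
    if not scored:
        return 0.0, scored
    mean = sum(r["score_dict"]["score"] for r in scored) / len(scored)
    return mean, scored


def fake_generate(batch):
    """Stub evaluator: even qids get a full score, odd qids return junk."""
    return ['{"score": 10}' if x["qid"] % 2 == 0 else "garbage" for x in batch]


demo_mean, demo_scored = score_all(
    [{"qid": i} for i in range(5)], fake_generate, bs=2
)
```

Slicing by bs also covers a final partial batch, so no samples are dropped when the dataset size is not a multiple of the batch size.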

Main Entry Point

main()
Command-line interface for step evaluation
  • Parses arguments:
    • --model_path: Evaluator model (default: "Qwen/Qwen3-4B")
    • --gt_file: Ground truth Parquet file
    • --res_file: Results JSONL file
  • Loads tokenizer and LLM model
  • Configures sampling parameters:
    • Temperature: 0.6
    • Top-p: 0.95
    • Top-k: 20
    • Min-p: 0
    • Max tokens: 32768
  • Loads ground truth from Parquet
  • Loads results from JSONL
  • Runs scoring computation with batch size 8
  • Outputs final step evaluation score
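
The argument handling listed above might look like the following sketch. The flag names and the --model_path default come from the list; marking --gt_file and --res_file as required and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a CLI parser with the three flags named on this page."""
    parser = argparse.ArgumentParser(description="CoT step evaluation (sketch)")
    parser.add_argument("--model_path", default="Qwen/Qwen3-4B",
                        help="Evaluator model")
    # Assumption: both input files are mandatory.
    parser.add_argument("--gt_file", required=True,
                        help="Ground truth Parquet file")
    parser.add_argument("--res_file", required=True,
                        help="Results JSONL file")
    return parser


demo_args = build_parser().parse_args(
    ["--gt_file", "videomathqa_val.parquet", "--res_file", "results_cot.jsonl"]
)
```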

Scoring System

System Prompt

The system prompt defines a detailed rubric with four main criteria:

1. Relative Step Matching (Main Criterion)

  • Count total ground truth steps: N
  • Evaluate how many predicted steps align with ground truth
  • Score = (matching steps / N) × 10, rounded
  • Steps match if they serve same mathematical purpose
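
As a worked illustration of the formula above: in the module this arithmetic is carried out by the evaluator model following the rubric, not by Python code. The helper name step_match_score is hypothetical, and rounding halves upward is an assumption, since the rubric only says "rounded".

```python
def step_match_score(matched, total):
    """Return (matched / total) * 10, rounded half up, as an integer 0-10."""
    if total <= 0:
        return 0
    matched = max(0, min(matched, total))  # clamp to a valid range
    # Integer form of floor(matched / total * 10 + 0.5), avoiding float error.
    return (matched * 20 + total) // (2 * total)
```

For example, 3 of 4 matching steps gives 7.5, which rounds to 8, and 2 of 3 gives about 6.67, which rounds to 7. Python's built-in round() uses round-half-to-even and would turn 2.5 into 2, which is why the sketch avoids it.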

2. Correct Final Answer via Different Reasoning

  • If final answer correct and reasoning valid: full score of 10
  • Ignore step matching if alternative reasoning is sound
  • Reduce score proportionally for flawed observations
  • Reward partially correct reasoning on valid paths

3. Implicit or Inferred Steps

  • Don't penalize skipped early steps if later logic depends on them
  • Credit steps that were likely understood implicitly
  • Check for implied steps before reducing score

4. Ignore Superficial Differences

  • Don't deduct for formatting or notation differences
  • Focus on underlying mathematical meaning
  • Don't require literal step-by-step matching

Output Format

SCORE_CARD: {
    "matched_steps": "X/N",
    "final_answer_correct": 0 or 1,
    "critique": "<2-3 sentence summary>",
    "score": <0-10>
}
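
One way to pull a score card out of a full reply and check it against the format above is sketched below. This is an assumption about how extraction could work, not a description of the source file: the helper names are hypothetical, and the sketch expects the text after the last SCORE_CARD: marker to be valid JSON.

```python
import json
import re

MARKER = "SCORE_CARD:"


def is_valid_card(card):
    """Check the four fields against the documented output format."""
    if not isinstance(card, dict):
        return False
    score = card.get("score")
    return (
        re.fullmatch(r"\d+/\d+", str(card.get("matched_steps", ""))) is not None
        and card.get("final_answer_correct") in (0, 1)
        and isinstance(card.get("critique"), str)
        and isinstance(score, int)
        and 0 <= score <= 10
    )


def extract_score_card(reply):
    """Return the validated card after the last marker, or None."""
    idx = reply.rfind(MARKER)
    if idx == -1:
        return None
    tail = reply[idx + len(MARKER):]
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        card = json.loads(tail[start:end + 1])
    except ValueError:
        return None
    return card if is_valid_card(card) else None


demo_reply = (
    "<think>Steps 1-3 line up; step 4 is missing.</think>\n"
    'SCORE_CARD: {"matched_steps": "3/4", "final_answer_correct": 1, '
    '"critique": "Mostly right.", "score": 8}'
)
demo_card = extract_score_card(demo_reply)
```

Searching from the last marker keeps any mention of SCORE_CARD inside the model's thinking text from being mistaken for the final answer.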

Default Fallback Score

{
    "matched_steps": "0/0",
    "final_answer_correct": 0,
    "critique": "Error",
    "score": 0
}

Sampling Configuration

from vllm import SamplingParams

# Decoding settings for the evaluator model
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0,
    max_tokens=32768
)

max_tokens is set high (32768) so that the evaluator has room for detailed reasoning before it emits the score card.

Output Files

Scored Samples File

Generated filename format:

{original_filename}_step_scored_samples_qwen3_batch_think.jsonl
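
A minimal sketch of deriving that name, assuming {original_filename} means the results file path with its extension removed. The helper name scored_samples_path is hypothetical.

```python
import os

SUFFIX = "_step_scored_samples_qwen3_batch_think.jsonl"


def scored_samples_path(res_file):
    """Strip the extension from res_file and append the scored-samples suffix."""
    base, _ext = os.path.splitext(res_file)
    return base + SUFFIX
```

Under this assumption, results_cot.jsonl maps to results_cot_step_scored_samples_qwen3_batch_think.jsonl.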

Each line contains:

  • Original sample data (qid, category, question, etc.)
  • Score dictionary with matched steps, correctness, critique, score
  • Raw model reply (if parsing failed)
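
The record layout above can be illustrated with a short JSONL round trip. The key names (score_dict, raw_reply) and the sample values are assumptions for this sketch; the actual keys written by the module may differ.

```python
import json
import os
import tempfile


def write_scored_samples(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


demo_records = [
    {   # parsed successfully: no raw reply stored
        "qid": 1,
        "category": "geometry",
        "score_dict": {"matched_steps": "3/4", "final_answer_correct": 1,
                       "critique": "Mostly right.", "score": 8},
    },
    {   # parsing failed: fallback score plus the raw reply
        "qid": 2,
        "category": "algebra",
        "score_dict": {"matched_steps": "0/0", "final_answer_correct": 0,
                       "critique": "Error", "score": 0},
        "raw_reply": "unparseable text",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_scored.jsonl")
    write_scored_samples(demo_path, demo_records)
    demo_loaded = read_jsonl(demo_path)
```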

Design Characteristics

  • Fine-Grained Evaluation: Scores reasoning quality on 0-10 scale
  • Nuanced Rubric: Rewards valid alternative reasoning and implicit steps
  • Batch Processing: Efficient evaluation using batched inference
  • Thinking Mode: Uses model's thinking capability for better evaluation
  • Error Handling: Graceful fallback when parsing fails
  • Detailed Output: Saves comprehensive evaluation metadata
  • Flexible Rubric: Balances strictness with recognition of valid alternatives

Dependencies

  • argparse - Command-line argument parsing
  • ast - ast.literal_eval fallback for response parsing
  • json - JSON operations
  • os - File system operations
  • pandas - Loading Parquet ground truth data
  • tqdm - Progress tracking
  • transformers.AutoTokenizer - Tokenizer loading
  • vllm - LLM inference (LLM, SamplingParams)

Usage Context

This tool provides detailed evaluation of mathematical reasoning in VideoMathQA. Rather than just checking if the final answer is correct, it assesses the quality of the reasoning process, awarding partial credit for correct steps and valid alternative approaches. This is particularly valuable for understanding model capabilities in multi-step mathematical problem solving.

Example Usage

python cot_step_evaluation.py \
    --model_path Qwen/Qwen3-4B \
    --gt_file videomathqa_val.parquet \
    --res_file results_cot.jsonl
