Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA CoT Step Evaluation
Source File: `lmms_eval/tasks/videomathqa/cot_step_evaluation.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The VideoMathQA CoT Step Evaluation module provides fine-grained evaluation of chain-of-thought reasoning for math problem solving. It uses a language model (Qwen3) as a judge to assess how well model-generated reasoning steps match the ground truth solution steps, assigning a score from 0 to 10 based on a detailed rubric. This enables evaluation that goes beyond final-answer correctness.
Key Functions
Data Preparation
prepare_input(sample, matched) - Prepares input data for evaluation
- Extracts question metadata from ground truth
- Extracts prediction from result data
- Creates unified input dictionary with:
- Question ID, category, question text
- Multiple-choice options
- Ground truth answer and solution steps
- Model prediction
- Returns prepared input dictionary
prepare_batch_prompts(batch) - Prepares prompts for batch processing
- Iterates through batch of samples
- Generates user prompt for each using get_user_prompt
- Returns list of formatted prompts
get_user_prompt(question, options, gt_steps, gt_answer, prediction) - Constructs evaluation prompt
- Includes system prompt with scoring rubric
- Provides question, options, and ground truth
- Includes model prediction to evaluate
- Specifies required output format
- Returns formatted prompt string
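The prompt construction described above can be sketched roughly as follows. This is an illustration only: the dictionary keys (`question`, `options`, `gt_steps`, `gt_answer`, `prediction`), the shortened `SYSTEM_PROMPT` text, and the prompt layout are assumptions made for this sketch, not the module's actual field names or wording.

```python
from typing import Dict, List

# Hypothetical stand-in; the real module embeds a much longer rubric.
SYSTEM_PROMPT = "You are a strict grader. Score the predicted reasoning from 0 to 10."


def get_user_prompt(question: str, options: List[str], gt_steps: List[str],
                    gt_answer: str, prediction: str) -> str:
    """Assemble one grading prompt from the question, ground truth and prediction."""
    # Number the ground truth steps so the judge can count them (N in the rubric).
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(gt_steps, start=1))
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Question: {question}\n"
        f"Options: {'; '.join(options)}\n"
        f"Ground truth answer: {gt_answer}\n"
        f"Ground truth steps:\n{steps}\n\n"
        f"Model prediction:\n{prediction}\n\n"
        "Reply with SCORE_CARD: followed by a JSON object."
    )


def prepare_batch_prompts(batch: List[Dict]) -> List[str]:
    """Build one prompt per prepared sample (keys are assumed, see lead-in)."""
    return [
        get_user_prompt(s["question"], s["options"], s["gt_steps"],
                        s["gt_answer"], s["prediction"])
        for s in batch
    ]
```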
Response Parsing
safe_parse_response(reply) - Parses model response into structured format
- First attempts JSON parsing
- Falls back to ast.literal_eval if JSON fails
- Returns parsed dictionary or None on failure
- Handles parsing errors gracefully
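A minimal sketch of the JSON-then-`ast.literal_eval` fallback is shown below. It follows the behaviour listed above, but the exact exception handling and the final `isinstance` check (which rejects lists or scalars) are assumptions of this sketch rather than confirmed details of the module.

```python
import ast
import json
from typing import Optional


def safe_parse_response(reply: str) -> Optional[dict]:
    """Parse a score card string, trying strict JSON before Python literal syntax."""
    try:
        parsed = json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        try:
            # Handles replies that use single quotes or other Python-literal syntax.
            parsed = ast.literal_eval(reply)
        except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
            return None
    # Assumption: only a dictionary counts as a successfully parsed score card.
    return parsed if isinstance(parsed, dict) else None
```

The fallback matters because judge models often emit dictionaries with single-quoted keys, which are valid Python literals but not valid JSON.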
Scoring Pipeline
compute_score(gt_data, res_data, res_file, tokenizer, llm, sampling_params, bs=64) - Main scoring computation function
- Processes samples in batches (the signature default is bs=64; main() passes a batch size of 8)
- For each sample:
- Finds matching result by question ID
- Prepares input data
- Accumulates into batch
- When batch full:
- Generates batch prompts
- Formats as chat messages with thinking enabled
- Applies chat templates
- Generates responses using vLLM
- Parses responses to extract score dictionaries
- Handles errors with fallback score (0)
- Collects scored samples with metadata
- Saves detailed results to JSONL file
- Computes mean score across all samples
- Returns aggregated score
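The batching, fallback, and averaging logic above can be sketched without the vLLM dependency by passing the grading step in as a callable. The names `score_in_batches`, `grade_batch`, and the `score_card` key are hypothetical and chosen for this illustration; in the real module the grading step applies the chat template and calls vLLM.

```python
from typing import Callable, Dict, List, Optional, Tuple

FALLBACK = {"matched_steps": "0/0", "final_answer_correct": 0,
            "critique": "Error", "score": 0}


def score_in_batches(
    samples: List[Dict],
    grade_batch: Callable[[List[Dict]], List[Optional[dict]]],
    bs: int = 8,
) -> Tuple[float, List[Dict]]:
    """Grade samples batch by batch and return (mean score, scored samples)."""
    scored: List[Dict] = []
    # Slicing also covers the final, possibly smaller, batch.
    for start in range(0, len(samples), bs):
        batch = samples[start:start + bs]
        cards = grade_batch(batch)  # hypothetical stand-in for prompt + LLM + parse
        for sample, card in zip(batch, cards):
            if not isinstance(card, dict) or "score" not in card:
                card = dict(FALLBACK)  # unparseable replies count as a score of 0
            scored.append({**sample, "score_card": card})
    mean = sum(s["score_card"]["score"] for s in scored) / len(scored) if scored else 0.0
    return mean, scored
```

Because failed parses fall back to a score of 0, they pull the mean down rather than being excluded from it.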
Main Entry Point
main() - Command-line interface for step evaluation
- Parses arguments:
- --model_path: Evaluator model (default: "Qwen/Qwen3-4B")
- --gt_file: Ground truth Parquet file
- --res_file: Results JSONL file
- Loads tokenizer and LLM model
- Configures sampling parameters:
- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Min-p: 0
- Max tokens: 32768
- Loads ground truth from Parquet
- Loads results from JSONL
- Runs scoring computation with batch size 8
- Outputs final step evaluation score
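The argument parsing step might look roughly like the sketch below. The three flag names and the default model come from the list above; the help strings, the `build_parser` helper, and marking the two file arguments as required are assumptions of this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags described above."""
    parser = argparse.ArgumentParser(description="Score CoT reasoning steps with an LLM judge")
    parser.add_argument("--model_path", default="Qwen/Qwen3-4B", help="Evaluator model")
    # Assumption: the sketch treats both file arguments as required.
    parser.add_argument("--gt_file", required=True, help="Ground truth Parquet file")
    parser.add_argument("--res_file", required=True, help="Results JSONL file")
    return parser
```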
Scoring System
System Prompt
The system prompt defines a detailed rubric with four main criteria:
1. Relative Step Matching (Main Criterion)
- Count total ground truth steps: N
- Evaluate how many predicted steps align with ground truth
- Score = (matching steps / N) × 10, rounded
- Steps match if they serve same mathematical purpose
2. Correct Final Answer via Different Reasoning
- If final answer correct and reasoning valid: full score of 10
- Ignore step matching if alternative reasoning is sound
- Reduce score proportionally for flawed observations
- Reward partially correct reasoning on valid paths
3. Implicit or Inferred Steps
- Don't penalize skipped early steps if later logic depends on them
- Credit steps that were likely understood implicitly
- Check for implied steps before reducing score
4. Ignore Superficial Differences
- Don't deduct for formatting or notation differences
- Focus on underlying mathematical meaning
- Don't require literal step-by-step matching
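As a worked illustration of the main criterion only, the relative step matching formula can be written as a small function. In the module this arithmetic is carried out by the judge model following the rubric, not by code; the function name and the round-half-up convention are assumptions here, since the rubric only says "rounded".

```python
def step_matching_score(matched: int, total: int) -> int:
    """Return (matched / total) * 10, rounded half up, as an integer in 0..10."""
    if total <= 0:
        raise ValueError("total ground truth steps must be positive")
    matched = max(0, min(matched, total))  # clamp to the valid range
    # Integer arithmetic for round-half-up, avoiding floating point edge cases.
    return (20 * matched + total) // (2 * total)
```

For example, 3 matching steps out of 4 gives 7.5, which this sketch rounds to 8. Criteria 2 to 4 then adjust the judge's score, for instance raising it to 10 when the final answer is correct and reached through sound alternative reasoning.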
Output Format
SCORE_CARD: {
"matched_steps": "X/N",
"final_answer_correct": 0 or 1,
"critique": "<2-3 sentence summary>",
"score": <0-10>
}
Default Fallback Score
{
"matched_steps": "0/0",
"final_answer_correct": 0,
"critique": "Error",
"score": 0
}
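One way to pull the score card out of a long reply (thinking mode produces reasoning text before the answer) and apply the fallback is sketched below. The marker-based extraction and the helper name `extract_score_card` are assumptions made for illustration; the source does not state how the module locates the score card inside the reply.

```python
import json

FALLBACK_SCORE = {"matched_steps": "0/0", "final_answer_correct": 0,
                  "critique": "Error", "score": 0}


def extract_score_card(reply: str) -> dict:
    """Return the dict that follows the last 'SCORE_CARD:' marker, or the fallback."""
    marker = "SCORE_CARD:"
    idx = reply.rfind(marker)  # use the last marker in case the reasoning quotes it
    if idx == -1:
        return dict(FALLBACK_SCORE)
    tail = reply[idx + len(marker):]
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK_SCORE)
    try:
        card = json.loads(tail[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK_SCORE)
    return card if isinstance(card, dict) else dict(FALLBACK_SCORE)
```

Returning a copy of the fallback dictionary keeps one sample's record from being mutated through another's.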
Sampling Configuration
SamplingParams(
temperature=0.6,
top_p=0.95,
top_k=20,
min_p=0,
max_tokens=32768
)
The high max_tokens limit (32768) leaves room for the evaluator's extended reasoning before it emits the score card.
Output Files
Scored Samples File
Generated filename format:
{original_filename}_step_scored_samples_qwen3_batch_think.jsonl
Each line contains:
- Original sample data (qid, category, question, etc.)
- Score dictionary with matched steps, correctness, critique, score
- Raw model reply (if parsing failed)
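Deriving the output path and writing one JSON object per line could be sketched as follows. Whether the module strips the `.jsonl` extension from the results filename before appending the suffix is an assumption of this sketch, as are the helper names.

```python
import json
import os
from typing import Dict, Iterable

SUFFIX = "_step_scored_samples_qwen3_batch_think.jsonl"


def scored_samples_path(res_file: str) -> str:
    """Append the scored-samples suffix to the results filename (extension dropped)."""
    base, _ext = os.path.splitext(res_file)
    return base + SUFFIX


def write_jsonl(path: str, records: Iterable[Dict]) -> None:
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```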
Design Characteristics
- Fine-Grained Evaluation: Scores reasoning quality on 0-10 scale
- Nuanced Rubric: Rewards valid alternative reasoning and implicit steps
- Batch Processing: Efficient evaluation using batched inference
- Thinking Mode: Uses model's thinking capability for better evaluation
- Error Handling: Graceful fallback when parsing fails
- Detailed Output: Saves comprehensive evaluation metadata
- Flexible Rubric: Balances strictness with recognition of valid alternatives
Dependencies
- argparse - Command-line argument parsing
- ast - Abstract syntax tree parsing for fallback
- json - JSON operations
- os - File system operations
- pandas - Loading Parquet ground truth data
- tqdm - Progress tracking
- transformers.AutoTokenizer - Tokenizer loading
- vllm - LLM inference (LLM, SamplingParams)
Usage Context
This tool provides detailed evaluation of mathematical reasoning in VideoMathQA. Rather than just checking if the final answer is correct, it assesses the quality of the reasoning process, awarding partial credit for correct steps and valid alternative approaches. This is particularly valuable for understanding model capabilities in multi-step mathematical problem solving.
Example Usage
python cot_step_evaluation.py \
--model_path Qwen/Qwen3-4B \
--gt_file videomathqa_val.parquet \
--res_file results_cot.jsonl