Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA CoT Postprocess
Source File: `lmms_eval/tasks/videomathqa/cot_postprocess.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The VideoMathQA CoT Postprocess module provides a standalone tool for postprocessing chain-of-thought (CoT) reasoning outputs from VideoMathQA evaluations. It uses a language model (Qwen3) to extract the final answer choice from verbose reasoning outputs, then computes accuracy scores. This addresses cases where models provide correct reasoning but don't format the final answer properly.
Key Functions
Choice Extraction
extract_choice_vllm(llm, sampling_params, tokenizer, model_prompt, mcq=True)- Extracts answer choice from reasoning text using LLM
- Takes reasoning text as input
- Uses appropriate prompt template (MCQ or binary)
- Formats as chat message with instructions
- Applies chat template with
enable_thinking=False - Generates response using vLLM sampling
- Validates response format:
- MCQ: Must match regex
[A-E] - Binary: Must match regex
[A-B]
- MCQ: Must match regex
- Returns validated choice letter or None
Sample Refinement
refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq=True)- Refines all samples in a JSONL file
- Loads samples from input JSONL file
- Iterates through samples with progress bar
- For each sample:
- Constructs input with options and model response
- Attempts to extract choice using LLM
- Falls back to random wrong answer if extraction fails:
- Removes correct answer from options
- Randomly selects from remaining options
- Updates sample response with extracted/fallback choice
- Saves updated samples to output JSONL
- Returns list of updated samples
Postprocessing Pipeline
postprocess_jsonl(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl)- Complete postprocessing and scoring pipeline
- Determines question type from filename (mcq or mbin)
- Refines samples using
refine_samples_vllm - Processes refined samples for scoring:
- Extracts prediction from response
- Cleans using
extract_characters_regex - Computes per-sample score using
videomathqa_process_results
- Aggregates scores using appropriate function:
- MCQ:
videomathqa_mcq_aggregate_results - Binary:
videomathqa_multi_binary_aggregate_results
- MCQ:
- Prints final score
- Saves refined samples to output file
Main Entry Point
main()- Command-line interface for postprocessing
- Parses arguments:
--input_file: Path to input JSONL--output_file: Path for output JSONL--model_path: Model path (default: "Qwen/Qwen3-4B")
- Validates input file exists
- Skips if output file already exists
- Loads tokenizer and LLM model
- Configures sampling parameters:
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Min-p: 0
- Max tokens: 16
- Runs postprocessing pipeline
- Logs completion
- Parses arguments:
Prompt Templates
MCQ Prompt
Given the original multiple-choice options and a model-generated answer
containing reasoning and a final answer, identify the option that best
matches the final answer and return only the corresponding letter
(A, B, C, D, or E).
Binary Prompt
Given the original binary options and a model-generated answer containing
reasoning and a final answer, identify the option that best matches the
final answer and return only the corresponding letter (A or B).
Sampling Configuration
SamplingParams(
temperature=0.7,
top_p=0.8,
top_k=20,
min_p=0,
max_tokens=16
)
Low max_tokens (16) since only extracting single letter.
Design Characteristics
- LLM-Based Extraction: Uses language model to interpret reasoning and extract answer
- Fallback Strategy: Random wrong answer when extraction fails (avoids unfair advantage)
- Format Validation: Regex checks ensure proper answer format
- Type Detection: Automatically determines MCQ vs binary from filename
- Progress Tracking: Uses tqdm for progress bars during processing
- Standalone Tool: Can be run independently on evaluation outputs
- Idempotent: Skips processing if output file already exists
Dependencies
argparse- Command-line argument parsingjson- JSONL file operationsos- File system checksrandom- Fallback answer selectionre- Answer format validationtqdm- Progress barstransformers.AutoTokenizer- Tokenizer loadingvllm- LLM inference (LLM, SamplingParams)videomathqa.utils- Task-specific utilities for scoring
Usage Context
This tool is used as a post-hoc processing step after running VideoMathQA evaluation with chain-of-thought prompting. When models generate extensive reasoning but fail to clearly format the final answer choice, this tool extracts the intended answer using a separate language model, enabling fair scoring of reasoning-focused evaluations.
Example Usage
python cot_postprocess.py \
--input_file results_mcq.jsonl \
--output_file results_mcq_postprocessed.jsonl \
--model_path Qwen/Qwen3-4B