Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA CoT Postprocess

Source File: `lmms_eval/tasks/videomathqa/cot_postprocess.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VideoMathQA CoT Postprocess module provides a standalone tool for postprocessing chain-of-thought (CoT) reasoning outputs from VideoMathQA evaluations. It uses a language model (Qwen3) to extract the final answer choice from verbose reasoning outputs, then computes accuracy scores. This addresses cases where models provide correct reasoning but don't format the final answer properly.

Key Functions

Choice Extraction

extract_choice_vllm(llm, sampling_params, tokenizer, model_prompt, mcq=True)

Extracts answer choice from reasoning text using LLM

Takes reasoning text as input
Uses appropriate prompt template (MCQ or binary)
Formats as chat message with instructions
Applies chat template with enable_thinking=False
Generates response using vLLM sampling
Validates response format:
- MCQ: Must match regex [A-E]
- Binary: Must match regex [A-B]
Returns validated choice letter or None

Sample Refinement

refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq=True)

Refines all samples in a JSONL file

Loads samples from input JSONL file
Iterates through samples with progress bar
For each sample:
- Constructs input with options and model response
- Attempts to extract choice using LLM
- Falls back to random wrong answer if extraction fails:
  - Removes correct answer from options
  - Randomly selects from remaining options
- Updates sample response with extracted/fallback choice
Saves updated samples to output JSONL
Returns list of updated samples

Postprocessing Pipeline

postprocess_jsonl(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl)

Complete postprocessing and scoring pipeline

Determines question type from filename (mcq or mbin)
Refines samples using refine_samples_vllm
Processes refined samples for scoring:
- Extracts prediction from response
- Cleans using extract_characters_regex
- Computes per-sample score using videomathqa_process_results
Aggregates scores using appropriate function:
- MCQ: videomathqa_mcq_aggregate_results
- Binary: videomathqa_multi_binary_aggregate_results
Prints final score
Saves refined samples to output file

Main Entry Point

main()

Command-line interface for postprocessing

Parses arguments:
- --input_file: Path to input JSONL
- --output_file: Path for output JSONL
- --model_path: Model path (default: "Qwen/Qwen3-4B")
Validates input file exists
Skips if output file already exists
Loads tokenizer and LLM model
Configures sampling parameters:
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Min-p: 0
- Max tokens: 16
Runs postprocessing pipeline
Logs completion

Prompt Templates

MCQ Prompt

Given the original multiple-choice options and a model-generated answer
containing reasoning and a final answer, identify the option that best
matches the final answer and return only the corresponding letter
(A, B, C, D, or E).

Binary Prompt

Given the original binary options and a model-generated answer containing
reasoning and a final answer, identify the option that best matches the
final answer and return only the corresponding letter (A or B).

Sampling Configuration

SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0,
    max_tokens=16
)

Low max_tokens (16) since only extracting single letter.

Design Characteristics

LLM-Based Extraction: Uses language model to interpret reasoning and extract answer
Fallback Strategy: Random wrong answer when extraction fails (avoids unfair advantage)
Format Validation: Regex checks ensure proper answer format
Type Detection: Automatically determines MCQ vs binary from filename
Progress Tracking: Uses tqdm for progress bars during processing
Standalone Tool: Can be run independently on evaluation outputs
Idempotent: Skips processing if output file already exists

Dependencies

argparse - Command-line argument parsing
json - JSONL file operations
os - File system checks
random - Fallback answer selection
re - Answer format validation
tqdm - Progress bars
transformers.AutoTokenizer - Tokenizer loading
vllm - LLM inference (LLM, SamplingParams)
videomathqa.utils - Task-specific utilities for scoring

Usage Context

This tool is used as a post-hoc processing step after running VideoMathQA evaluation with chain-of-thought prompting. When models generate extensive reasoning but fail to clearly format the final answer choice, this tool extracts the intended answer using a separate language model, enabling fair scoring of reasoning-focused evaluations.

Example Usage

python cot_postprocess.py \
    --input_file results_mcq.jsonl \
    --output_file results_mcq_postprocessed.jsonl \
    --model_path Qwen/Qwen3-4B

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment