Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA CoT Postprocess

From Leeroopedia

Source File: `lmms_eval/tasks/videomathqa/cot_postprocess.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VideoMathQA CoT Postprocess module provides a standalone tool for postprocessing chain-of-thought (CoT) reasoning outputs from VideoMathQA evaluations. It uses a language model (Qwen3) to extract the final answer choice from verbose reasoning outputs, then computes accuracy scores. This addresses cases where models provide correct reasoning but don't format the final answer properly.

Key Functions

Choice Extraction

extract_choice_vllm(llm, sampling_params, tokenizer, model_prompt, mcq=True)
Extracts answer choice from reasoning text using LLM
  • Takes reasoning text as input
  • Uses appropriate prompt template (MCQ or binary)
  • Formats as chat message with instructions
  • Applies chat template with enable_thinking=False
  • Generates response using vLLM sampling
  • Validates response format:
    • MCQ: Must match regex [A-E]
    • Binary: Must match regex [A-B]
  • Returns validated choice letter or None

Sample Refinement

refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq=True)
Refines all samples in a JSONL file
  • Loads samples from input JSONL file
  • Iterates through samples with progress bar
  • For each sample:
    • Constructs input with options and model response
    • Attempts to extract choice using LLM
    • Falls back to random wrong answer if extraction fails:
      • Removes correct answer from options
      • Randomly selects from remaining options
    • Updates sample response with extracted/fallback choice
  • Saves updated samples to output JSONL
  • Returns list of updated samples

Postprocessing Pipeline

postprocess_jsonl(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl)
Complete postprocessing and scoring pipeline
  • Determines question type from filename (mcq or mbin)
  • Refines samples using refine_samples_vllm
  • Processes refined samples for scoring:
    • Extracts prediction from response
    • Cleans using extract_characters_regex
    • Computes per-sample score using videomathqa_process_results
  • Aggregates scores using appropriate function:
    • MCQ: videomathqa_mcq_aggregate_results
    • Binary: videomathqa_multi_binary_aggregate_results
  • Prints final score
  • Saves refined samples to output file

Main Entry Point

main()
Command-line interface for postprocessing
  • Parses arguments:
    • --input_file: Path to input JSONL
    • --output_file: Path for output JSONL
    • --model_path: Model path (default: "Qwen/Qwen3-4B")
  • Validates input file exists
  • Skips if output file already exists
  • Loads tokenizer and LLM model
  • Configures sampling parameters:
    • Temperature: 0.7
    • Top-p: 0.8
    • Top-k: 20
    • Min-p: 0
    • Max tokens: 16
  • Runs postprocessing pipeline
  • Logs completion

Prompt Templates

MCQ Prompt

Given the original multiple-choice options and a model-generated answer
containing reasoning and a final answer, identify the option that best
matches the final answer and return only the corresponding letter
(A, B, C, D, or E).

Binary Prompt

Given the original binary options and a model-generated answer containing
reasoning and a final answer, identify the option that best matches the
final answer and return only the corresponding letter (A or B).

Sampling Configuration

SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0,
    max_tokens=16
)

Low max_tokens (16) since only extracting single letter.

Design Characteristics

  • LLM-Based Extraction: Uses language model to interpret reasoning and extract answer
  • Fallback Strategy: Random wrong answer when extraction fails (avoids unfair advantage)
  • Format Validation: Regex checks ensure proper answer format
  • Type Detection: Automatically determines MCQ vs binary from filename
  • Progress Tracking: Uses tqdm for progress bars during processing
  • Standalone Tool: Can be run independently on evaluation outputs
  • Idempotent: Skips processing if output file already exists

Dependencies

  • argparse - Command-line argument parsing
  • json - JSONL file operations
  • os - File system checks
  • random - Fallback answer selection
  • re - Answer format validation
  • tqdm - Progress bars
  • transformers.AutoTokenizer - Tokenizer loading
  • vllm - LLM inference (LLM, SamplingParams)
  • videomathqa.utils - Task-specific utilities for scoring

Usage Context

This tool is used as a post-hoc processing step after running VideoMathQA evaluation with chain-of-thought prompting. When models generate extensive reasoning but fail to clearly format the final answer choice, this tool extracts the intended answer using a separate language model, enabling fair scoring of reasoning-focused evaluations.

Example Usage

python cot_postprocess.py \
    --input_file results_mcq.jsonl \
    --output_file results_mcq_postprocessed.jsonl \
    --model_path Qwen/Qwen3-4B

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment