Implementation:Sail sg LongSpec OpenAI Eval Callbacks

From Leeroopedia
Knowledge Sources
Domains: NLP, Evaluation, Mathematics
Last Updated: 2026-02-14 05:00 GMT

Overview

A set of callback classes that collect, evaluate, and score model predictions for MCQA, math, code, and MathScale benchmarks.

Description

The openai_api_callback.py module provides a hierarchy of evaluation callback classes used in the post-processing pipeline. OpenAICallBack is the base class handling vLLM output parsing, resumable logging, and MCQA evaluation. Specialized subclasses include OpenAIMATHCallBack (MetaMath equivalence with majority voting), DeepSeekMathCallBack (DeepSeek-Math evaluation for GSM8K/MATH), MathScaleCallBack (MathScale equivalence), and SaveOnlyCallBack (logging without evaluation). The module also provides answer cleaning classes: MCQAAnswerClean, SeparatorClean, ReActSeparatorClean, and BinaryAnswerClean.

Usage

Import these callback classes when configuring the post-processing evaluation pipeline for math and QA benchmarks. They are instantiated via Hydra configuration and attached to the evaluator pipeline to collect and score model outputs.
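Hydra's instantiate utility builds objects from config nodes whose `_target_` key names the class to construct. The sketch below imitates that convention with a plain dict and a small stand-in helper, so it runs without Hydra or the repository installed. `build_from_config` is a hypothetical helper, the config values are illustrative, and `types.SimpleNamespace` stands in for a real callback class; none of this is taken from the repository's actual configuration files.

```python
import importlib
from typing import Any, Dict


def build_from_config(cfg: Dict[str, Any]) -> Any:
    """Mimic the `_target_` convention: import the class, pass the other keys as kwargs."""
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    module_name, _, class_name = cfg["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)


# Assumption: in the real pipeline the target would be a dotted path such as
# "post_processors.openai_api_callback.OpenAICallBack". A standard-library
# class is used here so the sketch runs on its own.
config = {
    "_target_": "types.SimpleNamespace",
    "output_file": "results/eval.json",
    "resume": True,
}

callback = build_from_config(config)
print(callback.output_file, callback.resume)
```

Keeping the class path in configuration means a different callback can be swapped in per benchmark without touching the evaluator code.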

Code Reference

Source Location

Signature

from typing import Any, Dict, List, Tuple, Union

class OpenAICallBack:
    def __init__(
        self,
        output_file: str,
        answer_clean: Union[MCQAAnswerClean, str],
        resume: bool = False,
        index_field: str = "index",
        label_field: str = "label",
        saved_keys: List[str] = None,
    ):
        """Base callback for collecting predictions, cleaning answers, and computing accuracy."""

    def __call__(self, meta_data: Dict[str, Any], batch_model_outputs: Dict[str, Any], **kwargs):
        """Process single batch: parse vLLM output, clean answer, log to file."""

    def get_results(self) -> Tuple[dict, list]:
        """Compute acc, pass@k metrics and save results."""

class OpenAIMATHCallBack(OpenAICallBack):
    """Math evaluation with MetaMath equivalence and majority voting (maj@k)."""

class DeepSeekMathCallBack(OpenAICallBack):
    """DeepSeek-Math evaluation for GSM8K/MATH with custom extraction and eval functions."""

class MathScaleCallBack(OpenAICallBack):
    """MathScale evaluation with inline answer extraction and equivalence checking."""

class MCQAAnswerClean:
    def __init__(self, prompt: str = "zero-shot"):
        """Extract A/B/C/D/E answers from model output."""

class ReActSeparatorClean:
    def __init__(self, separator: str = "Context:", separate_idx: int = 0, regrex: str = "A|B|C|D"):
        """Extract answers from ReAct-style Finish[X] outputs."""

Import

from post_processors.openai_api_callback import (
    OpenAICallBack, OpenAIMATHCallBack, DeepSeekMathCallBack,
    MathScaleCallBack, MCQAAnswerClean, ReActSeparatorClean
)
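The answer cleaning classes reduce free-form model output to a short answer token before scoring. The module's actual regular expressions are not shown on this page, so the following is only a minimal sketch of the two behaviours described in the docstrings (option-letter extraction and ReAct-style `Finish[X]` extraction). The function names are hypothetical; the `pattern` argument plays the role of the `regrex` parameter in the signature.

```python
import re
from typing import Optional


def extract_choice(text: str, pattern: str = "A|B|C|D|E") -> Optional[str]:
    """Return the last standalone option letter found in the text, or None."""
    matches = re.findall(rf"\b({pattern})\b", text)
    return matches[-1] if matches else None


def extract_react_finish(
    text: str,
    separator: str = "Context:",
    separate_idx: int = 0,
    pattern: str = "A|B|C|D",
) -> Optional[str]:
    """Keep one segment of the output split on `separator`, then read Finish[X] from it."""
    segment = text.split(separator)[separate_idx]
    match = re.search(rf"Finish\[({pattern})\]", segment)
    return match.group(1) if match else None


mcqa_answer = extract_choice("A careful reading suggests the answer is C.")
react_answer = extract_react_finish(
    "Thought: option B fits. Action: Finish[B] Context: next question Finish[C]"
)
print(mcqa_answer, react_answer)
```

Taking the last match in `extract_choice` is a design choice of this sketch: models often mention several options before committing to one at the end.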

I/O Contract

Inputs

Name | Type | Required | Description
output_file | str | Yes | Path to save results JSON
answer_clean | Callable | Yes | Answer cleaning function or class
meta_data | Dict | Per-call | Contains text, label, index fields
batch_model_outputs | Dict | Per-call | Contains "response" (str or vllm.RequestOutput)
resume | bool | No | Whether to resume from existing log file
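To illustrate what the resume option implies, here is a minimal sketch of resumable logging: records already present in the log are skipped on a second run. The `ResumableLog` class and the JSON Lines file layout are assumptions made for illustration; the module's real on-disk format is not documented on this page.

```python
import json
import os
import tempfile
from typing import Any, Dict, Set


class ResumableLog:
    """Hypothetical append-only log keyed by an index field."""

    def __init__(self, path: str, resume: bool = False, index_field: str = "index"):
        self.path = path
        self.index_field = index_field
        self.seen: Set[Any] = set()
        if resume and os.path.exists(path):
            # Resuming: remember which indices were already logged.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        self.seen.add(json.loads(line)[index_field])
        else:
            # Fresh run: start from an empty file.
            open(path, "w", encoding="utf-8").close()

    def write(self, record: Dict[str, Any]) -> bool:
        """Append the record unless its index is already logged; return True if written."""
        key = record[self.index_field]
        if key in self.seen:
            return False
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        self.seen.add(key)
        return True


log_path = os.path.join(tempfile.mkdtemp(), "eval.jsonl")
first_run = ResumableLog(log_path)
first_run.write({"index": 0, "pred": "4"})

second_run = ResumableLog(log_path, resume=True)
skipped = not second_run.write({"index": 0, "pred": "4"})  # already logged
added = second_run.write({"index": 1, "pred": "7"})        # new item
```

Appending one record per line keeps partial results usable if a long evaluation run is interrupted.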

Outputs

Name | Type | Description
metrics | dict | Contains acc, pass@k, maj@k (for math), correct, total
predictions | list[dict] | All predictions with res, pred, sc_pred, sc_res fields
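The metric names in the table can be made concrete with a small sketch. It assumes each item carries k sampled answers in `pred` plus a `label`, and uses these definitions: acc scores the first sample only, pass@k counts an item as solved if any sample is correct, and maj@k scores the most frequent answer. These definitions, the `score` function, and the use of plain string equality are assumptions; the real callbacks rely on benchmark-specific equivalence checks and may aggregate differently.

```python
from collections import Counter
from typing import Any, Dict, List


def score(predictions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate acc, pass@k and maj@k over items with k sampled answers each."""
    total = len(predictions)
    if total == 0:
        return {"acc": 0.0, "pass@k": 0.0, "maj@k": 0.0, "correct": 0, "total": 0}

    first_correct = any_correct = majority_correct = 0
    for item in predictions:
        preds, label = item["pred"], item["label"]
        first_correct += preds[0] == label
        any_correct += any(p == label for p in preds)
        # Majority vote: the most frequent answer among the k samples.
        majority = Counter(preds).most_common(1)[0][0]
        majority_correct += majority == label

    return {
        "acc": first_correct / total,
        "pass@k": any_correct / total,
        "maj@k": majority_correct / total,
        "correct": first_correct,
        "total": total,
    }


samples = [
    {"label": "4", "pred": ["4", "4", "5"]},
    {"label": "7", "pred": ["6", "7", "6"]},
]
metrics = score(samples)
print(metrics)
```

In the second item one sample is correct but the majority answer is not, which is why pass@k and maj@k diverge.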

Usage Examples

from post_processors.openai_api_callback import OpenAIMATHCallBack, MCQAAnswerClean

# Math evaluation callback
math_callback = OpenAIMATHCallBack(
    output_file="results/math_eval.json",
    answer_clean=MCQAAnswerClean(prompt="zero-shot"),
    eval_fn="meta_math",
)

# Process a prediction
math_callback(
    meta_data={"text": "What is 2+2?", "label": "4", "index": 0},
    batch_model_outputs={"response": "The answer is 4."}
)

# Get final metrics
metrics, _ = math_callback.get_results()
print(metrics)  # dict with keys such as acc, pass@k, maj@k, correct, total
