Implementation:Sail sg LongSpec OpenAI Eval Callbacks

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Evaluation, Mathematics
Last Updated	2026-02-14 05:00 GMT

Overview

Concrete tool for collecting, evaluating, and scoring model predictions via callback classes supporting MCQA, math, code, and MathScale benchmarks.

Description

The openai_api_callback.py module provides a hierarchy of evaluation callback classes used in the post-processing pipeline. OpenAICallBack is the base class handling vLLM output parsing, resumable logging, and MCQA evaluation. Specialized subclasses include OpenAIMATHCallBack (MetaMath equivalence with majority voting), DeepSeekMathCallBack (DeepSeek-Math evaluation for GSM8K/MATH), MathScaleCallBack (MathScale equivalence), and SaveOnlyCallBack (logging without evaluation). The module also provides answer cleaning classes: MCQAAnswerClean, SeparatorClean, ReActSeparatorClean, and BinaryAnswerClean.
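The cleaning classes are described here only by name and docstring, so the following is a rough sketch of what option-letter and ReAct-style extraction could look like. The function names and regular expressions are hypothetical and are not taken from the module; only the defaults (`"Context:"`, index `0`, `"A|B|C|D"`) mirror the `ReActSeparatorClean` signature, and the way the separator is applied is an assumption.

```python
import re
from typing import Optional


def extract_option(text: str, options: str = "A|B|C|D|E") -> Optional[str]:
    """Return the last standalone option letter found in `text`, or None."""
    matches = re.findall(rf"\b({options})\b", text)
    return matches[-1] if matches else None


def extract_react_finish(
    text: str,
    separator: str = "Context:",
    separate_idx: int = 0,
    options: str = "A|B|C|D",
) -> Optional[str]:
    """Return X from the first `Finish[X]` in the chosen segment, or None.

    Assumption: the separator splits the output and `separate_idx` selects
    which segment is searched.
    """
    segment = text.split(separator)[separate_idx]
    match = re.search(rf"Finish\[\s*({options})\s*\]", segment)
    return match.group(1) if match else None
```

A word-boundary match like this will also pick up a capital "A" used as an article, so a real cleaner would likely need stricter patterns tied to the prompt format.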

Usage

Import these callback classes when configuring the post-processing evaluation pipeline for math and QA benchmarks. They are instantiated via Hydra configuration and attached to the evaluator pipeline to collect and score model outputs.
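As an illustration of configuration-driven instantiation, the sketch below mimics the `_target_` convention that Hydra configs use, without importing Hydra itself. The helper `instantiate_from_target` and the config values are hypothetical, and a standard-library class stands in for the callback so the snippet runs on its own; in the real pipeline the target would presumably be a dotted path to one of the callback classes.

```python
import importlib
from typing import Any, Dict


def instantiate_from_target(cfg: Dict[str, Any]) -> Any:
    """Build an object from a dict whose `_target_` key is a dotted import path."""
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    module_path, _, attr = cfg["_target_"].rpartition(".")
    factory = getattr(importlib.import_module(module_path), attr)
    return factory(**kwargs)


# Hypothetical config. types.SimpleNamespace is only a stand-in for a
# callback class; it accepts arbitrary keyword arguments.
cfg = {
    "_target_": "types.SimpleNamespace",
    "output_file": "results/eval.json",
    "resume": False,
}
callback_like = instantiate_from_target(cfg)
```

Keeping the class path in configuration means a different callback can be selected per benchmark without touching the evaluation code.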

Code Reference

Source Location

Repository: Sail_sg_LongSpec
File: longspec/train/post_processors/openai_api_callback.py
Lines: 1-635

Signature

class OpenAICallBack:
    def __init__(
        self,
        output_file: str,
        answer_clean: Union[MCQAAnswerClean, str],
        resume: bool = False,
        index_field: str = "index",
        label_field: str = "label",
        saved_keys: List[str] = None,
    ):
        """Base callback for collecting predictions, cleaning answers, and computing accuracy."""

    def __call__(self, meta_data: Dict[str, Any], batch_model_outputs: Dict[str, Any], **kwargs):
        """Process single batch: parse vLLM output, clean answer, log to file."""

    def get_results(self) -> Tuple[dict, list]:
        """Compute acc, pass@k metrics and save results."""

class OpenAIMATHCallBack(OpenAICallBack):
    """Math evaluation with MetaMath equivalence and majority voting (maj@k)."""

class DeepSeekMathCallBack(OpenAICallBack):
    """DeepSeek-Math evaluation for GSM8K/MATH with custom extraction and eval functions."""

class MathScaleCallBack(OpenAICallBack):
    """MathScale evaluation with inline answer extraction and equivalence checking."""

class MCQAAnswerClean:
    def __init__(self, prompt: str = "zero-shot"):
        """Extract A/B/C/D/E answers from model output."""

class ReActSeparatorClean:
    def __init__(self, separator: str = "Context:", separate_idx: int = 0, regrex: str = "A|B|C|D"):
        """Extract answers from ReAct-style Finish[X] outputs."""

Import

from post_processors.openai_api_callback import (
    OpenAICallBack, OpenAIMATHCallBack, DeepSeekMathCallBack,
    MathScaleCallBack, MCQAAnswerClean, ReActSeparatorClean
)

I/O Contract

Inputs

Name	Type	Required	Description
output_file	str	Yes	Path to save results JSON
answer_clean	Union[MCQAAnswerClean, str]	Yes	Answer cleaner: an MCQAAnswerClean instance or a str, per the signature
meta_data	Dict	Per-call	Contains text, label, index fields
batch_model_outputs	Dict	Per-call	Contains "response" (str or vllm.RequestOutput)
resume	bool	No	Whether to resume from existing log file
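The `resume` flag implies that previously logged predictions are reloaded so that finished examples are not processed twice. The sketch below shows one way this could work, assuming a JSON Lines log with one record per line; the class `ResumableLog` and the log format are assumptions for illustration and may differ from what the module writes.

```python
import json
import os
import tempfile
from typing import Any, Dict, Set


class ResumableLog:
    """Append-only JSON Lines log that skips already-processed indices."""

    def __init__(self, path: str, resume: bool = False, index_field: str = "index"):
        self.path = path
        self.index_field = index_field
        self.seen: Set[Any] = set()
        if resume and os.path.exists(path):
            # Reload the indices that an earlier run already logged.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        self.seen.add(json.loads(line)[index_field])
        else:
            # Start fresh: create or truncate the log file.
            open(path, "w", encoding="utf-8").close()

    def write(self, record: Dict[str, Any]) -> bool:
        """Append `record` unless its index was logged before; return True if written."""
        idx = record[self.index_field]
        if idx in self.seen:
            return False
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        self.seen.add(idx)
        return True


# Usage: the second run resumes and skips index 0.
log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
first_run = ResumableLog(log_path)
first_run.write({"index": 0, "pred": "4"})
second_run = ResumableLog(log_path, resume=True)
wrote_duplicate = second_run.write({"index": 0, "pred": "4"})
wrote_new = second_run.write({"index": 1, "pred": "7"})
```

Writing one record per line keeps the log readable even if a run is interrupted partway through.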

Outputs

Name	Type	Description
metrics	dict	Contains acc, pass@k, maj@k (for math), correct, total
predictions	list[dict]	All predictions with res, pred, sc_pred, sc_res fields
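To make the pass@k and maj@k entries concrete, here is a minimal sketch of how such metrics can be computed from k sampled predictions per example. It uses exact string matching in place of the MetaMath or MathScale equivalence checks, counts pass@k as "any of the k samples is correct", and uses hypothetical field names (`label`, `preds`) rather than the module's `pred`, `sc_pred`, `res`, and `sc_res` fields.

```python
from collections import Counter
from typing import Any, Dict, List


def score(items: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute pass@k and maj@k with exact-match comparison.

    Each item is assumed to look like {"label": "4", "preds": ["4", "5", "4"]}.
    """
    total = len(items)
    if total == 0:
        return {"pass@k": 0.0, "maj@k": 0.0, "total": 0}
    any_correct = 0
    majority_correct = 0
    for item in items:
        preds, label = item["preds"], item["label"]
        if any(p == label for p in preds):
            any_correct += 1
        # Majority vote; ties go to the answer that appeared first.
        voted = Counter(preds).most_common(1)[0][0] if preds else None
        if voted == label:
            majority_correct += 1
    return {
        "pass@k": any_correct / total,
        "maj@k": majority_correct / total,
        "total": total,
    }


example_items = [
    {"label": "4", "preds": ["4", "5", "4"]},
    {"label": "7", "preds": ["6", "7", "6"]},
]
example_metrics = score(example_items)
```

In the second example the correct answer appears among the samples but loses the vote, which is why pass@k and maj@k can diverge.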

Usage Examples

from post_processors.openai_api_callback import OpenAIMATHCallBack, MCQAAnswerClean

# Math evaluation callback
math_callback = OpenAIMATHCallBack(
    output_file="results/math_eval.json",
    answer_clean=MCQAAnswerClean(prompt="zero-shot"),
    eval_fn="meta_math",  # subclass-specific argument; not part of the base signature above
)

# Process a prediction
math_callback(
    meta_data={"text": "What is 2+2?", "label": "4", "index": 0},
    batch_model_outputs={"response": "The answer is 4."}
)

# Get final metrics and the list of logged predictions
metrics, predictions = math_callback.get_results()
print(metrics)  # dict with keys such as acc, pass@k, maj@k, correct, total

Related Pages

Environment:Sail_sg_LongSpec_Training_Environment
