Implementation:Sail sg LongSpec OpenAI Eval Callbacks
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Mathematics |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
A set of callback classes that collect model predictions, clean the extracted answers, and score them on MCQA, math, code, and MathScale benchmarks.
Description
The openai_api_callback.py module provides a hierarchy of evaluation callback classes used in the post-processing pipeline. OpenAICallBack is the base class handling vLLM output parsing, resumable logging, and MCQA evaluation. Specialized subclasses include OpenAIMATHCallBack (MetaMath equivalence with majority voting), DeepSeekMathCallBack (DeepSeek-Math evaluation for GSM8K/MATH), MathScaleCallBack (MathScale equivalence), and SaveOnlyCallBack (logging without evaluation). The module also provides answer cleaning classes: MCQAAnswerClean, SeparatorClean, ReActSeparatorClean, and BinaryAnswerClean.
Usage
Import these callback classes when configuring the post-processing evaluation pipeline for math and QA benchmarks. They are instantiated via Hydra configuration and attached to the evaluator pipeline to collect and score model outputs.
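Hydra usually builds objects from config entries whose `_target_` key names a class by its dotted path. The sketch below is a simplified, standard-library stand-in for that mechanism, not Hydra's own code. The config keys mirror the `OpenAICallBack` signature, while the dotted module path and the output file name are assumptions made for illustration.

```python
import importlib
from typing import Any


def instantiate(cfg: Any) -> Any:
    """Simplified stand-in for Hydra-style `_target_` instantiation."""
    if isinstance(cfg, list):
        return [instantiate(item) for item in cfg]
    if not isinstance(cfg, dict):
        return cfg
    # Build nested entries first so nested `_target_` objects are ready.
    kwargs = {k: instantiate(v) for k, v in cfg.items() if k != "_target_"}
    if "_target_" not in cfg:
        return kwargs
    module_path, _, attr = cfg["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr)
    return target(**kwargs)


# Hypothetical config for the base callback (paths and values are assumptions).
callback_cfg = {
    "_target_": "post_processors.openai_api_callback.OpenAICallBack",
    "output_file": "results/mcqa_eval.json",
    "answer_clean": {
        "_target_": "post_processors.openai_api_callback.MCQAAnswerClean",
        "prompt": "zero-shot",
    },
    "resume": False,
}

# Demonstrated on a standard-library class so the sketch runs without the repo.
demo = instantiate({"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4})
print(demo)  # 3/4
```

With the repository on the import path, passing `callback_cfg` to the same function would construct the answer cleaner first and then the callback, which is the order a nested Hydra config implies.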
Code Reference
Source Location
- Repository: Sail_sg_LongSpec
- File: longspec/train/post_processors/openai_api_callback.py
- Lines: 1-635
Signature
class OpenAICallBack:
    def __init__(
        self,
        output_file: str,
        answer_clean: Union[MCQAAnswerClean, str],
        resume: bool = False,
        index_field: str = "index",
        label_field: str = "label",
        saved_keys: List[str] = None,
    ):
        """Base callback for collecting predictions, cleaning answers, and computing accuracy."""

    def __call__(self, meta_data: Dict[str, Any], batch_model_outputs: Dict[str, Any], **kwargs):
        """Process single batch: parse vLLM output, clean answer, log to file."""

    def get_results(self) -> Tuple[dict, list]:
        """Compute acc, pass@k metrics and save results."""


class OpenAIMATHCallBack(OpenAICallBack):
    """Math evaluation with MetaMath equivalence and majority voting (maj@k)."""


class DeepSeekMathCallBack(OpenAICallBack):
    """DeepSeek-Math evaluation for GSM8K/MATH with custom extraction and eval functions."""


class MathScaleCallBack(OpenAICallBack):
    """MathScale evaluation with inline answer extraction and equivalence checking."""


class MCQAAnswerClean:
    def __init__(self, prompt: str = "zero-shot"):
        """Extract A/B/C/D/E answers from model output."""


class ReActSeparatorClean:
    def __init__(self, separator: str = "Context:", separate_idx: int = 0, regrex: str = "A|B|C|D"):
        """Extract answers from ReAct-style Finish[X] outputs."""
Import
from post_processors.openai_api_callback import (
    OpenAICallBack, OpenAIMATHCallBack, DeepSeekMathCallBack,
    MathScaleCallBack, MCQAAnswerClean, ReActSeparatorClean
)
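The docstrings of `MCQAAnswerClean` and `ReActSeparatorClean` describe two extraction styles: picking a choice letter out of free text, and reading the letter inside a ReAct-style `Finish[X]` action. A minimal sketch of both ideas follows. The function names and regular expressions are hypothetical, and the real classes may use different patterns; their `prompt`, `separator`, and `separate_idx` parameters are not modeled here.

```python
import re


def extract_choice(text: str, choices: str = "A|B|C|D|E") -> str:
    """Return a choice letter from free text, or "" if none is found."""
    # Prefer an explicit "answer is X" / "answer is (X)" statement.
    match = re.search(rf"answer is:?\s*\(?({choices})\)?\b", text)
    if match:
        return match.group(1)
    # Otherwise fall back to the last standalone choice letter.
    letters = re.findall(rf"\b({choices})\b", text)
    return letters[-1] if letters else ""


def extract_react_finish(text: str, choices: str = "A|B|C|D") -> str:
    """Return the letter inside the last Finish[X] action, or "" if absent."""
    matches = re.findall(rf"Finish\[\s*({choices})\s*\]", text)
    return matches[-1] if matches else ""


print(extract_choice("The answer is (B)."))                      # B
print(extract_react_finish("Thought: done.\nAction: Finish[C]"))  # C
```

Returning an empty string on failure, rather than raising, lets a scoring step count the example as incorrect without interrupting the run; whether the module does the same is not stated here.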
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_file | str | Yes | Path to save results JSON |
| answer_clean | MCQAAnswerClean or str | Yes | Answer-cleaning object (or a string, per the signature) |
| meta_data | Dict | Per-call | Contains text, label, index fields |
| batch_model_outputs | Dict | Per-call | Contains "response" (str or vllm.RequestOutput) |
| resume | bool | No | Whether to resume from existing log file |
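The `resume` flag implies that predictions already written to the log can be skipped when a run restarts. The sketch below shows one way such resumable logging could work, assuming a JSON Lines file keyed by the index field. The class name `ResumableLog` and the file layout are assumptions, not the module's actual format.

```python
import json
import os
import tempfile
from typing import Any, Dict, Set


class ResumableLog:
    """Append-only JSON Lines log that skips indices it has already recorded."""

    def __init__(self, path: str, resume: bool = False, index_field: str = "index"):
        self.path = path
        self.index_field = index_field
        self.seen: Set[Any] = set()
        if resume and os.path.exists(path):
            # Reload the indices written by a previous run.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        self.seen.add(json.loads(line)[index_field])
        else:
            # Fresh run: create the file or truncate a stale one.
            open(path, "w", encoding="utf-8").close()

    def add(self, record: Dict[str, Any]) -> bool:
        """Append the record unless its index is already logged; return True if written."""
        idx = record[self.index_field]
        if idx in self.seen:
            return False
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        self.seen.add(idx)
        return True


log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
first = ResumableLog(log_path)
first.add({"index": 0, "pred": "A"})
first.add({"index": 1, "pred": "C"})

# A restarted run reloads the logged indices and skips finished work.
second = ResumableLog(log_path, resume=True)
print(second.add({"index": 1, "pred": "C"}))  # False
print(second.add({"index": 2, "pred": "B"}))  # True
```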
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | dict | Contains acc, pass@k, maj@k (for math), correct, total |
| predictions | list[dict] | All predictions with res, pred, sc_pred, sc_res fields |
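The metric names above can be illustrated with a small sketch, assuming each example has k sampled predictions. Here `acc` scores the first sample, `pass@k` counts an example as solved if any sample matches the label, and `maj@k` scores the most frequent prediction. These definitions, the `score` function, and the plain string comparison (standing in for the math equivalence checks) are assumptions and may differ from what the module computes.

```python
from collections import Counter
from typing import Dict, List


def score(samples: List[List[str]], labels: List[str]) -> Dict[str, float]:
    """Compute first-sample accuracy, empirical pass@k, and maj@k."""
    total = len(labels)
    first_correct = any_correct = majority_correct = 0
    for preds, label in zip(samples, labels):
        first_correct += preds[0] == label
        any_correct += any(p == label for p in preds)
        # Ties resolve to the prediction that appeared first.
        majority = Counter(preds).most_common(1)[0][0]
        majority_correct += majority == label
    return {
        "acc": first_correct / total,
        "pass@k": any_correct / total,
        "maj@k": majority_correct / total,
    }


samples = [["4", "4", "5"], ["7", "8", "8"], ["1", "2", "3"]]
labels = ["4", "7", "3"]
metrics = score(samples, labels)
print(metrics["pass@k"])  # 1.0
```

In this toy data every example has at least one correct sample, so `pass@k` is 1.0, while only the first example is solved by majority vote, which shows how the two metrics can diverge.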
Usage Examples
from post_processors.openai_api_callback import OpenAIMATHCallBack, MCQAAnswerClean

# Math evaluation callback
math_callback = OpenAIMATHCallBack(
    output_file="results/math_eval.json",
    answer_clean=MCQAAnswerClean(prompt="zero-shot"),
    eval_fn="meta_math",
)

# Process a prediction
math_callback(
    meta_data={"text": "What is 2+2?", "label": "4", "index": 0},
    batch_model_outputs={"response": "The answer is 4."},
)

# Get final metrics
metrics, _ = math_callback.get_results()
print(metrics)  # dict with keys such as acc, pass@k, maj@k, correct, total