Principle:Sail sg LongSpec Evaluation Callback Pipeline

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Evaluation, Software_Architecture
Last Updated	2026-02-14 05:00 GMT

Overview

Architectural principle for structuring model evaluation as a callback pipeline where prediction collection, answer cleaning, equivalence checking, and metrics aggregation are composed through pluggable callback classes.

Description

The Evaluation Callback Pipeline pattern structures the evaluation workflow as a small class hierarchy in the style of the template method pattern: a base callback class handles vLLM output parsing and prediction logging, while specialized subclasses override the evaluation logic for different benchmarks. This allows the same infrastructure (resumable logging, vLLM output handling, pass@k computation) to be reused across MCQA, math, code, and MathScale benchmarks. The pipeline supports majority voting (maj@k) for self-consistency evaluation and per-difficulty metrics for code benchmarks. Answer cleaning is separated from evaluation via pluggable cleaner classes, so the extraction rule can change without touching the scoring code.
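As an illustration of how resumable logging and a pluggable cleaner can fit together, here is a minimal Python sketch. It is not taken from the LongSpec code base: the class names `LastNumberCleaner` and `LoggingCallback`, the record keys `id`, `response`, and `pred`, and the choice to pass an already-parsed response string are all assumptions made for the example.

```python
import json
import re
from pathlib import Path


class LastNumberCleaner:
    """Hypothetical cleaner: keeps the last number found in a response."""

    def __call__(self, response: str) -> str:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return numbers[-1] if numbers else ""


class LoggingCallback:
    """Illustrative base callback with a pluggable cleaner and a JSONL log."""

    def __init__(self, log_path, answer_clean):
        self.log_path = Path(log_path)
        self.answer_clean = answer_clean  # strategy object, swapped per benchmark
        self.done_ids = set()
        # Resume: collect the ids that an earlier (possibly crashed) run logged.
        if self.log_path.exists():
            with self.log_path.open() as f:
                for line in f:
                    try:
                        self.done_ids.add(json.loads(line)["id"])
                    except (json.JSONDecodeError, KeyError):
                        continue  # skip blank or truncated lines

    def __call__(self, meta_data: dict, response: str) -> None:
        if meta_data["id"] in self.done_ids:
            return  # already logged before the restart
        record = {
            "id": meta_data["id"],
            "response": response,
            "pred": self.answer_clean(response),
        }
        # Append mode keeps earlier records intact; one JSON object per line.
        with self.log_path.open("a") as f:
            f.write(json.dumps(record) + "\n")
        self.done_ids.add(meta_data["id"])
```

Because the cleaner is passed in as an object, switching benchmarks in this sketch means swapping the `answer_clean` argument while the logging and resume logic stay the same.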

Usage

Apply this principle when designing an evaluation system that needs to support multiple benchmarks with shared infrastructure. The callback pattern allows Hydra to instantiate the correct evaluation class from configuration.

Theoretical Basis

The pipeline follows a layered architecture:

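# Abstract pipeline structure (NOT the real implementation)
class BaseCallback:
    def __call__(self, meta_data, model_output):
        # Per-sample step: parse the raw vLLM output, clean it, log it.
        response = parse_vllm_output(model_output)
        cleaned = self.answer_clean(response)
        self.log(cleaned)

    def get_results(self):
        # End-of-run step: aggregate over all logged predictions and persist.
        metrics = self.compute_metrics(self.predictions)
        self.save(metrics)
        return metrics

class MathCallback(BaseCallback):
    def get_results(self):
        # Override: score each prediction, then add the majority-vote answer.
        for pred in self.predictions:
            ground_truth = pred["label"]  # illustrative key name
            pred["res"] = self.eval_fn(pred, ground_truth)
            pred["sc_res"] = majority_vote(pred["all_preds"])
        return aggregate_metrics(self.predictions)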

Key design decisions:

Resumable logging: JSONL append-mode for crash recovery
Pluggable cleaners: Answer extraction is a strategy parameter
Majority voting: Self-consistency via Counter.most_common
Multi-sample: Support for n>1 generation with pass@k
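The last two decisions can be sketched in a few lines of Python. This is a hedged illustration, not the project's code: the helper names `majority_vote` and `pass_at_k` are hypothetical, dropping empty predictions before voting is an assumption, and the pass@k formula shown is the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), which the source does not confirm as the one it uses.

```python
from collections import Counter
from math import comb


def majority_vote(all_preds):
    """Return the most frequent non-empty prediction (maj@k)."""
    # Assumption: empty strings mean "no answer extracted" and do not vote.
    votes = Counter(p for p in all_preds if p != "")
    if not votes:
        return ""
    # most_common(1) returns [(prediction, count)]; ties go to the first-seen item.
    return votes.most_common(1)[0][0]


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with four samples of which one is correct, `pass_at_k(4, 1, 1)` evaluates to 0.25, and `majority_vote(["4", "5", "4", ""])` returns `"4"`.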

Related Pages

Implementation:Sail_sg_LongSpec_OpenAI_Eval_Callbacks
