
Principle:Sail sg LongSpec Evaluation Callback Pipeline

From Leeroopedia
Knowledge Sources
Domains: NLP, Evaluation, Software_Architecture
Last Updated: 2026-02-14 05:00 GMT

Overview

An architectural principle that structures model evaluation as a callback pipeline, in which prediction collection, answer cleaning, equivalence checking, and metrics aggregation are composed through pluggable callback classes.

Description

The Evaluation Callback Pipeline pattern structures the evaluation workflow around a shared base class: the base callback handles vLLM output parsing and prediction logging, while specialized subclasses override only the evaluation logic for their benchmark. The same infrastructure (resumable logging, vLLM output handling, pass@k computation) is therefore reused across multiple-choice QA (MCQA), math, code, and MathScale benchmarks. The pipeline supports majority voting (maj@k) for self-consistency evaluation and per-difficulty metrics for code benchmarks. Answer cleaning is kept separate from evaluation through pluggable cleaner classes, so an extraction rule can be swapped without touching the equivalence check.
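The separation between cleaning and scoring, together with maj@k and pass@k, can be illustrated with a short sketch. The names `LastNumberCleaner`, `majority_vote`, and `pass_at_k` are hypothetical and chosen for illustration. The page does not say which pass@k formula the pipeline uses, so the sketch assumes the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k).

```python
import re
from collections import Counter
from math import comb


class LastNumberCleaner:
    """Hypothetical cleaner: keep the last number that appears in a response."""

    def __call__(self, response: str) -> str:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return numbers[-1] if numbers else ""


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Assumed estimator: 1 - C(n-c, k) / C(n, k) for n samples, c of them correct."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Three sampled responses for one problem whose reference answer is "12".
responses = ["So the answer is 12.", "I get 15", "Therefore x = 12"]
cleaner = LastNumberCleaner()
cleaned = [cleaner(r) for r in responses]           # ["12", "15", "12"]
voted = majority_vote(cleaned)                      # "12"
num_correct = sum(ans == "12" for ans in cleaned)   # 2
estimate = pass_at_k(n=len(cleaned), c=num_correct, k=1)
```

Because the cleaner is an ordinary callable, a different benchmark could pass in another extraction rule while reusing the voting and pass@k helpers unchanged.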

Usage

Apply this principle when designing an evaluation system that needs to support multiple benchmarks with shared infrastructure. The callback pattern allows Hydra to instantiate the correct evaluation class from configuration.
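Hydra selects a class through the `_target_` key of a config node and builds it with `hydra.utils.instantiate`. To stay dependency-free, the sketch below replaces that lookup with a plain dictionary registry, so it is a simplified stand-in rather than Hydra's mechanism. All class and function names (`McqaCallback`, `MathCallback`, `CALLBACKS`, `build_callback`) are hypothetical.

```python
class BaseCallback:
    """Shared infrastructure; subclasses supply the benchmark-specific check."""

    def __init__(self, answer_clean=str.strip):
        # The cleaner is a strategy parameter; str.strip is only a default.
        self.answer_clean = answer_clean

    def evaluate(self, response: str, label: str) -> bool:
        return self.is_equivalent(self.answer_clean(response), label)

    def is_equivalent(self, pred: str, label: str) -> bool:
        raise NotImplementedError


class McqaCallback(BaseCallback):
    def is_equivalent(self, pred, label):
        # Multiple choice: compare option letters case-insensitively.
        return pred.upper() == label.upper()


class MathCallback(BaseCallback):
    def is_equivalent(self, pred, label):
        # Math: compare numerically so "0.50" matches "0.5".
        try:
            return abs(float(pred) - float(label)) < 1e-9
        except ValueError:
            return pred == label


# Hypothetical registry standing in for Hydra's `_target_` lookup.
CALLBACKS = {"mcqa": McqaCallback, "math": MathCallback}


def build_callback(cfg: dict) -> BaseCallback:
    """Pick the class named in the config; pass the remaining keys as kwargs."""
    kwargs = {k: v for k, v in cfg.items() if k != "name"}
    return CALLBACKS[cfg["name"]](**kwargs)


callback = build_callback({"name": "math"})
print(callback.evaluate(" 0.50 ", "0.5"))  # True
```

Switching benchmarks then means changing one config value; the calling code that feeds responses into `evaluate` stays the same.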

Theoretical Basis

The pipeline follows a layered architecture:

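# Abstract pipeline structure (NOT the real implementation).
# parse_vllm_output, majority_vote, and aggregate_metrics are placeholder helpers.
class BaseCallback:
    def __call__(self, meta_data, model_output):
        # Parse the raw vLLM output, clean the answer, and log the prediction.
        response = parse_vllm_output(model_output)
        cleaned = self.answer_clean(response)
        self.log(cleaned)

    def get_results(self):
        # Aggregate metrics over all logged predictions and persist them.
        metrics = self.compute_metrics(self.predictions)
        self.save(metrics)
        return metrics

class MathCallback(BaseCallback):
    def get_results(self):
        # Override: score each prediction and record the majority-voted answer.
        for pred in self.predictions:
            ground_truth = pred["label"]  # placeholder key for the reference answer
            pred["res"] = self.eval_fn(pred, ground_truth)
            pred["sc_res"] = majority_vote(pred["all_preds"])
        return aggregate_metrics(self.predictions)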

Key design decisions:

  1. Resumable logging: JSONL append-mode for crash recovery
  2. Pluggable cleaners: Answer extraction is a strategy parameter
  3. Majority voting: Self-consistency via Counter.most_common
  4. Multi-sample: Support for n>1 generation with pass@k
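The first design decision, resumable logging, can be sketched as an append-only JSONL file that is re-read on startup to find finished examples. This is a minimal sketch under assumptions: the class name `ResumableLog` and the `id` field are hypothetical, and the page does not describe the actual record layout.

```python
import json
import os
import tempfile


class ResumableLog:
    """Append-only JSONL log that remembers which example ids are recorded."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        last = ""
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for last in f:
                    try:
                        self.done.add(json.loads(last)["id"])
                    except (json.JSONDecodeError, KeyError, TypeError):
                        continue  # skip a blank or crash-truncated line
        # Terminate a partial final line so new records start on a fresh line.
        if last and not last.endswith("\n"):
            with open(path, "a", encoding="utf-8") as f:
                f.write("\n")

    def append(self, record: dict) -> bool:
        """Write the record unless its id is logged; return True if written."""
        if record["id"] in self.done:
            return False
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        self.done.add(record["id"])
        return True


path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
log = ResumableLog(path)
log.append({"id": 0, "pred": "12"})
log.append({"id": 1, "pred": "7"})

# Simulate a restart: a new instance reloads the ids and skips finished work.
resumed = ResumableLog(path)
pending = [i for i in range(4) if i not in resumed.done]  # [2, 3]
```

Opening the file in append mode for every record keeps earlier lines intact if the process dies mid-run, at the cost of one file open per prediction.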
