Principle:Sail sg LongSpec Evaluation Callback Pipeline
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Software_Architecture |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Architectural principle for structuring model evaluation as a callback pipeline where prediction collection, answer cleaning, equivalence checking, and metrics aggregation are composed through pluggable callback classes.
Description
The Evaluation Callback Pipeline pattern structures the evaluation workflow as a layered class hierarchy: a base callback class handles vLLM output parsing and prediction logging, while specialized subclasses override the evaluation logic for individual benchmarks. The same infrastructure (resumable logging, vLLM output handling, pass@k computation) is thereby reused across MCQA, math, code, and MathScale benchmarks. The pipeline supports majority voting (maj@k) for self-consistency evaluation and per-difficulty metrics for code benchmarks. Answer cleaning is kept separate from evaluation through pluggable cleaner classes.
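The division of labor described above can be illustrated with a small runnable miniature. This is a sketch under stated assumptions, not the project's implementation: `last_token_cleaner`, `ExactMatchCallback`, the record fields (`id`, `label`, `all_preds`, `res`, `sc_res`), and the use of plain strings in place of parsed vLLM outputs are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List


def last_token_cleaner(response: str) -> str:
    """Hypothetical cleaner: treat the final whitespace-separated token as the answer."""
    tokens = response.strip().split()
    return tokens[-1].rstrip(".") if tokens else ""


class BaseCallback:
    """Collects cleaned predictions; subclasses decide how to score them."""

    def __init__(self, answer_clean: Callable[[str], str]):
        self.answer_clean = answer_clean  # pluggable cleaner (strategy)
        self.predictions: List[Dict] = []

    def __call__(self, meta_data: Dict, responses: List[str]) -> None:
        # `responses` stands in for the texts parsed out of a vLLM output object.
        cleaned = [self.answer_clean(r) for r in responses]
        self.predictions.append(
            {"id": meta_data["id"], "label": meta_data["label"], "all_preds": cleaned}
        )

    def get_results(self) -> Dict[str, float]:
        raise NotImplementedError


class ExactMatchCallback(BaseCallback):
    """Hypothetical subclass: exact-match accuracy plus majority voting (maj@k)."""

    def get_results(self) -> Dict[str, float]:
        if not self.predictions:
            return {"acc": 0.0, "maj_acc": 0.0}
        for pred in self.predictions:
            # The first sample is scored on its own ...
            pred["res"] = pred["all_preds"][0] == pred["label"]
            # ... and the most frequent answer across samples gives the voted result.
            voted = Counter(pred["all_preds"]).most_common(1)[0][0]
            pred["sc_res"] = voted == pred["label"]
        n = len(self.predictions)
        return {
            "acc": sum(p["res"] for p in self.predictions) / n,
            "maj_acc": sum(p["sc_res"] for p in self.predictions) / n,
        }


callback = ExactMatchCallback(answer_clean=last_token_cleaner)
callback({"id": 0, "label": "4"}, ["The answer is 5.", "So we get 4.", "It is 4"])
callback({"id": 1, "label": "7"}, ["Answer: 7", "Answer: 7", "Answer: 9"])
results = callback.get_results()
```

In this toy run the first sample of question 0 is wrong but the vote recovers the right answer, which is the effect self-consistency evaluation is meant to capture. Swapping the cleaner only requires passing a different callable to the constructor.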
Usage
Apply this principle when designing an evaluation system that needs to support multiple benchmarks with shared infrastructure. The callback pattern allows Hydra to instantiate the correct evaluation class from configuration.
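Hydra's `hydra.utils.instantiate` builds an object from a config node whose `_target_` key holds a dotted import path, with the remaining keys passed as keyword arguments. The sketch below mimics that idea with the standard library only, so it runs without Hydra installed; it is not Hydra's implementation, and `instantiate_from_config` and the stand-in target are assumptions made for illustration.

```python
import importlib
from typing import Any, Dict


def instantiate_from_config(cfg: Dict[str, Any]) -> Any:
    """Build an object from a dict with a dotted `_target_` path plus keyword arguments."""
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    module_name, _, attr_name = cfg["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr_name)
    return target(**kwargs)


# Stand-in target: a real config would name an evaluation callback class instead.
cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_from_config(cfg)
```

Because the class is chosen by a string in configuration, switching benchmarks becomes a config change rather than a code change, provided every callback exposes the same call interface.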
Theoretical Basis
The pipeline follows a layered architecture:
# Abstract pipeline structure (NOT real implementation)
class BaseCallback:
    def __call__(self, meta_data, model_output):
        response = parse_vllm_output(model_output)
        cleaned = self.answer_clean(response)
        self.log(cleaned)

    def get_results(self):
        metrics = self.compute_metrics(self.predictions)
        self.save(metrics)

class MathCallback(BaseCallback):
    def get_results(self):
        for pred in self.predictions:
            pred["res"] = self.eval_fn(pred, ground_truth)
            pred["sc_res"] = majority_vote(pred["all_preds"])
        return aggregate_metrics()
Key design decisions:
- Resumable logging: JSONL append-mode for crash recovery
- Pluggable cleaners: Answer extraction is a strategy parameter
- Majority voting: Self-consistency via Counter.most_common
- Multi-sample: Support for n>1 generation with pass@k
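Two of the decisions above, resumable logging and pass@k, can be sketched as follows. The helper names and the record layout are hypothetical. The source does not say which pass@k formula is used; this sketch assumes the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones.

```python
import json
import os
import tempfile
from math import comb
from typing import Dict, Set


def load_finished_ids(path: str) -> Set[int]:
    """Read the ids already logged so a restarted run can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def append_prediction(path: str, record: Dict) -> None:
    """Append one JSON record per line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples drawn from n (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
append_prediction(log_path, {"id": 0, "res": True})
append_prediction(log_path, {"id": 1, "res": False})
# Simulated restart: only ids missing from the log still need to be evaluated.
todo = [i for i in range(4) if i not in load_finished_ids(log_path)]
```

Append mode means a crash loses at most the record being written, and the set of logged ids is enough to resume without re-running finished examples.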