Implementation:Sail sg LongSpec APPS Code Evaluator
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Code_Generation |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for evaluating model-generated code solutions against test cases using the APPS benchmark framework with multi-threaded execution.
Description
The code.py module provides two main classes. APPsEvaluator runs APPS benchmark test cases against model-generated code solutions using thread-pooled execution. CodeExtractor collects, cleans, and evaluates code predictions, with support for vLLM output parsing, resumable logging, and configurable answer cleaning. The evaluator reports accuracy, pass@k, and per-difficulty metrics.
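The thread-pooled execution described above can be pictured with the following minimal sketch. It is not the module's actual implementation: `evaluate_in_threads` and `check_one` are hypothetical names, and the toy checker only compares strings instead of executing code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def evaluate_in_threads(
    predictions: List[Dict],
    check_one: Callable[[Dict], bool],
    num_workers: int = 16,
) -> List[Dict]:
    """Apply check_one to every prediction using a pool of worker threads.

    check_one is a hypothetical stand-in for whatever runs a solution
    against its test cases; it is not a function from code.py.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, so outcomes line up with predictions.
        outcomes = list(pool.map(check_one, predictions))
    # Return copies of the inputs with a "res" field attached.
    return [dict(p, res=ok) for p, ok in zip(predictions, outcomes)]


# Toy checker: a prediction "passes" when its text equals the expected answer.
demo = [
    {"id": 0, "pred": "42", "expected": "42"},
    {"id": 1, "pred": "41", "expected": "42"},
]
scored = evaluate_in_threads(demo, lambda p: p["pred"] == p["expected"], num_workers=2)
```

Threads suit this workload when each check mostly waits on a subprocess or on I/O; the sketch makes no claim about how the real evaluator isolates or times out untrusted code.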
Usage
Import these classes when you need to evaluate code generation models on the APPS benchmark. CodeExtractor is used as the post-processing callback during evaluation, and APPsEvaluator runs the actual test case execution.
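The resumable logging mentioned in the Description is not specified in detail here. One way such a callback could avoid reprocessing finished items is sketched below, assuming a JSON Lines log keyed by the `index_field` value. The helpers `load_finished` and `append_new` and the file format are assumptions for illustration, not the format CodeExtractor actually writes.

```python
import json
import os
from typing import Dict, List, Set


def load_finished(output_file: str, index_field: str = "index") -> Set:
    """Return the index values already present in a JSON Lines log (hypothetical format)."""
    if not os.path.exists(output_file):
        return set()
    done = set()
    with open(output_file, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                done.add(json.loads(line)[index_field])
    return done


def append_new(output_file: str, records: List[Dict], index_field: str = "index") -> int:
    """Append only records whose index is not logged yet; return how many were written."""
    done = load_finished(output_file, index_field)
    written = 0
    with open(output_file, "a", encoding="utf-8") as f:
        for rec in records:
            if rec[index_field] in done:
                continue  # already logged by an earlier (interrupted) run
            f.write(json.dumps(rec) + "\n")
            done.add(rec[index_field])
            written += 1
    return written
```

Calling `append_new` twice with overlapping batches writes each index only once, which is the property a resumed run relies on.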
Code Reference
Source Location
- Repository: Sail_sg_LongSpec
- File: longspec/train/post_processors/code/code.py
- Lines: 1-343
Signature
class APPsEvaluator:
    def __init__(self):
        """Initialize APPS evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions against test cases.
        Returns (predictions_with_results, metrics_dict).
        metrics_dict contains: acc, pass@k, correct, total, per-difficulty breakdowns.
        """

class CodeExtractor:
    def __init__(
        self,
        output_file: str,
        answer_clean: Callable,
        resume: bool = False,
        index_field: str = "index",
        test_case_field: str = "input_output",
        evaluator: Callable = None,
        num_workers: int = 8,
        saved_keys: List[str] = None,
        completion_separator: str = None,
    ):
        """Code extraction callback with resumable logging."""

    def __call__(self, meta_data: Dict[str, Any], batch_model_outputs: Dict[str, Any], **kwargs):
        """Process a single batch: extract code, clean, and log."""

    def get_results(self) -> Tuple[dict, list]:
        """Run evaluation on all collected predictions and save results."""
Import
from post_processors.code.code import APPsEvaluator, CodeExtractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| predictions | list[dict] | Yes | Argument to APPsEvaluator.__call__: list of prediction dicts with "pred", "test_cases", "id" keys |
| num_workers | int | No | Number of threads for parallel evaluation (default 16 in APPsEvaluator.__call__, 8 in CodeExtractor) |
| output_file | str | Yes | CodeExtractor argument: path to save evaluation results JSON |
| answer_clean | Callable | Yes | CodeExtractor argument: function to clean raw model output into executable code |
| evaluator | Callable | No | CodeExtractor argument: custom evaluator callable (signature default is None; default uses APPsEvaluator) |
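As an illustration of what an `answer_clean` callable might do, the sketch below pulls the last fenced code block out of a raw completion and falls back to the stripped text when no fence is present. `extract_code_block` is a hypothetical helper, not a function shipped with the repository, and the fence handling is an assumption about typical model output.

```python
import re

# Build the fence marker programmatically so this snippet can itself be quoted in markdown.
FENCE = "`" * 3
_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code_block(raw_output: str) -> str:
    """Return the last fenced code block in raw_output, or the stripped text if none is found."""
    blocks = _BLOCK.findall(raw_output)
    if blocks:
        return blocks[-1].strip()
    return raw_output.strip()


raw = "Here is the solution:\n" + FENCE + "python\nprint(42)\n" + FENCE + "\nDone."
cleaned = extract_code_block(raw)
```

A function like this could be passed as `answer_clean=extract_code_block`; taking the last block is a design choice that favours a model's final answer over earlier drafts.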
Outputs
| Name | Type | Description |
|---|---|---|
| predictions | list[dict] | Input predictions augmented with "res" and "full_res" fields |
| metrics | dict | Contains acc, pass@k, correct, total, and per-difficulty breakdowns |
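The page does not state how these metrics are computed. The sketch below shows one plausible aggregation under explicit assumptions: "res" is a boolean per sample, a hypothetical "difficulty" key labels each problem, pass@k uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), and the `acc/<level>` key names are invented for illustration.

```python
from collections import defaultdict
from math import comb
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(predictions: List[Dict], k: int = 1) -> Dict[str, float]:
    """Aggregate sample outcomes into overall and per-difficulty metrics."""
    outcomes = defaultdict(list)  # problem id -> list of pass/fail flags
    level_of = {}                 # problem id -> difficulty label (hypothetical key)
    for p in predictions:
        outcomes[p["id"]].append(bool(p["res"]))
        level_of[p["id"]] = p["difficulty"]

    correct = sum(sum(v) for v in outcomes.values())
    total = sum(len(v) for v in outcomes.values())
    metrics = {
        "correct": correct,
        "total": total,
        "acc": correct / total,
        f"pass@{k}": sum(pass_at_k(len(v), sum(v), k) for v in outcomes.values()) / len(outcomes),
    }

    # Sample-level accuracy within each difficulty bucket.
    per_level = defaultdict(list)
    for pid, flags in outcomes.items():
        per_level[level_of[pid]].extend(flags)
    for level, flags in per_level.items():
        metrics[f"acc/{level}"] = sum(flags) / len(flags)
    return metrics


samples = [
    {"id": 0, "difficulty": "introductory", "res": True},
    {"id": 0, "difficulty": "introductory", "res": False},
    {"id": 1, "difficulty": "interview", "res": False},
    {"id": 1, "difficulty": "interview", "res": False},
]
metrics = summarize(samples, k=1)
```

With one sample per problem and k = 1, this pass@k reduces to plain accuracy.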
Usage Examples
from post_processors.code.code import APPsEvaluator, CodeExtractor

# Direct evaluation
evaluator = APPsEvaluator()
predictions = [
    {
        # The test case compares program output, so the solution prints its
        # answer rather than returning it from a function that is never called.
        "pred": "print(42)",
        "test_cases": {"inputs": [""], "outputs": ["42"]},
        "id": 0,
    }
]
results, metrics = evaluator(predictions, num_workers=8)
print(metrics)  # dict with acc, pass@k, correct, total, and per-difficulty entries

# As callback in evaluation pipeline
extractor = CodeExtractor(
    output_file="results/apps_eval.json",
    answer_clean=lambda x: x,  # identity: assumes the model output is already plain code
    evaluator=evaluator,
    num_workers=8,
)