Implementation:Sail sg LongSpec APPS Code Evaluator

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Evaluation, Code_Generation
Last Updated	2026-02-14 05:00 GMT

Overview

Concrete tool for evaluating model-generated code solutions against test cases using the APPS benchmark framework with multi-threaded execution.

Description

The code.py module provides two main classes. APPsEvaluator runs APPS benchmark test cases against model-generated code solutions using thread-pooled execution. CodeExtractor collects, cleans, and evaluates code predictions, with support for vLLM output parsing, resumable logging, and configurable answer cleaning. The evaluator computes accuracy, pass@k, and per-difficulty metrics.
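To illustrate the thread-pooled execution described above, here is a minimal sketch using the standard library's ThreadPoolExecutor. It is not the code from code.py: the names evaluate_in_threads and toy_check are hypothetical, and the toy checker only compares strings where a real evaluator would execute code against test cases.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def evaluate_in_threads(
    predictions: List[Dict],
    check_one: Callable[[Dict], bool],
    num_workers: int = 16,
) -> List[Dict]:
    """Run check_one over every prediction using a thread pool."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so outcomes line up with predictions.
        outcomes = list(pool.map(check_one, predictions))
    # Return copies so the caller's input dicts are not mutated.
    return [dict(p, res=ok) for p, ok in zip(predictions, outcomes)]


def toy_check(pred: Dict) -> bool:
    # Hypothetical checker: stands in for real test-case execution.
    return pred["pred"].strip() == pred["expected"]


toy_results = evaluate_in_threads(
    [
        {"id": 0, "pred": "42", "expected": "42"},
        {"id": 1, "pred": "7", "expected": "8"},
    ],
    toy_check,
    num_workers=2,
)
```

Threads suit this workload when each check mostly waits on a subprocess or I/O; whether code.py isolates execution in subprocesses is not stated here.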

Usage

Import these classes when you need to evaluate code generation models on the APPS benchmark. CodeExtractor is used as the post-processing callback during evaluation, and APPsEvaluator runs the actual test case execution.
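The callback pattern with resumable logging can be sketched as follows. This is an assumption-laden illustration, not the CodeExtractor implementation: ResumableCollector, its JSON-lines log format, and the batch shape are all hypothetical. Only the idea of skipping already-logged items by their "index" field is taken from the description above.

```python
import json
import os
import tempfile
from typing import Dict, List


class ResumableCollector:
    """Hypothetical stand-in: logs one JSON line per prediction, skips seen indices."""

    def __init__(self, log_path: str, resume: bool = False):
        self.log_path = log_path
        self.predictions: List[Dict] = []
        if resume and os.path.exists(log_path):
            # Reload predictions written by an earlier, interrupted run.
            with open(log_path, encoding="utf-8") as f:
                self.predictions = [json.loads(line) for line in f if line.strip()]
        else:
            open(log_path, "w", encoding="utf-8").close()  # start a fresh log
        self.seen = {p["index"] for p in self.predictions}

    def __call__(self, batch: List[Dict]) -> None:
        with open(self.log_path, "a", encoding="utf-8") as f:
            for item in batch:
                if item["index"] in self.seen:
                    continue  # already logged in an earlier run
                f.write(json.dumps(item) + "\n")
                self.predictions.append(item)
                self.seen.add(item["index"])


log_path = os.path.join(tempfile.mkdtemp(), "preds.jsonl")
first = ResumableCollector(log_path)
first([{"index": 0, "pred": "print(1)"}])
second = ResumableCollector(log_path, resume=True)  # simulates a restart
second([{"index": 0, "pred": "print(1)"}, {"index": 1, "pred": "print(2)"}])
```

After the simulated restart, the second collector holds both items, and index 0 is written to the log only once.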

Code Reference

Source Location

Repository: Sail_sg_LongSpec
File: longspec/train/post_processors/code/code.py
Lines: 1-343

Signature

from typing import Any, Callable, Dict, List, Tuple

class APPsEvaluator:
    def __init__(self):
        """Initialize APPS evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions against test cases.
        Returns (predictions_with_results, metrics_dict).
        metrics_dict contains: acc, pass@k, correct, total, per-difficulty breakdowns.
        """

class CodeExtractor:
    def __init__(
        self,
        output_file: str,
        answer_clean: Callable,
        resume: bool = False,
        index_field: str = "index",
        test_case_field: str = "input_output",
        evaluator: Callable = None,
        num_workers: int = 8,
        saved_keys: List[str] = None,
        completion_separator: str = None,
    ):
        """Code extraction callback with resumable logging."""

    def __call__(self, meta_data: Dict[str, Any], batch_model_outputs: Dict[str, Any], **kwargs):
        """Process a single batch: extract code, clean, and log."""

    def get_results(self) -> Tuple[dict, list]:
        """Run evaluation on all collected predictions and save results."""

Import

from post_processors.code.code import APPsEvaluator, CodeExtractor

I/O Contract

Inputs

Name	Type	Required	Description
predictions	list[dict]	Yes	List of prediction dicts with "pred", "test_cases", "id" keys
num_workers	int	No	Number of threads for parallel evaluation (default 16 in APPsEvaluator.__call__, 8 in CodeExtractor)
output_file	str	Yes	Path to save evaluation results JSON
answer_clean	Callable	Yes	Function to clean raw model output into executable code
evaluator	Callable	No	Custom evaluator function (default uses APPsEvaluator)
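As an illustration of what an answer_clean callable might look like, the sketch below extracts the last Markdown-fenced code block from a raw completion and falls back to the raw text when no fence is present. The function name clean_answer and the choice of the last block are assumptions; the cleaning functions actually shipped with the repository may behave differently.

```python
import re

# The fence marker is built programmatically to avoid a literal fence in this example.
FENCE = "`" * 3

# Matches an optional "python"/"py" language tag, then captures the block body.
_CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def clean_answer(raw: str) -> str:
    """Return the last fenced code block in raw, or raw itself if none is found."""
    blocks = _CODE_BLOCK.findall(raw)
    return (blocks[-1] if blocks else raw).strip()


sample = "Here is my solution:\n" + FENCE + "python\nprint(42)\n" + FENCE + "\nDone."
cleaned = clean_answer(sample)
```

A callable of this shape could be passed as answer_clean in place of the identity lambda when model outputs wrap code in Markdown fences.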

Outputs

Name	Type	Description
predictions	list[dict]	Input predictions augmented with "res" and "full_res" fields
metrics	dict	Contains acc, pass@k, correct, total, and per-difficulty breakdowns
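The metrics above can be sketched as follows. This page does not say which pass@k estimator code.py uses; the version below is the widely used unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples per problem and c the number that pass. The "difficulty" key, the acc_<level> metric names, and the assumption that "res" is a boolean are hypothetical choices for illustration.

```python
from collections import defaultdict
from math import comb
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(predictions: List[Dict]) -> Dict[str, float]:
    """Compute overall and per-difficulty accuracy from boolean "res" fields."""
    total = len(predictions)
    correct = sum(1 for p in predictions if p["res"])
    by_level = defaultdict(list)
    for p in predictions:
        # "difficulty" is an assumed key; missing values are grouped as "unknown".
        by_level[p.get("difficulty", "unknown")].append(p["res"])
    metrics = {"acc": correct / total if total else 0.0, "correct": correct, "total": total}
    for level, flags in by_level.items():
        metrics[f"acc_{level}"] = sum(flags) / len(flags)
    return metrics


demo = summarize(
    [
        {"res": True, "difficulty": "introductory"},
        {"res": False, "difficulty": "interview"},
        {"res": True, "difficulty": "interview"},
    ]
)
```

With one sample per problem (n = 1, k = 1), pass@k reduces to plain accuracy, which is why the two can coincide in single-sample evaluations.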

Usage Examples

from post_processors.code.code import APPsEvaluator, CodeExtractor

# Direct evaluation
evaluator = APPsEvaluator()
predictions = [
    # Assumes stdin/stdout-style checking, where the program's printed output is compared.
    {"pred": "print(42)", "test_cases": {"inputs": [""], "outputs": ["42"]}, "id": 0}
]
results, metrics = evaluator(predictions, num_workers=8)
print(metrics)  # dict with keys such as "acc", "pass@k", "correct", "total"

# As callback in evaluation pipeline
extractor = CodeExtractor(
    output_file="results/apps_eval.json",
    answer_clean=lambda x: x,
    evaluator=evaluator,
    num_workers=8,
)

Related Pages

Environment:Sail_sg_LongSpec_Training_Environment
