Implementation:Sail sg LongSpec HumanEval MBPP Evaluator

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Evaluation, Code_Generation
Last Updated	2026-02-14 05:00 GMT

Overview

A concrete tool for evaluating code generation models on the HumanEval and MBPP benchmarks. It runs generated code in multi-threaded, sandboxed execution and reports accuracy and pass@k metrics.

Description

The evaluator.py module provides two evaluator classes: HumanEvaluator for the HumanEval benchmark and MBPPEvaluator for the MBPP benchmark. Both use a thread pool to run model-generated code against test cases in parallel. HumanEvaluator builds a problem dict with prompt, test, and entry_point for each prediction and hands it to the HumanEval execution harness. MBPPEvaluator concatenates the generated code with the test cases and hands the result to the MBPP execution harness. Both report two metrics: accuracy, which counts a problem as solved only if its first sample passes, and pass@k, which counts a problem as solved if any of its samples passes.
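The difference between the two input-preparation steps can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names build_humaneval_problem and build_mbpp_program are hypothetical, and the "task_id" key and the blank-line separator are assumptions.

```python
def build_humaneval_problem(prediction: dict) -> dict:
    """Map a prediction dict to the fields a HumanEval-style harness expects."""
    return {
        "task_id": prediction["id"],  # assumption: the harness keys problems by id
        "prompt": prediction["prompt"],
        "test": prediction["test_cases"],
        "entry_point": prediction["entry_point"],
    }


def build_mbpp_program(prediction: dict) -> str:
    """Join generated code and test cases into one script for MBPP-style execution."""
    # Assumption: a blank line is enough to separate the code from the asserts.
    return prediction["pred"] + "\n\n" + prediction["test_cases"]
```

The HumanEval path keeps the pieces separate so the harness can assemble and call the entry point itself, while the MBPP path produces a single script whose assert statements act as the test.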

Usage

Import these evaluator classes when evaluating code generation models. They are passed as the evaluator parameter to CodeExtractor for end-to-end code evaluation.

Code Reference

Source Location

Repository: Sail_sg_LongSpec
File: longspec/train/post_processors/code/evaluator.py
Lines: 1-173

Signature

from typing import Callable, Tuple

def return_apps_evaluator(timeout: int = 10, debug: bool = False) -> Callable:
    """Return a partial function wrapping APPS check_correctness."""

class HumanEvaluator:
    def __init__(self):
        """Initialize HumanEval evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions on HumanEval.
        Each prediction needs: pred, test_cases, prompt, entry_point, id.
        Returns (predictions_with_results, metrics_dict).
        """

class MBPPEvaluator:
    def __init__(self):
        """Initialize MBPP evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions on MBPP.
        Each prediction needs: pred, test_cases, id.
        Returns (predictions_with_results, metrics_dict).
        """
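The "partial function" pattern named in the return_apps_evaluator docstring can be illustrated with functools.partial. The sketch below is an assumption about the general shape only: check_correctness_stub is a hypothetical stand-in for the APPS check_correctness function, whose real signature is not shown on this page.

```python
from functools import partial
from typing import Callable


def check_correctness_stub(sample: dict, generation: str,
                           timeout: int = 10, debug: bool = False) -> bool:
    """Hypothetical stand-in for the APPS check_correctness function."""
    # A real checker would run `generation` against the sample's tests
    # and give up after `timeout` seconds. This stub only checks for non-empty code.
    return bool(generation.strip())


def make_evaluator(timeout: int = 10, debug: bool = False) -> Callable:
    """Bind timeout and debug once so callers pass only per-sample arguments."""
    return partial(check_correctness_stub, timeout=timeout, debug=debug)


evaluator = make_evaluator(timeout=5)
passed = evaluator({"id": 0}, "print('hi')")
```

Binding the configuration up front lets the returned callable be handed to other components without them needing to know about timeout or debug.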

Import

from post_processors.code.evaluator import HumanEvaluator, MBPPEvaluator, return_apps_evaluator

I/O Contract

Inputs

Name	Type	Required	Description
predictions	list[dict]	Yes	List of prediction dicts with "pred", "test_cases", "id" (and "prompt", "entry_point" for HumanEval)
num_workers	int	No	Number of threads for parallel execution (default 16)
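How num_workers might drive parallel execution can be sketched with the standard library's ThreadPoolExecutor. This is an illustration under stated assumptions, not the evaluator's implementation: run_one and run_parallel are hypothetical names, and run_one uses a plain exec call that is not sandboxed and enforces no timeout.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_one(prediction: dict) -> bool:
    """Hypothetical checker: True if pred plus test_cases runs without raising."""
    program = prediction["pred"] + "\n" + prediction["test_cases"]
    try:
        exec(program, {})  # NOT sandboxed; for illustration only
        return True
    except Exception:
        return False


def run_parallel(predictions: list, num_workers: int = 16) -> dict:
    """Run every prediction on a thread pool and collect results keyed by id."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {pool.submit(run_one, p): p["id"] for p in predictions}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Keying results by id matters because as_completed yields futures in completion order, which generally differs from submission order.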

Outputs

Name	Type	Description
predictions	list[dict]	Input predictions augmented with "res" field (bool or list[bool])
metrics	dict	Contains acc, pass@k, correct, total
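The relationship between the "res" field and the reported metrics can be sketched as below. This is a minimal reading of "accuracy (first sample)" and "pass@k (any sample correct)", not the module's code: the function name summarize is hypothetical, and treating "correct" as the count of first-sample passes is an assumption.

```python
def summarize(predictions: list) -> dict:
    """Compute first-sample accuracy and any-sample pass@k from "res" fields."""
    total = len(predictions)
    first_ok = 0
    any_ok = 0
    for p in predictions:
        res = p["res"]
        # "res" may be a single bool or a list of bools, one per sample.
        samples = res if isinstance(res, list) else [res]
        first_ok += bool(samples and samples[0])
        any_ok += any(samples)
    return {
        "acc": first_ok / total if total else 0.0,
        "pass@k": any_ok / total if total else 0.0,
        "correct": first_ok,  # assumption: counts first-sample passes
        "total": total,
    }
```

With a single sample per problem the two metrics coincide; they diverge only when later samples pass where the first one fails.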

Usage Examples

from post_processors.code.evaluator import HumanEvaluator, MBPPEvaluator

# HumanEval evaluation
human_eval = HumanEvaluator()
predictions = [{
    "pred": "def has_close_elements(numbers, threshold):\n    for i, n1 in enumerate(numbers):\n        for j, n2 in enumerate(numbers):\n            if i != j and abs(n1 - n2) < threshold:\n                return True\n    return False",
    "test_cases": "assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False",
    "prompt": "def has_close_elements(numbers, threshold):\n",
    "entry_point": "has_close_elements",
    "id": 0,
}]
results, metrics = human_eval(predictions, num_workers=4)
print(metrics)  # dict with keys "acc", "pass@k", "correct", "total"

# MBPP evaluation
mbpp_eval = MBPPEvaluator()
predictions = [{
    "pred": "def is_even(n):\n    return n % 2 == 0",
    "test_cases": "assert is_even(4) == True\nassert is_even(3) == False",
    "id": 0,
}]
results, metrics = mbpp_eval(predictions, num_workers=4)

Related Pages

Environment:Sail_sg_LongSpec_Training_Environment
