Implementation:Sail sg LongSpec HumanEval MBPP Evaluator
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Code_Generation |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
A concrete tool for evaluating code generation models on the HumanEval and MBPP benchmarks, using multi-threaded sandboxed execution and reporting pass@k metrics.
Description
The evaluator.py module provides two evaluator classes: HumanEvaluator for the HumanEval benchmark and MBPPEvaluator for the MBPP benchmark. Both run model-generated code against test cases in parallel using a thread pool. HumanEvaluator builds a problem dict with prompt, test, and entry_point for each prediction and passes it to the HumanEval execution harness. MBPPEvaluator concatenates the generated code with its test cases and passes the result to the MBPP execution harness. Both report two metrics: accuracy (whether the first sample passes) and pass@k (whether any sample passes). The module also exposes return_apps_evaluator, which returns a partial function wrapping the APPS check_correctness.
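The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in evaluator.py: the helper names `build_humaneval_problem`, `build_mbpp_program`, `run_program`, and `evaluate_parallel` are hypothetical, and `run_program` uses a bare `exec` as a stand-in for the real execution harness, so it is neither sandboxed nor time-limited.

```python
from concurrent.futures import ThreadPoolExecutor


def build_humaneval_problem(item: dict) -> dict:
    """Assemble a problem dict with prompt, test, and entry_point (hypothetical helper)."""
    return {
        "prompt": item["prompt"],
        "test": item["test_cases"],
        "entry_point": item["entry_point"],
    }


def build_mbpp_program(item: dict) -> str:
    """Concatenate generated code with its test cases (hypothetical helper)."""
    return item["pred"] + "\n" + item["test_cases"]


def run_program(program: str) -> bool:
    """Stand-in for the execution harness. Assumption: NOT sandboxed, no timeout."""
    try:
        exec(program, {})  # fresh globals so programs do not share state
        return True
    except Exception:
        return False


def evaluate_parallel(programs: list, num_workers: int = 16) -> list:
    """Run every program in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run_program, programs))


tests = "assert is_even(4)\nassert not is_even(3)"
items = [
    {"pred": "def is_even(n):\n    return n % 2 == 0", "test_cases": tests, "id": 0},
    {"pred": "def is_even(n):\n    return True", "test_cases": tests, "id": 1},
]
results = evaluate_parallel([build_mbpp_program(it) for it in items], num_workers=2)
print(results)  # [True, False]
```

Using `pool.map` rather than collecting futures as they complete keeps each result aligned with its input prediction, which makes it straightforward to write the outcome back onto the original dict.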
Usage
Import these evaluator classes when evaluating code generation models. They are passed as the evaluator parameter to CodeExtractor for end-to-end code evaluation.
Code Reference
Source Location
- Repository: Sail_sg_LongSpec
- File: longspec/train/post_processors/code/evaluator.py
- Lines: 1-173
Signature
def return_apps_evaluator(timeout: int = 10, debug: bool = False) -> Callable:
    """Return a partial function wrapping APPS check_correctness."""

class HumanEvaluator:
    def __init__(self):
        """Initialize HumanEval evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions on HumanEval.
        Each prediction needs: pred, test_cases, prompt, entry_point, id.
        Returns (predictions_with_results, metrics_dict).
        """

class MBPPEvaluator:
    def __init__(self):
        """Initialize MBPP evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions on MBPP.
        Each prediction needs: pred, test_cases, id.
        Returns (predictions_with_results, metrics_dict).
        """
Import
from post_processors.code.evaluator import HumanEvaluator, MBPPEvaluator, return_apps_evaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| predictions | list[dict] | Yes | List of prediction dicts with "pred", "test_cases", "id" (and "prompt", "entry_point" for HumanEval) |
| num_workers | int | No | Number of threads for parallel execution (default 16) |
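Because a missing key would otherwise only surface inside a worker thread, it can help to check prediction dicts before calling an evaluator. The sketch below is an assumption-based convenience, not part of evaluator.py: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the key sets are taken from the Inputs table above.

```python
# Hypothetical pre-flight check; key sets mirror the Inputs table above.
REQUIRED_KEYS = {
    "mbpp": {"pred", "test_cases", "id"},
    "humaneval": {"pred", "test_cases", "id", "prompt", "entry_point"},
}


def missing_keys(prediction: dict, benchmark: str) -> list:
    """Return the required keys absent from one prediction dict, sorted by name."""
    return sorted(REQUIRED_KEYS[benchmark] - set(prediction))


prediction = {"pred": "def f():\n    return 1", "test_cases": "assert f() == 1", "id": 0}
print(missing_keys(prediction, "mbpp"))       # []
print(missing_keys(prediction, "humaneval"))  # ['entry_point', 'prompt']
```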
Outputs
| Name | Type | Description |
|---|---|---|
| predictions | list[dict] | Input predictions augmented with "res" field (bool or list[bool]) |
| metrics | dict | Contains acc, pass@k, correct, total |
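To make the relationship between the "res" field and the metrics concrete, the following sketch recomputes a metrics dict from already-evaluated predictions. It assumes the definitions given in the Description (accuracy from the first sample, pass@k from any sample) and further assumes that correct counts first-sample passes; `summarize` is a hypothetical helper and may not match how evaluator.py aggregates internally.

```python
def summarize(predictions: list) -> dict:
    """Recompute metrics from "res" fields (bool or list[bool]).

    Assumption: "correct" counts predictions whose first sample passed.
    """
    first_correct = 0
    any_correct = 0
    for item in predictions:
        res = item["res"]
        # Normalize a single bool to a one-element list of samples.
        samples = res if isinstance(res, list) else [res]
        first_correct += bool(samples and samples[0])
        any_correct += any(samples)
    total = len(predictions)
    return {
        "acc": first_correct / total if total else 0.0,
        "pass@k": any_correct / total if total else 0.0,
        "correct": first_correct,
        "total": total,
    }


evaluated = [
    {"id": 0, "res": True},            # single sample, passes
    {"id": 1, "res": [False, True]},   # first sample fails, a later one passes
    {"id": 2, "res": [False, False]},  # no sample passes
]
metrics = summarize(evaluated)
print(metrics["correct"], metrics["total"])  # 1 3
```

Here one of three predictions passes on its first sample while two of three pass on at least one sample, so pass@k is higher than acc.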
Usage Examples
from post_processors.code.evaluator import HumanEvaluator, MBPPEvaluator
# HumanEval evaluation
human_eval = HumanEvaluator()
predictions = [{
    # Generated code; indentation inside the string must be valid Python.
    "pred": (
        "def has_close_elements(numbers, threshold):\n"
        "    for i, n1 in enumerate(numbers):\n"
        "        for j, n2 in enumerate(numbers):\n"
        "            if i != j and abs(n1 - n2) < threshold:\n"
        "                return True\n"
        "    return False"
    ),
    "test_cases": "assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False",
    "prompt": "def has_close_elements(numbers, threshold):\n",
    "entry_point": "has_close_elements",
    "id": 0,
}]
results, metrics = human_eval(predictions, num_workers=4)
print(metrics)  # dict with keys "acc", "pass@k", "correct", "total"
# MBPP evaluation
mbpp_eval = MBPPEvaluator()
predictions = [{
    "pred": "def is_even(n):\n    return n % 2 == 0",
    "test_cases": "assert is_even(4) == True\nassert is_even(3) == False",
    "id": 0,
}]
results, metrics = mbpp_eval(predictions, num_workers=4)