
Implementation:Sail sg LongSpec HumanEval MBPP Evaluator

From Leeroopedia
Knowledge Sources
Domains NLP, Evaluation, Code_Generation
Last Updated 2026-02-14 05:00 GMT

Overview

Concrete tool for evaluating code generation models on HumanEval and MBPP benchmarks using multi-threaded sandboxed execution with pass@k metrics.
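To make "sandboxed execution" concrete, here is a minimal sketch that runs a candidate program in a separate Python interpreter with a timeout. This is not the module's actual harness: the function name `passes` and the subprocess-based approach are assumptions for illustration. A subprocess gives process isolation and a time limit, not a full security sandbox.

```python
import subprocess
import sys


def passes(program: str, timeout: float = 10.0) -> bool:
    """Hypothetical helper: run `program` in a fresh interpreter.

    Returns True only if the process exits with status 0 before the timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,  # keep the candidate's output off our console
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat runaway programs (e.g. infinite loops) as failures.
        return False
    return proc.returncode == 0
```

A failing `assert` in the program raises `AssertionError`, which makes the child interpreter exit with a non-zero status, so the helper reports failure.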

Description

The evaluator.py module provides two evaluator classes: HumanEvaluator for the HumanEval benchmark and MBPPEvaluator for the MBPP benchmark. Both use a thread pool to run model-generated code against test cases in parallel. HumanEvaluator builds a problem dict with prompt, test, and entry_point for the HumanEval execution harness. MBPPEvaluator concatenates the generated code with its test cases for the MBPP execution harness. Both report two metrics: accuracy, which considers only the first sample for each problem, and pass@k, which counts a problem as solved if any of its samples is correct.
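The input preparation and thread-pooled dispatch described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper names `build_humaneval_problem`, `build_mbpp_program`, and `run_in_pool` are hypothetical, as is the `task_id` key.

```python
from concurrent.futures import ThreadPoolExecutor


def build_humaneval_problem(pred: dict) -> dict:
    """Assemble a problem dict with prompt, test, and entry_point."""
    return {
        "task_id": pred["id"],  # assumption: key name chosen for illustration
        "prompt": pred["prompt"],
        "test": pred["test_cases"],
        "entry_point": pred["entry_point"],
    }


def build_mbpp_program(pred: dict) -> str:
    """Concatenate generated code with its test cases into one program."""
    return pred["pred"] + "\n" + pred["test_cases"]


def run_in_pool(items: list, worker, num_workers: int = 16) -> list:
    """Apply `worker` to each item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(worker, items))
```

`pool.map` returns results in the same order as the inputs, which keeps each result aligned with the prediction it belongs to.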

Usage

Import these evaluator classes when evaluating code generation models. They are passed as the evaluator parameter to CodeExtractor for end-to-end code evaluation.

Code Reference

Source Location

Signature

from typing import Callable, Tuple

def return_apps_evaluator(timeout: int = 10, debug: bool = False) -> Callable:
    """Return a partial function wrapping APPS check_correctness."""

class HumanEvaluator:
    def __init__(self):
        """Initialize HumanEval evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions on HumanEval.
        Each prediction needs: pred, test_cases, prompt, entry_point, id.
        Returns (predictions_with_results, metrics_dict).
        """

class MBPPEvaluator:
    def __init__(self):
        """Initialize MBPP evaluator."""

    def __call__(self, predictions: list, num_workers: int = 16) -> Tuple[list, dict]:
        """
        Evaluate predictions on MBPP.
        Each prediction needs: pred, test_cases, id.
        Returns (predictions_with_results, metrics_dict).
        """

Import

from post_processors.code.evaluator import HumanEvaluator, MBPPEvaluator, return_apps_evaluator

I/O Contract

Inputs

Name | Type | Required | Description
predictions | list[dict] | Yes | List of prediction dicts with "pred", "test_cases", "id" (and "prompt", "entry_point" for HumanEval)
num_workers | int | No | Number of threads for parallel execution (default 16)
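Because the two evaluators expect different keys, it can help to check prediction dicts before evaluation. The sketch below is a hypothetical pre-check, not part of the module: `missing_keys` and the `benchmark` argument are names introduced here for illustration, while the required key names come from the table above.

```python
COMMON_KEYS = {"pred", "test_cases", "id"}
HUMANEVAL_KEYS = COMMON_KEYS | {"prompt", "entry_point"}


def missing_keys(prediction: dict, benchmark: str) -> set:
    """Return the required keys that `prediction` lacks.

    `benchmark` is "humaneval" or anything else (treated as MBPP);
    this switch is an assumption made for the example.
    """
    required = HUMANEVAL_KEYS if benchmark == "humaneval" else COMMON_KEYS
    return required - set(prediction)
```

An empty set means the dict carries every key listed for that benchmark.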

Outputs

Name | Type | Description
predictions | list[dict] | Input predictions augmented with "res" field (bool or list[bool])
metrics | dict | Contains acc, pass@k, correct, total
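One way the metrics could be derived from the "res" field is sketched below. This is an assumption-laden illustration rather than the module's implementation: `summarize` is a hypothetical name, and treating `correct` as the number of problems whose first sample passes is a guess that the source does not confirm.

```python
def summarize(predictions: list) -> dict:
    """Aggregate per-prediction "res" values into a metrics dict."""
    total = len(predictions)
    first_correct = 0  # problems whose first sample passed
    any_correct = 0    # problems with at least one passing sample
    for item in predictions:
        res = item["res"]
        # "res" may be a single bool or a list of bools (one per sample).
        samples = res if isinstance(res, list) else [res]
        if samples and samples[0]:
            first_correct += 1
        if any(samples):
            any_correct += 1
    return {
        "acc": first_correct / total if total else 0.0,
        "pass@k": any_correct / total if total else 0.0,
        "correct": first_correct,  # assumption: counts first-sample passes
        "total": total,
    }
```

With a single sample per problem the two rates coincide; they diverge only when a later sample succeeds where the first one failed.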

Usage Examples

from post_processors.code.evaluator import HumanEvaluator, MBPPEvaluator

# HumanEval evaluation
human_eval = HumanEvaluator()
predictions = [{
    "pred": "def has_close_elements(numbers, threshold):\n    for i, n1 in enumerate(numbers):\n        for j, n2 in enumerate(numbers):\n            if i != j and abs(n1 - n2) < threshold:\n                return True\n    return False",
    "test_cases": "assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False",
    "prompt": "def has_close_elements(numbers, threshold):\n",
    "entry_point": "has_close_elements",
    "id": 0,
}]
results, metrics = human_eval(predictions, num_workers=4)
print(metrics)  # {"acc": 1.0, "pass@k": 1.0, "correct": 1, "total": 1}

# MBPP evaluation
mbpp_eval = MBPPEvaluator()
predictions = [{
    "pred": "def is_even(n):\n    return n % 2 == 0",
    "test_cases": "assert is_even(4) == True\nassert is_even(3) == False",
    "id": 0,
}]
results, metrics = mbpp_eval(predictions, num_workers=4)
