Principle:Sail sg LongSpec Code Execution Evaluation

From Leeroopedia
Knowledge Sources
Domains: NLP, Evaluation, Code_Generation
Last Updated: 2026-02-14 05:00 GMT

Overview

Algorithmic principle for evaluating code generation models: generated code is executed against test cases in a sandboxed environment, and the outcomes are scored with pass@k metrics.

Description

Code Execution Evaluation verifies model-generated code by running it against the provided test cases and measuring functional correctness. Unlike math evaluation, which compares strings or numbers, code evaluation requires actually executing the code in a sandboxed environment with timeout protection. The evaluation supports three benchmark frameworks: APPS (function-level with input/output test cases), HumanEval (function completion with assertion-based tests and entry points), and MBPP (standalone functions with assertion tests). Multi-threaded execution allows many predictions to be evaluated in parallel.
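The assembly and parallel-execution steps described above can be sketched as follows. This is a minimal illustration, not the LongSpec implementation: the helper names (`build_humaneval_program`, `build_mbpp_program`, `run_program`, `evaluate_parallel`) are hypothetical, the argument layout assumes the public HumanEval and MBPP dataset conventions (a `check(candidate)` test function and a list of assert strings), and a child interpreter with a timeout stands in for a real sandbox.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_humaneval_program(prompt, completion, test, entry_point):
    # HumanEval-style: the test string defines check(candidate),
    # which is then called on the function named by entry_point.
    return f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"


def build_mbpp_program(code, test_list):
    # MBPP-style: a standalone function followed by plain assert statements.
    return code + "\n\n" + "\n".join(test_list) + "\n"


def run_program(program, timeout=5.0):
    # Assumption: a child interpreter with a timeout, NOT a hardened sandbox.
    try:
        done = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # Treated as a failed prediction
    return done.returncode == 0  # Nonzero exit: failed assert or runtime error


def evaluate_parallel(programs, max_workers=4, timeout=5.0):
    # Threads are sufficient because each worker only waits on its child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run_program(p, timeout), programs))
```

Results come back in the same order as the input programs, so they can be matched to problem identifiers by position. A production setup would typically add memory limits and filesystem or network isolation on top of the timeout.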

Usage

Apply this principle when evaluating code generation models on functional correctness benchmarks. Execution-based evaluation is necessary because textual similarity does not capture functional equivalence: two programs can look very different yet behave identically, while a one-token change can break an otherwise correct program.
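A small illustration of why textual similarity is a poor proxy for correctness. The three solutions, the tests, and the helper names are made up for this example; `exec` in a fresh namespace is used only because the snippets are trusted toy code, and it is not a substitute for sandboxed execution.

```python
import difflib

# Reference-style solution.
SOLUTION_A = "def total(xs):\n    return sum(xs)\n"
# Textually different, functionally equivalent.
SOLUTION_B = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
# Textually almost identical to A, functionally wrong.
SOLUTION_C = "def total(xs):\n    return max(xs)\n"

TESTS = "assert total([1, 2, 3]) == 6\nassert total([]) == 0\n"


def text_similarity(a, b):
    # Character-level similarity ratio in [0, 1].
    return difflib.SequenceMatcher(None, a, b).ratio()


def passes(solution, tests):
    # Toy check for trusted snippets only: run solution and tests in one namespace.
    namespace = {}
    try:
        exec(solution + "\n" + tests, namespace)
        return True
    except Exception:
        return False
```

Here `SOLUTION_C` is closer to `SOLUTION_A` in text than `SOLUTION_B` is, yet only `SOLUTION_A` and `SOLUTION_B` pass the tests.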

Theoretical Basis

The evaluation follows an execution-based verification pattern:

# Abstract algorithm (NOT real implementation)
def evaluate_code(prediction, test_cases, timeout):
    # execute_with_timeout is an abstract placeholder for a sandboxed runner.
    program = prediction + "\n" + test_cases  # Newline keeps the tests on their own lines
    try:
        execute_with_timeout(program, timeout)
        return True  # All assertions passed
    except Exception:  # Covers AssertionError, TimeoutError, and runtime errors
        return False

def pass_at_k(predictions_per_problem):
    """At least one of k samples is correct."""
    return any(predictions_per_problem)

Key metrics:

  • acc: Fraction of problems whose first sample passes all tests (greedy decoding)
  • pass@k: Fraction of problems for which at least one of k samples passes all tests
  • Per-difficulty: For APPS, metrics are additionally reported per difficulty level
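The metrics listed above could be aggregated along these lines. This is a sketch under assumptions: the `results` structure, the `summarize` and `mean` helpers, and the sample data are hypothetical, and the difficulty labels follow the public APPS split names (introductory, interview, competition).

```python
from collections import defaultdict


def mean(flags):
    # Average of booleans; True counts as 1.
    return sum(flags) / len(flags)


def summarize(results):
    """Aggregate per-problem outcomes into acc, pass@k, and per-difficulty pass@k.

    `results` is a hypothetical structure: one dict per problem with
    'difficulty' (str) and 'samples' (list of k booleans, first = greedy sample).
    """
    first_sample, any_sample = [], []
    by_difficulty = defaultdict(list)
    for problem in results:
        samples = problem["samples"]
        solved = any(samples)  # pass@k for this problem
        first_sample.append(samples[0])  # acc uses only the first sample
        any_sample.append(solved)
        by_difficulty[problem["difficulty"]].append(solved)
    return {
        "acc": mean(first_sample),
        "pass@k": mean(any_sample),
        "per_difficulty": {d: mean(v) for d, v in by_difficulty.items()},
    }


# Made-up outcomes for four problems with k = 2 samples each.
results = [
    {"difficulty": "introductory", "samples": [True, False]},
    {"difficulty": "introductory", "samples": [False, True]},
    {"difficulty": "interview", "samples": [False, False]},
    {"difficulty": "competition", "samples": [False, False]},
]
report = summarize(results)
```

With this toy data, `acc` counts only the one problem solved by its first sample, while `pass@k` also credits the problem solved by its second sample.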
