Principle:Sail sg LongSpec Code Execution Evaluation

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Evaluation, Code_Generation
Last Updated	2026-02-14 05:00 GMT

Overview

An algorithmic principle for evaluating code generation models: generated code is executed against test cases in a sandboxed environment, and the outcomes are reported as accuracy and pass@k metrics.

Description

Code Execution Evaluation verifies model-generated code by running it against the provided test cases and measuring functional correctness. Unlike math evaluation, which compares strings or numbers, code evaluation has to actually execute the program, so it runs in a sandboxed environment with timeout protection. The evaluation supports three benchmark frameworks: APPS (function-level with input/output test cases), HumanEval (function completion with assertion-based tests and entry points), and MBPP (standalone functions with assertion tests). Multi-threaded execution allows many predictions to be evaluated in parallel.
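The two assertion-based benchmarks differ mainly in how a prediction and its tests are combined into one executable program. The sketch below illustrates that assembly step. The field names (`prompt`, `test`, `entry_point`, `test_list`) follow the public HumanEval and MBPP datasets; the helper names and the sample records are hypothetical, and the repository's own assembly code may differ.

```python
from typing import List


def build_humaneval_program(prompt: str, completion: str, test: str, entry_point: str) -> str:
    """HumanEval: the model completes the function started in `prompt`.

    `test` defines a `check(candidate)` function made of assertions, so the
    program ends by calling it on the function named by `entry_point`.
    """
    return f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"


def build_mbpp_program(code: str, test_list: List[str]) -> str:
    """MBPP: a standalone function followed by a list of assert statements."""
    return code + "\n\n" + "\n".join(test_list) + "\n"


# Tiny hand-written records shaped like the datasets (not real benchmark items).
humaneval_program = build_humaneval_program(
    prompt="def add(a, b):\n",
    completion="    return a + b\n",
    test="def check(candidate):\n    assert candidate(1, 2) == 3\n",
    entry_point="add",
)
mbpp_program = build_mbpp_program(
    code="def square(x):\n    return x * x",
    test_list=["assert square(3) == 9", "assert square(-2) == 4"],
)

if __name__ == "__main__":
    print(humaneval_program)
    print(mbpp_program)
```

Either string can then be handed to the sandboxed runner: the program finishes silently when every assertion holds and raises otherwise.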

Usage

Apply this principle when evaluating code generation models on functional correctness benchmarks. Execution-based evaluation is necessary because textual similarity does not capture functional equivalence: two programs can look very different yet behave identically, and near-identical programs can behave differently.
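To make the gap between textual similarity and functional equivalence concrete, the following toy comparison uses `difflib` from the standard library as a stand-in for a text-based metric. The three candidate programs, the tests, and the helper names are invented for illustration; running tests with in-process `exec` is acceptable only for trusted toy code like this, not for model output.

```python
import difflib

iterative = """
def total(xs):
    result = 0
    for x in xs:
        result += x
    return result
"""

builtin = """
def total(xs):
    return sum(xs)
"""

off_by_one = """
def total(xs):
    result = 0
    for x in xs[1:]:
        result += x
    return result
"""

tests = "assert total([1, 2, 3]) == 6\nassert total([]) == 0\n"


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def passes(program: str) -> bool:
    """Run the tests in-process (trusted toy code only)."""
    try:
        exec(program + "\n" + tests, {})
        return True
    except Exception:
        return False


if __name__ == "__main__":
    for name, candidate in [("builtin", builtin), ("off_by_one", off_by_one)]:
        print(name, round(similarity(iterative, candidate), 2), passes(candidate))
```

The `off_by_one` variant is textually closer to the reference than `builtin` is, yet only `builtin` passes the tests, which is the case a similarity score would rank the wrong way round.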

Theoretical Basis

The evaluation follows an execution-based verification pattern:

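# Abstract algorithm (NOT the real implementation)
import subprocess
import sys

def execute_with_timeout(program, timeout):
    """Run the program in a child interpreter; raise on failure or timeout."""
    # check=True raises CalledProcessError on a non-zero exit (e.g. a failed assert);
    # timeout raises TimeoutExpired. A real sandbox would add stronger isolation.
    subprocess.run([sys.executable, "-c", program], capture_output=True, timeout=timeout, check=True)

def evaluate_code(prediction, test_cases, timeout):
    try:
        execute_with_timeout(prediction + "\n" + test_cases, timeout)
        return True  # All assertions passed
    except Exception:  # Failed assertion, timeout, or any other error
        return False

def pass_at_k(predictions_per_problem):
    """At least one of k samples is correct."""
    return any(predictions_per_problem)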
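The Description mentions multi-threaded execution for evaluating many predictions at once. A minimal sketch of that idea, assuming each prediction has already been combined with its tests into one program string, could look like the following. The names `run_program` and `evaluate_many` and the default worker count are hypothetical, and a child interpreter with a timeout is only a weak substitute for a real sandbox.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_program(program: str, timeout: float) -> bool:
    """Run one program in a child interpreter; True only on a clean exit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # Infinite loops and slow solutions count as failures
    return result.returncode == 0  # A failed assert exits with a non-zero code


def evaluate_many(programs, timeout=5.0, max_workers=4):
    """Check many programs concurrently, preserving input order."""
    # Threads suffice here because each worker mostly waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run_program(p, timeout), programs))


if __name__ == "__main__":
    tests = "\nassert add(1, 2) == 3\n"
    samples = [
        "def add(a, b):\n    return a + b",      # correct
        "def add(a, b):\n    return a - b",      # wrong answer
        "def add(a, b):\n    while True: pass",  # never terminates
    ]
    print(evaluate_many([s + tests for s in samples], timeout=3.0))
```

Because results come back in input order, the booleans for the k samples of one problem can be sliced out and passed directly to `any(...)` to obtain that problem's pass@k outcome.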

Key metrics:

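acc: Fraction of problems whose first sample passes all tests (greedy decoding)
pass@k: Fraction of problems for which at least one of k samples passes all tests
Per-difficulty: APPS labels each problem with a difficulty level, so the metrics are also reported per level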
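The metrics listed above can be computed from a table of per-sample outcomes. The sketch below assumes exactly k samples per problem with the first one being the sample scored for acc; the function name, the result layout, and the key format `pass@k/<label>` are illustrative choices, not the repository's output format.

```python
from collections import defaultdict


def summarize(results, difficulties=None):
    """Aggregate per-problem outcomes into acc, pass@k, and per-difficulty pass@k.

    results: one list of booleans per problem, one entry per sample, with the
             first entry being the sample scored for `acc`.
    difficulties: optional list of labels, parallel to `results`.
    """
    n = len(results)
    summary = {
        "acc": sum(samples[0] for samples in results) / n,
        "pass@k": sum(any(samples) for samples in results) / n,
    }
    if difficulties is not None:
        buckets = defaultdict(list)
        for label, samples in zip(difficulties, results):
            buckets[label].append(any(samples))
        for label, passed in buckets.items():
            summary[f"pass@k/{label}"] = sum(passed) / len(passed)
    return summary


if __name__ == "__main__":
    results = [[True, False], [False, True], [False, False], [True, True]]
    labels = ["introductory", "introductory", "interview", "competition"]
    print(summarize(results, labels))
```

In this toy input, two of four first samples pass (acc 0.5) while three of four problems have at least one passing sample (pass@k 0.75), which shows why pass@k is never lower than acc on the same samples.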
