Principle:Sail sg LongSpec Code Execution Evaluation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Code_Generation |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Algorithmic principle for evaluating code generation models by executing generated code against test cases in a sandboxed environment and reporting pass@k metrics.
Description
Code Execution Evaluation verifies model-generated code by running it against provided test cases and measuring functional correctness. Unlike math evaluation, which compares strings or numbers, code evaluation requires actually executing the code in a sandboxed environment with timeout protection. The evaluation supports three benchmark frameworks: APPS (function-level with input/output test cases), HumanEval (function completion with assertion-based tests and entry points), and MBPP (standalone functions with assertion tests). Multi-threaded execution allows many predictions to be evaluated in parallel.
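The timeout protection and parallel execution described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names `run_in_subprocess` and `evaluate_many` are hypothetical, and a child process gives isolation from the evaluator but is not a full security sandbox.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_subprocess(program: str, timeout: float = 5.0) -> bool:
    """Return True if `program` exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # the child is killed; runaway code counts as a failure
    return result.returncode == 0  # a failed assertion gives a non-zero status


def evaluate_many(programs, workers: int = 4, timeout: float = 5.0):
    """Evaluate programs in parallel; each thread only waits on a child process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_in_subprocess(p, timeout), programs))


# Hypothetical toy inputs: candidate code already concatenated with its tests.
programs = [
    "def add(a, b):\n    return a + b\nassert add(1, 2) == 3",  # passes
    "def add(a, b):\n    return a - b\nassert add(1, 2) == 3",  # fails an assertion
    "while True:\n    pass",                                      # never terminates
]
results = evaluate_many(programs, timeout=3.0)
```

Threads are sufficient here because the work happens in the child processes; the evaluator threads spend their time blocked on `subprocess.run`.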
Usage
Apply this principle when evaluating code generation models on functional correctness benchmarks. Execution-based evaluation is necessary because textual similarity does not capture functional equivalence: two programs can look very different yet behave identically, and two nearly identical programs can behave differently.
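A small illustration of why textual similarity is a poor proxy for correctness. The three snippets and the helper names are made up for this example; `difflib.SequenceMatcher` stands in for any text-similarity measure.

```python
from difflib import SequenceMatcher

# Reference solution, an equivalent rewrite, and a one-character bug.
loop_version = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
builtin_version = "def total(xs):\n    return sum(xs)\n"
buggy_version = "def total(xs):\n    s = 0\n    for x in xs:\n        s -= x\n    return s\n"


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def behaves_correctly(source: str) -> bool:
    """Execute the snippet and test it; only safe for trusted toy code."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"]([1, 2, 3]) == 6


text_sim_equivalent = similarity(loop_version, builtin_version)
text_sim_buggy = similarity(loop_version, buggy_version)
```

Here the buggy snippet is textually closer to the reference than the correct rewrite is, yet only execution reveals which of the two actually works.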
Theoretical Basis
The evaluation follows an execution-based verification pattern:
# Abstract algorithm (NOT real implementation)
def evaluate_code(prediction, test_cases, timeout):
    try:
        execute_with_timeout(prediction + test_cases, timeout)
        return True  # All assertions passed
    except Exception:  # Includes AssertionError and TimeoutError
        return False


def pass_at_k(predictions_per_problem):
    """At least one of k samples is correct."""
    return any(predictions_per_problem)
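The `prediction + test_cases` step depends on the benchmark. A hedged sketch for the two assertion-based formats is shown below. The argument names mirror fields in the public HumanEval and MBPP datasets (`prompt`, `test`, `entry_point`, `test_list`), but the project's loaders may use different ones, and the builder functions are hypothetical. The `passes` helper runs code in-process purely for illustration; real evaluation should use a sandboxed runner with a timeout.

```python
def build_humaneval_program(prompt: str, completion: str, test: str, entry_point: str) -> str:
    """Prompt plus completion forms the function; the test defines check(candidate)."""
    return f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"


def build_mbpp_program(code: str, test_list: list) -> str:
    """A standalone function followed by one assertion per line."""
    return code + "\n\n" + "\n".join(test_list) + "\n"


def passes(program: str) -> bool:
    """In-process stand-in for a sandboxed runner; trusted toy input only."""
    try:
        exec(program, {"__name__": "__eval__"})
        return True
    except Exception:
        return False


humaneval_program = build_humaneval_program(
    prompt='def double(x):\n    """Return twice x."""\n',
    completion="    return 2 * x\n",
    test="def check(candidate):\n    assert candidate(3) == 6\n",
    entry_point="double",
)
mbpp_program = build_mbpp_program(
    code="def square(x):\n    return x * x",
    test_list=["assert square(2) == 4", "assert square(-3) == 9"],
)
wrong_mbpp_program = build_mbpp_program(
    code="def square(x):\n    return x",
    test_list=["assert square(2) == 4"],
)
```

APPS-style problems compare program output against expected output for given inputs rather than running assertions, so they need a different harness that is not shown here.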
Key metrics:
- acc: Accuracy of first sample (greedy decoding)
- pass@k: Whether any of the k samples for a problem passes all of its tests
- Per-difficulty: APPS provides difficulty-stratified metrics
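The metrics above can be aggregated over a dataset roughly as follows. The record layout and function names are assumptions made for this sketch, and the data is invented; the difficulty labels follow the APPS levels. pass@k uses the any-of-k definition given above rather than an unbiased estimator computed from more than k samples.

```python
from collections import defaultdict

# Hypothetical records: per-sample pass/fail flags plus a difficulty label.
records = [
    {"difficulty": "introductory", "passed": [True, True, False]},
    {"difficulty": "introductory", "passed": [False, True, False]},
    {"difficulty": "interview", "passed": [False, False, False]},
    {"difficulty": "competition", "passed": [False, False, True]},
]


def acc(records):
    """Fraction of problems whose first sample passes."""
    return sum(r["passed"][0] for r in records) / len(records)


def pass_at_k(records, k):
    """Fraction of problems where at least one of the first k samples passes."""
    return sum(any(r["passed"][:k]) for r in records) / len(records)


def by_difficulty(records, metric):
    """Apply `metric` separately to each difficulty group."""
    groups = defaultdict(list)
    for r in records:
        groups[r["difficulty"]].append(r)
    return {name: metric(group) for name, group in groups.items()}


overall_acc = acc(records)
overall_pass3 = pass_at_k(records, 3)
per_level_pass3 = by_difficulty(records, lambda group: pass_at_k(group, 3))
```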