Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Hpcaitech ColossalAI Code Reward Testing Util

From Leeroopedia
Revision as of 15:08, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hpcaitech_ColossalAI_Code_Reward_Testing_Util.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Code_Evaluation, RLHF, Reward_Computation
Last Updated 2026-02-09 00:00 GMT

Overview

testing_util.py executes and validates AI-generated code solutions against test cases in a sandboxed environment, supporting both call-based and standard-input evaluation patterns.

Description

This module provides the run_test function, the core entry point for code reward evaluation. It compiles generated code solutions into a RuntimeModule (via pyext), invokes them with test inputs, and compares outputs against expected results using multiple comparison strategies: direct equality, string stripping, float approximation via numpy.allclose, and set-based comparison. A reliability_guard function disables destructive OS operations (fork, kill, file deletion, subprocess, etc.) to create a safety sandbox. The module supports two code evaluation modes via the CODE_TYPE enum: call-based (function invocation with JSON-parsed arguments) and standard-input (stdin/stdout patching). Adapted from the verl/PRIME projects.

Usage

Use this module when computing code-based rewards in RLHF training for code generation tasks. It is invoked by the reward function pipeline to determine whether generated code solutions produce correct outputs, directly feeding into the reward signal for RL training.

Code Reference

Source Location

Key Functions

def run_test(
    in_outs,
    test=None,
    debug=False,
    timeout=15,
    run_all_tests=False,
) -> Tuple[List, Dict]

def reliability_guard(maximum_memory_bytes=None) -> None

def call_method(method, inputs)

def custom_compare_(output, ground_truth) -> bool

def stripped_string_compare(s1, s2) -> bool

def truncatefn(s, length=300) -> str

Key Classes

class CODE_TYPE(Enum):
    call_based = 0
    standard_input = 1

class Capturing(list):
    """Context manager that captures stdout output."""
    def __enter__(self) -> "Capturing"
    def __exit__(self, *args) -> None

Import

from coati.distributed.reward.code_reward.testing_util import run_test, reliability_guard

I/O Contract

Inputs (run_test)

Name Type Required Description
in_outs dict Yes Dictionary with "inputs", "outputs", and optionally "fn_name" keys defining test cases
test str Yes The generated code solution to evaluate
debug bool No Enable verbose debug output (default: False)
timeout int No Timeout in seconds per test case (default: 15)
run_all_tests bool No Whether to continue running after first failure (default: False)

Outputs (run_test)

Name Type Description
results List List of test results: True (pass), -1 (runtime error/timeout), -2 (compilation error)
error_info Dict Error metadata dict with "error", "traceback", "output", "expected", "inputs", or "error_message" keys; empty dict if all tests pass

Usage Examples

from coati.distributed.reward.code_reward.testing_util import run_test

# Call-based test
in_outs = {
    "fn_name": "add",
    "inputs": ["1\n2", "3\n4"],
    "outputs": ["3", "7"],
}
test_code = "def add(a, b): return a + b"
results, error_info = run_test(in_outs, test=test_code, timeout=10)
# results: [True, True], error_info: {}

# Standard input test
in_outs_stdin = {
    "fn_name": None,
    "inputs": ["5"],
    "outputs": ["25"],
}
test_code_stdin = "n = int(input())\nprint(n * n)"
results, error_info = run_test(in_outs_stdin, test=test_code_stdin)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment