Heuristic: OpenAI Evals Event Batching Configuration
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Configuration guidance for tuning event batching and flush intervals when using the HTTP or Snowflake recording backends.
Description
The OpenAI Evals recording system (`evals/record.py`) supports multiple backends: local JSON files, HTTP endpoints, and Snowflake databases. The non-local backends use event batching to reduce network overhead. Three constants control this behavior: `MIN_FLUSH_EVENTS` (100 events before triggering a flush), `MIN_FLUSH_SECONDS` (10 seconds minimum between flushes), and `MAX_SNOWFLAKE_BYTES` (16MB max per Snowflake batch). The HTTP backend has additional CLI-configurable parameters: `--http-batch-size` (default 100) and `--http-fail-percent-threshold` (default 5%).
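The interaction of the two flush constants can be sketched as follows. This is a simplified illustration, not the library's implementation: the real recorder in `evals/record.py` also handles threading and the Snowflake byte limit, and the `BatchingRecorder` class here is hypothetical; only the two constants mirror the source.

```python
import time

# Constants mirroring evals/record.py
MIN_FLUSH_EVENTS = 100
MIN_FLUSH_SECONDS = 10


class BatchingRecorder:
    """Illustrative recorder: buffers events, flushes in batches."""

    def __init__(self):
        self._events = []
        self._last_flush = time.time()
        self.flush_count = 0

    def record_event(self, event):
        self._events.append(event)
        # Flush only when BOTH conditions hold: enough events have
        # accumulated AND the minimum interval since the last flush
        # has elapsed.
        if (
            len(self._events) >= MIN_FLUSH_EVENTS
            and time.time() - self._last_flush >= MIN_FLUSH_SECONDS
        ):
            self.flush()

    def flush(self):
        # The real backends send the batch over HTTP or insert it
        # into Snowflake; this sketch just clears the buffer.
        self._events.clear()
        self._last_flush = time.time()
        self.flush_count += 1
```

Because both conditions must hold, a burst of events immediately after a flush is buffered until the 10-second window reopens, which caps request frequency regardless of throughput.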
Usage
Use this heuristic when configuring the HTTP or Snowflake recording backends for production eval runs, or when encountering event delivery failures or slow logging performance.
The Insight (Rule of Thumb)
- Action: For HTTP recording, set `--http-batch-size` to match your endpoint's optimal batch size.
- Value: Default is `100`. Increase for high-throughput endpoints; decrease if encountering timeout issues.
- Trade-off: Larger batches = fewer HTTP requests but higher risk of data loss per failed request.
- Action: Monitor the HTTP failure rate and adjust `--http-fail-percent-threshold`.
- Value: Default is `5%`. If more than 5% of HTTP batches fail, the recorder raises a `RuntimeError` but still saves events locally.
- Trade-off: Higher threshold = more tolerance for flaky endpoints but risk of incomplete remote logging. Lower threshold = stricter quality guarantee.
- Action: For Snowflake recording, ensure individual events stay well under 16MB.
- Value: `MAX_SNOWFLAKE_BYTES = 16 * 10**6` (16MB per batch).
- Trade-off: Exceeding this limit will cause batch flush failures.
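One way to respect the Snowflake byte limit is to estimate serialized batch size up front and split oversized batches. The sketch below is a hypothetical pre-flight helper, not part of the evals library; only the `MAX_SNOWFLAKE_BYTES` constant mirrors `evals/record.py`.

```python
import json

# Constant mirroring evals/record.py (16MB per batch)
MAX_SNOWFLAKE_BYTES = 16 * 10**6


def split_batch(events, max_bytes=MAX_SNOWFLAKE_BYTES):
    """Yield sub-batches whose total serialized size stays under max_bytes.

    Illustrative helper: estimates each event's size via its JSON
    encoding and starts a new sub-batch before the limit is exceeded.
    """
    batch, size = [], 0
    for event in events:
        event_bytes = len(json.dumps(event).encode("utf-8"))
        if batch and size + event_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(event)
        size += event_bytes
    if batch:
        yield batch
```

A single event larger than the limit is still yielded on its own, which is exactly the case the trade-off above warns about: such a batch will fail at flush time, so keep individual events well under 16MB.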
Reasoning
Event batching amortizes the overhead of network calls across multiple events. Without batching, each eval sample would trigger an individual HTTP request or Snowflake insert, which is prohibitively slow for evals with thousands of samples. The default values (100 events, 10-second minimum interval) are tuned for a balance between latency and throughput. The fail-safe mechanism (falling back to local recording on HTTP failure) ensures that evaluation results are never lost, even if the remote backend is unreliable.
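The fail-safe threshold check described above can be sketched as a simple percentage comparison. The function name and signature here are illustrative, assuming the recorder tracks failed and total batch counts; the actual check lives inside the HTTP recorder in `evals/record.py`.

```python
def check_failure_rate(failed_batches, total_batches, threshold_percent=5):
    """Raise if the HTTP batch failure rate exceeds the threshold.

    Mirrors the behavior described above: events are always saved
    locally first, so the RuntimeError signals incomplete remote
    logging rather than data loss.
    """
    if total_batches == 0:
        return
    fail_percent = 100 * failed_batches / total_batches
    if fail_percent > threshold_percent:
        raise RuntimeError(
            f"{fail_percent:.1f}% of HTTP batches failed "
            f"(threshold: {threshold_percent}%); events were still "
            "recorded locally."
        )
```

Raising only when the rate strictly exceeds the threshold means a run at exactly 5% failures passes under the default, while any higher rate surfaces loudly at the end of the run.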
Code Evidence
Batching constants from `evals/record.py:30-32`:
```python
MIN_FLUSH_EVENTS = 100
MAX_SNOWFLAKE_BYTES = 16 * 10**6
MIN_FLUSH_SECONDS = 10
```
HTTP batch size CLI argument from `evals/cli/oaieval.py:79-83`:
```python
parser.add_argument(
    "--http-batch-size",
    type=int,
    default=100,
    help="Number of events to send in each HTTP request when in HTTP mode.",
)
```
HTTP failure threshold from `evals/cli/oaieval.py:85-89`:
```python
parser.add_argument(
    "--http-fail-percent-threshold",
    type=int,
    default=5,
    help="The acceptable percentage threshold of HTTP requests that can fail.",
)
```
Retry mechanism from `evals/utils/api_utils.py:10-14`:
```python
@backoff.on_predicate(
    wait_gen=backoff.expo,
    max_value=60,
    factor=1.5,
)
```