
Heuristic:Wandb Weave Batch Processing Tuning

From Leeroopedia
Knowledge Sources
Domains Optimization, Performance, Tracing
Last Updated 2026-02-14 12:00 GMT

Overview

Tuning guide for Weave's call batch processor parameters to balance throughput, memory usage, and data durability.

Description

Weave uses an asynchronous batch processing system to send trace data (call starts and ends) to the server. The system pairs call starts with their corresponding ends to form complete calls before sending them as batches. Key tunable parameters include the maximum queue size, batch size, flush timeout, and pending call limits. Understanding these parameters is essential for high-throughput environments where tracing overhead must be minimized, and for debugging scenarios where call data may be lost.

Usage

Use this heuristic when you are experiencing trace data loss or high memory usage from queued calls, or when you need to optimize throughput in a high-volume tracing deployment. Also consult it when debugging why calls appear incomplete or are dropped at shutdown.

The Insight (Rule of Thumb)

  • Action: Configure `WEAVE_MAX_CALLS_QUEUE_SIZE` based on your workload volume.
  • Value: Default is 100,000 items. Set to 0 for unbounded queue growth (risk of OOM).
  • Trade-off: Larger queues allow burst absorption but consume more memory; smaller queues drop items sooner but stay lean.
  • Action: Understand that `MAX_BATCH_SIZE` is 1,000 items per batch; batches the server rejects as too large are automatically split and retried by the client.
  • Value: Err on the side of larger batch sizes in high-velocity environments; HTTP 413 responses trigger an automatic binary split and retry.
  • Trade-off: Larger batches reduce HTTP overhead but increase per-request latency and memory.
  • Action: Budget shutdown time around `FLUSH_TIMEOUT_SECONDS` (60 s): on shutdown, the system waits this long for in-flight calls to pair before dropping unpaired items.
  • Value: 60 seconds default. Unpaired items are logged as warnings then dropped.
  • Trade-off: Longer timeouts delay shutdown; shorter timeouts risk data loss.
  • Action: Enable `WEAVE_ENABLE_DISK_FALLBACK=true` (default) to persist dropped items to `.weave_client_dropped_items_log.jsonl`.
  • Value: Dropped items written to JSONL file instead of being silently lost.
  • Trade-off: Disk I/O overhead vs. data preservation.
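The queue-size and disk-fallback knobs above are exposed as environment variables. A minimal sketch of setting them before initializing Weave (the variable names are those documented on this page; the values are illustrative, not recommendations):

```python
import os

# Set Weave's batch-processor knobs via environment variables before
# weave.init() is called, so the settings are picked up at client creation.
os.environ["WEAVE_MAX_CALLS_QUEUE_SIZE"] = "200000"  # absorb larger bursts (default 100_000)
os.environ["WEAVE_ENABLE_DISK_FALLBACK"] = "true"    # persist dropped items to JSONL (default)
```

Setting `WEAVE_MAX_CALLS_QUEUE_SIZE` to `0` removes the bound entirely, trading the drop risk for unbounded memory growth.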

Reasoning

The batch processor uses a pairing mechanism: call starts are held in a pending cache (TTLCache with 24-hour expiry, max 10,000 pending calls) until their corresponding end events arrive. Once paired, the complete call is enqueued for batch sending. This pairing optimization reduces HTTP requests by 50% compared to sending starts and ends separately. The 24-hour TTL prevents memory leaks from orphaned starts (e.g., crashed processes that never finish calls).
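The pairing mechanism above can be sketched as follows. This is a minimal illustration of the idea, not Weave's actual class: starts wait in a bounded, TTL-evicted cache until the matching end arrives, and only complete pairs are enqueued for batching.

```python
import time
from collections import deque

class CallPairer:
    """Minimal sketch of start/end pairing with TTL eviction (hypothetical
    class; Weave's implementation uses a TTLCache with a 24-hour expiry)."""

    def __init__(self, ttl_seconds=24 * 60 * 60, max_pending=10_000):
        self.ttl = ttl_seconds
        self.max_pending = max_pending
        self.pending = {}     # call_id -> (start_payload, inserted_at)
        self.ready = deque()  # complete (start, end) pairs awaiting batching

    def _evict_expired(self):
        # Drop starts older than the TTL: orphaned starts (e.g. from a
        # crashed process) would otherwise leak memory forever.
        now = time.monotonic()
        for cid in [c for c, (_, t) in self.pending.items() if now - t > self.ttl]:
            del self.pending[cid]

    def on_start(self, call_id, payload):
        self._evict_expired()
        if len(self.pending) >= self.max_pending:
            return False  # caller decides: drop, or fall back to disk
        self.pending[call_id] = (payload, time.monotonic())
        return True

    def on_end(self, call_id, payload):
        start = self.pending.pop(call_id, None)
        if start is not None:
            # One complete call is enqueued instead of two separate events,
            # which is where the ~50% HTTP-request reduction comes from.
            self.ready.append((start[0], payload))
```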

For error handling, the processor uses a SkipIndividualProcessingError pattern: when a retryable error occurs on a batch, items are requeued rather than attempting expensive item-by-item fallback. Non-retryable errors cause items to be logged and dropped.
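The error-handling pattern can be sketched like this (hypothetical function and exception name, not Weave's code): a retryable failure requeues the whole batch rather than falling back to expensive item-by-item processing, while a non-retryable failure logs and drops the items.

```python
import queue

class RetryableError(Exception):
    """Stand-in for a retryable transport error (hypothetical name)."""

def process_batch(batch, send, work_queue, log=print):
    """Sketch of the requeue-on-retryable-error pattern described above."""
    try:
        send(batch)
    except RetryableError:
        # Whole-batch requeue: items get another shot on a later flush.
        for item in batch:
            work_queue.put(item)
    except Exception as exc:
        # Non-retryable: log and drop (disk fallback may capture these).
        log(f"dropping {len(batch)} items: {exc}")
```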

Code Evidence

Batch processor constants from `weave/trace_server_bindings/call_batch_processor.py:36-45`:

# Default limit for pending (unpaired) calls
DEFAULT_MAX_PENDING_CALLS = 10_000
# Max batch size, this can be quite large, as long as traces are small
# Too-large batches are automatically split, and 413s are retried, lets
# err on the side of larger batch sizes for high velocity environments.
MAX_BATCH_SIZE = 1000
# TTL for eager call IDs (24 hours in seconds)
EAGER_CALL_ID_TTL_SECONDS = 24 * 60 * 60
# Timeout for flush: wait this long for in-flight calls to complete before dropping
FLUSH_TIMEOUT_SECONDS = 60
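The "too-large batches are automatically split, and 413s are retried" behavior the comment above describes can be sketched as a recursive binary split (hypothetical helper, not Weave's code):

```python
def send_with_split(batch, post):
    """On an HTTP 413 (payload too large) response, split the batch in
    half and retry each half recursively; singleton batches that still
    fail are left to the caller's error handling."""
    if post(batch) == 413 and len(batch) > 1:
        mid = len(batch) // 2
        send_with_split(batch[:mid], post)
        send_with_split(batch[mid:], post)
```

This is why erring on the side of larger batches is cheap: an oversized batch costs one extra round trip per split level rather than failing outright.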

Queue size configuration from `weave/trace/settings.py:157-163`:

max_calls_queue_size: int = 100_000
"""
Sets the maximum size of the calls queue.  Defaults to 100_000.
Setting a value of 0 means the queue can grow unbounded.
"""

Disk fallback configuration from `weave/trace/settings.py:179-186`:

enable_disk_fallback: bool = True
"""
Toggles disk fallback for dropped items.
If True, items that fail to be processed or are dropped due to queue limits
will be written to disk as a fallback instead of being lost.
"""
