
Heuristic:SqueezeAILab ETS Thread Parallelism Suppression

From Leeroopedia
Knowledge Sources
Domains Optimization, Infrastructure
Last Updated 2026-02-14 02:30 GMT

Overview

Suppress internal library thread parallelism by setting `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and related environment variables to 1, preventing thread contention in multi-threaded SGLang batch inference.

Description

When running multi-threaded inference (via SGLang's `run_batch` with `num_threads`), internal numerical libraries (OpenMP, MKL, OpenBLAS, NUMEXPR) each attempt to spawn their own thread pools. This causes severe thread contention: if 8 SGLang threads each trigger 8 OpenMP threads, 64 threads end up competing for CPU resources. Additionally, HuggingFace tokenizers have their own internal parallelism, which conflicts with external threading.

The fix is to suppress all library-level parallelism by setting each library's thread count to 1 before any imports, forcing single-threaded execution within each SGLang worker thread.
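A minimal sketch of the required ordering in an entry-point module (the commented-out imports are illustrative; the variable list matches the Code Evidence section):

```python
# Suppress library-level thread pools BEFORE any heavy imports.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

# Only now import libraries that read these variables at import time, e.g.:
# import torch
# import numpy as np
# from sentence_transformers import SentenceTransformer
```

The only safe place for these assignments is above every other import in the process's entry module, since transitive imports (e.g. anything that pulls in numpy) can initialize the thread pools early.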

Usage

Apply this heuristic whenever running ETS tree search with multiple SGLang threads (`num_threads > 1` in the YAML config). Failure to suppress parallelism can lead to severe slowdowns, deadlocks, or non-deterministic hangs from thread over-subscription.

The Insight (Rule of Thumb)

  • Action: Set the following environment variables before any library imports:
    • `TOKENIZERS_PARALLELISM=false`
    • `OMP_NUM_THREADS=1`
    • `MKL_NUM_THREADS=1`
    • `OPENBLAS_NUM_THREADS=1`
    • `NUMEXPR_NUM_THREADS=1`
  • Value: All set to `"1"` (or `"false"` for tokenizers).
  • Trade-off: Individual numpy/torch CPU operations within a single thread will be slower (no intra-op parallelism), but overall throughput improves because inter-thread contention is eliminated.

Reasoning

The ETS system uses SGLang's `run_batch` with `num_threads=8` (default config). Each thread independently calls `SentenceTransformer.encode()`, `torch.exp()`, `numpy` clustering, and HuggingFace tokenizers. Without suppression, each of these libraries spawns its own thread pool (typically sized to the CPU core count), yielding N×M threads, where N is the number of SGLang threads and M is the library pool size. On a typical 8-core machine with 8 SGLang threads and 8 OMP threads, this creates 64 competing threads with massive context-switch overhead.
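The over-subscription arithmetic can be made concrete (the core count and pool sizes below are the illustrative values from this page, not measured defaults):

```python
# Worst-case competing threads = application threads x library pool size.
sglang_threads = 8           # num_threads from the ETS YAML config
omp_threads_per_worker = 8   # OpenMP pools typically default to core count
cores = 8                    # the "typical 8-core machine" in the example

total_threads = sglang_threads * omp_threads_per_worker
oversubscription = total_threads / cores

print(total_threads)       # 64 threads
print(oversubscription)    # 8.0x more runnable threads than cores
```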

Setting these variables to 1 ensures each SGLang thread runs its numerical operations serially, while the 8 SGLang threads provide the desired parallelism at the application level.
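The intended layering can be sketched with a stand-in for SGLang's `run_batch` (the `ThreadPoolExecutor`, `worker`, and the toy computation are illustrative, not ETS code):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Serialize library-internal parallelism first (a no-op for the stdlib
# code below, but shown to illustrate the pattern).
os.environ["OMP_NUM_THREADS"] = "1"

def worker(task_id: int) -> int:
    # Each application thread runs its numerical work serially; this
    # stand-in computation replaces encode()/clustering/tokenizer calls.
    return sum(i * i for i in range(1000)) + task_id

# The pool stands in for run_batch(num_threads=8): parallelism lives at
# the application level, one serial numerical pipeline per thread.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(worker, range(8)))

print(len(results))  # 8 results, one per task
```

With this layering, the only concurrency in the process is the 8 application threads, so there is nothing for the OS scheduler to over-subscribe.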

The variables must be set before imports (lines 9-13, before line 17 `import torch`) because libraries read these variables at import time and configure their thread pools once.
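The read-once-at-import behavior can be demonstrated with a toy module (written to a temp file here purely for illustration; real libraries like OpenMP do the equivalent in native initialization code):

```python
import importlib.util
import os
import tempfile

# A toy module that, like OpenMP/MKL, reads its thread setting exactly
# once, at import time.
source = 'import os\nTHREADS = os.environ.get("OMP_NUM_THREADS", "unset")\n'

os.environ["OMP_NUM_THREADS"] = "1"
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(source)
    path = f.name

spec = importlib.util.spec_from_file_location("toy_lib", path)
toy_lib = importlib.util.module_from_spec(spec)
spec.loader.exec_module(toy_lib)
os.unlink(path)

# Changing the variable after import has no effect on the captured value.
os.environ["OMP_NUM_THREADS"] = "64"
print(toy_lib.THREADS)  # still "1": the setting was read at import time
```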

Code Evidence

Thread suppression at module top-level from `rebase.py:9-13`:

```python
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
```

Multi-threaded batch execution from `rebase.py:746`:

```python
states = reward_guided_search.run_batch(input_list_dict, backend=RuntimeEndpoint(args.policy_host), num_threads=paras["num_threads"], progress_bar=True)
```

Default thread count from `hype-parameters/ets_16_math500.yaml:5`:

```yaml
num_threads: 8 # threads in SGLang
```
