
Environment: SqueezeAILab ETS Multi GPU Sglang Runtime

From Leeroopedia
Domains: Infrastructure, LLM_Inference
Last Updated: 2026-02-14 02:30 GMT

Overview

Multi-GPU Linux environment with CUDA, Python 3, SGLang (custom ETS fork), PyTorch, PuLP, and SentenceTransformers for running ETS tree search inference.

Description

This environment provides the full runtime context for the ETS (Efficient Tree Search) inference system. It requires at least two NVIDIA GPUs: one dedicated to the policy (generator) model server and one dedicated to the process reward model (PRM) server, both launched via the custom sglang-ets fork. The main rebase.py search engine connects to these servers over HTTP and orchestrates tree search using ILP-based node selection (PuLP), reward-guided expansion, and optional diversity-aware clustering (SentenceTransformers + scipy).
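The ILP-based node selection can be sketched with PuLP as a 0/1 knapsack: maximize the total score of selected nodes subject to an expansion-cost budget. This is a minimal illustrative sketch, not the actual `rebase.py` formulation; the function name, score/cost inputs, and budget constraint here are assumptions for demonstration.

```python
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

def select_nodes(scores, costs, budget):
    """Pick a subset of candidate nodes maximizing total score,
    subject to a total expansion-cost budget (0/1 knapsack as an ILP)."""
    n = len(scores)
    prob = LpProblem("node_selection", LpMaximize)
    x = [LpVariable(f"x{i}", cat="Binary") for i in range(n)]
    prob += lpSum(scores[i] * x[i] for i in range(n))           # objective: total score
    prob += lpSum(costs[i] * x[i] for i in range(n)) <= budget  # budget constraint
    prob.solve(PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if x[i].value() == 1]
```

For example, `select_nodes([0.9, 0.5, 0.8], [2, 1, 2], 3)` selects the subset with the highest total score whose costs fit within the budget of 3.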

The environment also requires PyTorch for tensor operations (softmax weighting, score computation), NumPy and SciPy for hierarchical clustering, and the HuggingFace transformers library for tokenizer and model config loading.
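The softmax weighting mentioned above can be sketched as follows: a temperature-controlled softmax over child reward scores determines how a fixed expansion width is distributed across children. The function name and the naive rounding are illustrative assumptions, not the exact `rebase.py` logic.

```python
import torch

def softmax_widths(rewards, total_width, temperature=1.0):
    """Distribute a total expansion width across children in proportion
    to a softmax over their reward scores (temperature-controlled)."""
    r = torch.tensor(rewards, dtype=torch.float32)
    weights = torch.softmax(r / temperature, dim=0)
    # Naive rounding for illustration; a real implementation must
    # reconcile rounding error so the widths sum to the budget.
    return torch.round(weights * total_width).int().tolist()
```

With equal rewards, e.g. `softmax_widths([1.0, 1.0], 4)`, the budget splits evenly into `[2, 2]`.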

Usage

Use this environment for all ETS tree search experiments and model serving. It is the mandatory prerequisite for running the Sglang_Launch_Policy_Server, Sglang_Launch_Reward_Server, Reward_Guided_Search, Tree_Select_Softmax_Costmodel, Tree_Expand, and Result_Serialization implementations.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Scripts use bash; `CUDA_VISIBLE_DEVICES` for GPU pinning |
| Hardware | 2x NVIDIA GPUs | GPU 0: policy model, GPU 1: reward model (configurable) |
| Hardware | Sufficient VRAM per GPU | Must fit the policy model (GPU 0) and reward model (GPU 1) respectively |
| Disk | Storage for model weights + experiment outputs | Model repos downloaded locally; results written to `exp_results/` |

Dependencies

System Packages

  • NVIDIA CUDA toolkit (compatible with PyTorch build)
  • `tmux` or equivalent (for running multiple server processes)

Python Packages

  • `sglang` (custom ETS fork from `https://github.com/chooper1/sglang-ets`)
  • `outlines==0.0.44` (pinned)
  • `torch` (PyTorch)
  • `transformers` (HuggingFace — `AutoTokenizer`, `AutoConfig`)
  • `pulp` (ILP solver — `LpMaximize`, `LpProblem`, `LpVariable`, `PULP_CBC_CMD`)
  • `sentence-transformers` (embedding model for diversity clustering)
  • `scipy` (hierarchical clustering: `linkage`, `fcluster`)
  • `numpy`
  • `requests`
  • `pyyaml`
  • `sympy` (used by evaluation grader)
  • `pylatexenc` (LaTeX-to-text conversion in grader)
  • `regex` (used by answer extraction)
  • `tqdm`

Credentials

No API keys or credentials are required. All models are served locally from downloaded checkpoints. The policy and reward model paths are configured in the shell scripts (`run_policy.sh`, `run_reward.sh`) as local filesystem paths.

Quick Install

# Clone the repository
git clone https://github.com/SqueezeAILab/ETS.git
cd ETS

# Clone and install the custom sglang fork
git clone https://github.com/chooper1/sglang-ets.git
cd sglang-ets/python
pip install .
pip install outlines==0.0.44

# Install remaining Python dependencies
cd ../..
pip install torch transformers pulp sentence-transformers scipy numpy requests pyyaml sympy pylatexenc regex tqdm
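After installation, a quick sanity check can confirm that the key packages import cleanly. This snippet is a convenience sketch, not part of the ETS repository; the module list is taken from the dependency list above (note the import names `yaml` for `pyyaml` and `sentence_transformers` for `sentence-transformers`).

```python
# Sanity-check that the key Python dependencies are importable.
import importlib.util

required = ["sglang", "torch", "transformers", "pulp",
            "sentence_transformers", "scipy", "numpy",
            "requests", "yaml", "sympy", "pylatexenc", "regex", "tqdm"]

missing = [m for m in required if importlib.util.find_spec(m) is None]
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all dependencies found")
```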

Code Evidence

SGLang import and server endpoint usage from `rebase.py:5`:

from sglang import function, gen, RuntimeEndpoint, system, user, assistant

Thread parallelism suppression at startup from `rebase.py:9-13`:

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

GPU pinning for policy server from `scripts/run_policy.sh:10`:

CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code

GPU pinning for reward server from `scripts/run_reward.sh:11`:

CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code --mem-fraction-static 0.85

ILP solver dependency from `rebase.py:20`:

from pulp import LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

SentenceTransformer initialization from `rebase.py:733`:

multimodel = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath', device=device)
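The diversity-aware clustering built on these embeddings can be sketched with SciPy's hierarchical clustering. In the real pipeline the embeddings come from the SentenceTransformer model shown above; here the function signature, linkage method, and distance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_candidates(embeddings, distance_threshold=0.5):
    """Group candidate completions by embedding similarity using
    average-linkage hierarchical clustering with cosine distance.
    Returns an integer cluster label (starting at 1) per candidate."""
    Z = linkage(embeddings, method="average", metric="cosine")
    return fcluster(Z, t=distance_threshold, criterion="distance")
```

Candidates landing in the same cluster are near-duplicates; a diversity-aware search can then spread its expansion budget across distinct clusters rather than redundant completions.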

Tokenizer usage from `rebase.py:105-107`:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(config._name_or_path)
num_root_tokens = len(tokenizer(root_state.text()))

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'sglang'` | sglang-ets fork not installed | `cd sglang-ets/python && pip install .` |
| `Connection refused` on policy/reward host | SGLang servers not running | Start servers first: `bash scripts/run_policy.sh` and `bash scripts/run_reward.sh` |
| `CUDA_VISIBLE_DEVICES` not isolating GPUs | Wrong GPU index | Verify GPU indices with `nvidia-smi`; edit `CUDA_VISIBLE_DEVICES` in the shell scripts |
| `ModuleNotFoundError: No module named 'pulp'` | PuLP not installed | `pip install pulp` |
| `ImportError: sentence_transformers` | SentenceTransformers not installed | `pip install sentence-transformers`; only needed when `lambdas > 0` in config |
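A simple readiness probe avoids the `Connection refused` failure mode above: check that both servers answer HTTP before launching the search engine. This is a generic reachability sketch, not an SGLang API call; the URLs are placeholders, and any HTTP response (even a 404) indicates the server process is up and listening.

```python
import requests

def server_ready(base_url, timeout=2.0):
    """Return True if an HTTP server answers at base_url; False if the
    connection is refused or times out."""
    try:
        requests.get(base_url, timeout=timeout)
        return True
    except requests.exceptions.RequestException:
        return False
```

For example, check `server_ready("http://127.0.0.1:<policy_port>")` and the reward-server port (whatever ports `run_policy.sh` and `run_reward.sh` configure) before running `rebase.py`.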

Compatibility Notes

  • GPU requirement: The system requires at minimum two separate NVIDIA GPUs for the policy and reward model servers. Single-GPU setups are not supported by the default scripts.
  • Tensor parallelism: Both server scripts default to `tp-size=1` (single GPU per model). Larger models may require increasing this and adjusting `CUDA_VISIBLE_DEVICES`.
  • Memory fraction: The reward server uses `--mem-fraction-static 0.85` to leave headroom for the collocated embedding model on the same GPU.
  • outlines version: The specific version `outlines==0.0.44` is pinned in the installation instructions; other versions may cause compatibility issues with the sglang fork.
