
Environment: SqueezeAILab ETS Multi GPU Sglang Runtime

From Leeroopedia
Domains: Infrastructure, LLM_Inference
Last Updated: 2026-02-14 02:30 GMT

Overview

Multi-GPU Linux environment with CUDA, Python 3, SGLang (custom ETS fork), PyTorch, PuLP, and SentenceTransformers for running ETS tree search inference.

Description

This environment provides the full runtime context for the ETS (Efficient Tree Search) inference system. It requires at least two NVIDIA GPUs: one dedicated to the policy (generator) model server and one dedicated to the process reward model (PRM) server, both launched via the custom sglang-ets fork. The main rebase.py search engine connects to these servers over HTTP and orchestrates tree search using ILP-based node selection (PuLP), reward-guided expansion, and optional diversity-aware clustering (SentenceTransformers + scipy).
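The ILP-based node selection can be sketched with PuLP as a 0/1 knapsack: maximize the total score of selected nodes subject to an expansion-cost budget. This is a minimal illustrative sketch, not the actual `rebase.py` formulation; the function name, score/cost inputs, and budget constraint here are assumptions for demonstration.

```python
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

def select_nodes(scores, costs, budget):
    """Pick a subset of candidate nodes maximizing total score,
    subject to a total expansion-cost budget (0/1 knapsack as an ILP)."""
    n = len(scores)
    prob = LpProblem("node_selection", LpMaximize)
    x = [LpVariable(f"x{i}", cat="Binary") for i in range(n)]
    prob += lpSum(scores[i] * x[i] for i in range(n))           # objective: total score
    prob += lpSum(costs[i] * x[i] for i in range(n)) <= budget  # budget constraint
    prob.solve(PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if x[i].value() == 1]
```

For example, `select_nodes([0.9, 0.5, 0.8], [2, 1, 2], 3)` selects the subset with the highest total score whose costs fit within the budget of 3.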

The environment also requires PyTorch for tensor operations (softmax weighting, score computation), NumPy and SciPy for hierarchical clustering, and the HuggingFace transformers library for tokenizer and model config loading.
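The softmax weighting mentioned above can be sketched as follows: a temperature-controlled softmax over child reward scores determines how a fixed expansion width is distributed across children. The function name and the naive rounding are illustrative assumptions, not the exact `rebase.py` logic.

```python
import torch

def softmax_widths(rewards, total_width, temperature=1.0):
    """Distribute a total expansion width across children in proportion
    to a softmax over their reward scores (temperature-controlled)."""
    r = torch.tensor(rewards, dtype=torch.float32)
    weights = torch.softmax(r / temperature, dim=0)
    # Naive rounding for illustration; a real implementation must
    # reconcile rounding error so the widths sum to the budget.
    return torch.round(weights * total_width).int().tolist()
```

With equal rewards, e.g. `softmax_widths([1.0, 1.0], 4)`, the budget splits evenly into `[2, 2]`.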

Usage

Use this environment for all ETS tree search experiments and model serving. It is the mandatory prerequisite for running the Sglang_Launch_Policy_Server, Sglang_Launch_Reward_Server, Reward_Guided_Search, Tree_Select_Softmax_Costmodel, Tree_Expand, and Result_Serialization implementations.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Scripts use bash; `CUDA_VISIBLE_DEVICES` for GPU pinning |
| Hardware | 2x NVIDIA GPUs | GPU 0: policy model, GPU 1: reward model (configurable) |
| Hardware | Sufficient VRAM per GPU | Must fit the policy model (GPU 0) and reward model (GPU 1) respectively |
| Disk | Storage for model weights + experiment outputs | Model repos downloaded locally; results written to `exp_results/` |

Dependencies

System Packages

  • NVIDIA CUDA toolkit (compatible with PyTorch build)
  • `tmux` or equivalent (for running multiple server processes)

Python Packages

  • `sglang` (custom ETS fork from `https://github.com/chooper1/sglang-ets`)
  • `outlines==0.0.44` (pinned)
  • `torch` (PyTorch)
  • `transformers` (HuggingFace — `AutoTokenizer`, `AutoConfig`)
  • `pulp` (ILP solver — `LpMaximize`, `LpProblem`, `LpVariable`, `PULP_CBC_CMD`)
  • `sentence-transformers` (embedding model for diversity clustering)
  • `scipy` (hierarchical clustering: `linkage`, `fcluster`)
  • `numpy`
  • `requests`
  • `pyyaml`
  • `sympy` (used by evaluation grader)
  • `pylatexenc` (LaTeX-to-text conversion in grader)
  • `regex` (used by answer extraction)
  • `tqdm`

Credentials

No API keys or credentials are required. All models are served locally from downloaded checkpoints. The policy and reward model paths are configured in the shell scripts (`run_policy.sh`, `run_reward.sh`) as local filesystem paths.

Quick Install

# Clone the repository
git clone https://github.com/SqueezeAILab/ETS.git
cd ETS

# Clone and install the custom sglang fork
git clone https://github.com/chooper1/sglang-ets.git
cd sglang-ets/python
pip install .
pip install outlines==0.0.44

# Install remaining Python dependencies
cd ../..
pip install torch transformers pulp sentence-transformers scipy numpy requests pyyaml sympy pylatexenc regex tqdm
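After installation, a quick sanity check can confirm that the key packages import cleanly. This snippet is a convenience sketch, not part of the ETS repository; the module list is taken from the dependency list above (note the import names `yaml` for `pyyaml` and `sentence_transformers` for `sentence-transformers`).

```python
# Sanity-check that the key Python dependencies are importable.
import importlib.util

required = ["sglang", "torch", "transformers", "pulp",
            "sentence_transformers", "scipy", "numpy",
            "requests", "yaml", "sympy", "pylatexenc", "regex", "tqdm"]

missing = [m for m in required if importlib.util.find_spec(m) is None]
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all dependencies found")
```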

Code Evidence

SGLang import and server endpoint usage from `rebase.py:5`:

from sglang import function, gen, RuntimeEndpoint, system, user, assistant

Thread parallelism suppression at startup from `rebase.py:9-13`:

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

GPU pinning for policy server from `scripts/run_policy.sh:10`:

CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code

GPU pinning for reward server from `scripts/run_reward.sh:11`:

CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code --mem-fraction-static 0.85

ILP solver dependency from `rebase.py:20`:

from pulp import LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

SentenceTransformer initialization from `rebase.py:733`:

multimodel = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath', device=device)
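The diversity-aware clustering built on these embeddings can be sketched with SciPy's hierarchical clustering. In the real pipeline the embeddings come from the SentenceTransformer model shown above; here the function signature, linkage method, and distance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_candidates(embeddings, distance_threshold=0.5):
    """Group candidate completions by embedding similarity using
    average-linkage hierarchical clustering with cosine distance.
    Returns an integer cluster label (starting at 1) per candidate."""
    Z = linkage(embeddings, method="average", metric="cosine")
    return fcluster(Z, t=distance_threshold, criterion="distance")
```

Candidates landing in the same cluster are near-duplicates; a diversity-aware search can then spread its expansion budget across distinct clusters rather than redundant completions.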

Tokenizer usage from `rebase.py:105-107`:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(config._name_or_path)
num_root_tokens = len(tokenizer(root_state.text()))

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'sglang'` | sglang-ets fork not installed | `cd sglang-ets/python && pip install .` |
| `Connection refused` on policy/reward host | SGLang servers not running | Start servers first: `bash scripts/run_policy.sh` and `bash scripts/run_reward.sh` |
| `CUDA_VISIBLE_DEVICES` not isolating GPUs | Wrong GPU index | Verify GPU indices with `nvidia-smi`; edit `CUDA_VISIBLE_DEVICES` in the shell scripts |
| `ModuleNotFoundError: No module named 'pulp'` | PuLP not installed | `pip install pulp` |
| `ImportError: sentence_transformers` | SentenceTransformers not installed | `pip install sentence-transformers`; only needed when `lambdas > 0` in config |
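A simple readiness probe avoids the `Connection refused` failure mode above: check that both servers answer HTTP before launching the search engine. This is a generic reachability sketch, not an SGLang API call; the URLs are placeholders, and any HTTP response (even a 404) indicates the server process is up and listening.

```python
import requests

def server_ready(base_url, timeout=2.0):
    """Return True if an HTTP server answers at base_url; False if the
    connection is refused or times out."""
    try:
        requests.get(base_url, timeout=timeout)
        return True
    except requests.exceptions.RequestException:
        return False
```

For example, check `server_ready("http://127.0.0.1:<policy_port>")` and the reward-server port (whatever ports `run_policy.sh` and `run_reward.sh` configure) before running `rebase.py`.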

Compatibility Notes

  • GPU requirement: The system requires at minimum two separate NVIDIA GPUs for the policy and reward model servers. Single-GPU setups are not supported by the default scripts.
  • Tensor parallelism: Both server scripts default to `tp-size=1` (single GPU per model). Larger models may require increasing this and adjusting `CUDA_VISIBLE_DEVICES`.
  • Memory fraction: The reward server uses `--mem-fraction-static 0.85` to leave headroom for the collocated embedding model on the same GPU.
  • outlines version: The specific version `outlines==0.0.44` is pinned in the installation instructions; other versions may cause compatibility issues with the sglang fork.
