Environment:SqueezeAILab ETS Multi GPU Sglang Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-14 02:30 GMT |
Overview
Multi-GPU Linux environment with CUDA, Python 3, SGLang (custom ETS fork), PyTorch, PuLP, and SentenceTransformers for running ETS tree search inference.
Description
This environment provides the full runtime context for the ETS (Efficient Tree Search) inference system. It requires at least two NVIDIA GPUs: one dedicated to the policy (generator) model server and one dedicated to the process reward model (PRM) server, both launched via the custom sglang-ets fork. The main rebase.py search engine connects to these servers over HTTP and orchestrates tree search using ILP-based node selection (PuLP), reward-guided expansion, and optional diversity-aware clustering (SentenceTransformers + scipy).
The environment also requires PyTorch for tensor operations (softmax weighting, score computation), NumPy and SciPy for hierarchical clustering, and the HuggingFace transformers library for tokenizer and model config loading.
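The softmax weighting mentioned above can be sketched as follows. This is an illustrative helper only, not the ETS implementation: the function name, temperature parameter, and remainder-rounding policy are assumptions used to show how PRM rewards might be converted into per-node expansion budgets.

```python
import torch

def allocate_widths(rewards, total_width, temperature=1.0):
    # Hypothetical helper: softmax-weight per-node rewards into
    # integer expansion budgets that sum to total_width.
    weights = torch.softmax(torch.tensor(rewards) / temperature, dim=0)
    raw = weights * total_width
    widths = torch.floor(raw).to(torch.int64)
    # Hand out the leftover budget to the nodes with the largest
    # fractional remainders so the total is preserved exactly.
    remainder = total_width - int(widths.sum())
    if remainder > 0:
        top = torch.argsort(raw - widths.float(), descending=True)[:remainder]
        widths[top] += 1
    return widths.tolist()

print(allocate_widths([0.9, 0.5, 0.1], total_width=8))
```

Higher-reward nodes receive proportionally more of the fixed expansion budget, which is the intuition behind reward-guided width allocation.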
Usage
Use this environment for all ETS tree search experiments and model serving. It is the mandatory prerequisite for running the Sglang_Launch_Policy_Server, Sglang_Launch_Reward_Server, Reward_Guided_Search, Tree_Select_Softmax_Costmodel, Tree_Expand, and Result_Serialization implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Scripts use bash; CUDA_VISIBLE_DEVICES for GPU pinning |
| Hardware | 2x NVIDIA GPUs | GPU 0: policy model, GPU 1: reward model (configurable) |
| Hardware | Sufficient VRAM per GPU | Must fit the policy model (GPU 0) and reward model (GPU 1) respectively |
| Disk | Storage for model weights + experiment outputs | Model repos downloaded locally; results written to `exp_results/` |
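Before launching both servers, it is worth confirming that at least two GPUs are visible to PyTorch. A quick check (a convenience snippet, not part of the repo):

```python
import torch

# The default scripts pin the policy server to GPU 0 and the
# reward server to GPU 1, so at least two devices must be visible.
n = torch.cuda.device_count()
print(f"visible CUDA devices: {n}")
if n < 2:
    print("warning: default scripts expect GPU 0 (policy) and GPU 1 (reward)")
```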
Dependencies
System Packages
- NVIDIA CUDA toolkit (compatible with PyTorch build)
- `tmux` or equivalent (for running multiple server processes)
Python Packages
- `sglang` (custom ETS fork from `https://github.com/chooper1/sglang-ets`)
- `outlines==0.0.44` (pinned)
- `torch` (PyTorch)
- `transformers` (HuggingFace — `AutoTokenizer`, `AutoConfig`)
- `pulp` (ILP solver — `LpMaximize`, `LpProblem`, `LpVariable`, `PULP_CBC_CMD`)
- `sentence-transformers` (embedding model for diversity clustering)
- `scipy` (hierarchical clustering: `linkage`, `fcluster`)
- `numpy`
- `requests`
- `pyyaml`
- `sympy` (used by evaluation grader)
- `pylatexenc` (LaTeX-to-text conversion in grader)
- `regex` (used by answer extraction)
- `tqdm`
Credentials
No API keys or credentials are required. All models are served locally from downloaded checkpoints. The policy and reward model paths are configured in the shell scripts (`run_policy.sh`, `run_reward.sh`) as local filesystem paths.
Quick Install
# Clone the repository
git clone https://github.com/SqueezeAILab/ETS.git
cd ETS
# Clone and install the custom sglang fork
git clone https://github.com/chooper1/sglang-ets.git
cd sglang-ets/python
pip install .
pip install outlines==0.0.44
# Install remaining Python dependencies
cd ../..
pip install torch transformers pulp sentence-transformers scipy numpy requests pyyaml sympy pylatexenc regex tqdm
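After installation, a quick import check can confirm that the Python dependencies resolve. This is a convenience sketch, not part of the ETS repo; note that `pyyaml` imports as `yaml`:

```python
import importlib.util

# Report any runtime packages that failed to install.
required = ["sglang", "outlines", "torch", "transformers", "pulp",
            "sentence_transformers", "scipy", "numpy", "requests",
            "yaml", "sympy", "pylatexenc", "regex", "tqdm"]
missing = [m for m in required if importlib.util.find_spec(m) is None]
print("missing packages:", ", ".join(missing) or "none")
```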
Code Evidence
SGLang import and server endpoint usage from `rebase.py:5`:
from sglang import function, gen, RuntimeEndpoint, system, user, assistant
Thread parallelism suppression at startup from `rebase.py:9-13`:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
GPU pinning for policy server from `scripts/run_policy.sh:10`:
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code
GPU pinning for reward server from `scripts/run_reward.sh:11`:
CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code --mem-fraction-static 0.85
ILP solver dependency from `rebase.py:20`:
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD
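The PuLP classes imported above support a budget-constrained selection ILP. The toy instance below is illustrative only: the variable names, reward/cost values, and the exact objective are assumptions, not the ETS cost model, but it shows the same ingredients (binary selection variables, a linear objective, a budget constraint, and the bundled CBC solver).

```python
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

# Toy instance: pick nodes maximizing total reward under a token-cost budget.
rewards = [0.9, 0.7, 0.4]   # hypothetical per-node PRM scores
costs = [30, 20, 25]        # hypothetical per-node token costs
budget = 50

prob = LpProblem("node_selection", LpMaximize)
x = [LpVariable(f"x{i}", cat="Binary") for i in range(len(rewards))]
prob += lpSum(r * xi for r, xi in zip(rewards, x))           # objective
prob += lpSum(c * xi for c, xi in zip(costs, x)) <= budget   # budget constraint
prob.solve(PULP_CBC_CMD(msg=0))
selected = [i for i, xi in enumerate(x) if xi.value() == 1]
print(selected)
```

Here the solver selects nodes 0 and 1 (total cost 50, total reward 1.6), rejecting the combination with node 2, which would exceed the budget.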
SentenceTransformer initialization from `rebase.py:733`:
multimodel = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath', device=device)
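The scipy side of the diversity-aware clustering can be sketched as below. Stand-in random embeddings replace the SentenceTransformer output so the snippet is self-contained; the linkage method and distance threshold are illustrative assumptions, not the ETS configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in embeddings: in ETS these would come from the
# SentenceTransformer encoding of candidate continuations.
rng = np.random.default_rng(0)
a, b = np.ones(8), -np.ones(8)
emb = np.vstack([a + rng.normal(0, 0.05, (4, 8)),   # one tight group
                 b + rng.normal(0, 0.05, (4, 8))])  # a second, far away

# Average-linkage hierarchical clustering on cosine distance,
# cut into flat clusters at an illustrative threshold.
Z = linkage(emb, method="average", metric="cosine")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

Candidates that land in the same cluster are near-duplicates; diversity-aware selection can then spread the expansion budget across clusters rather than within one.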
Tokenizer usage from `rebase.py:105-107`:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(config._name_or_path)
num_root_tokens = len(tokenizer(root_state.text()))
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'sglang'` | sglang-ets fork not installed | `cd sglang-ets/python && pip install .` |
| `Connection refused` on policy/reward host | SGLang servers not running | Start servers first: `bash scripts/run_policy.sh` and `bash scripts/run_reward.sh` |
| `CUDA_VISIBLE_DEVICES` not isolating GPUs | Wrong GPU index | Verify GPU indices with `nvidia-smi`; edit `CUDA_VISIBLE_DEVICES` in shell scripts |
| `ModuleNotFoundError: No module named 'pulp'` | PuLP not installed | `pip install pulp` |
| `ImportError: sentence_transformers` | SentenceTransformers not installed | `pip install sentence-transformers`; only needed when `lambdas > 0` in config |
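Before debugging a `Connection refused` error, a small reachability probe can confirm whether a server is actually up. This is a hypothetical helper, not part of the repo; `/get_model_info` is the stock SGLang info endpoint, so adjust the path if the fork differs.

```python
import requests

def server_ready(url, timeout=2.0):
    # Return True if an SGLang server responds at `url`.
    try:
        return requests.get(f"{url}/get_model_info", timeout=timeout).ok
    except requests.RequestException:
        return False

# Probe the default local port used in the examples (assumed value).
print(server_ready("http://127.0.0.1:30000"))
```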
Compatibility Notes
- GPU requirement: The default scripts require at least two separate NVIDIA GPUs, one for the policy server and one for the reward server. Single-GPU setups are not supported without modifying the scripts.
- Tensor parallelism: Both server scripts default to `tp-size=1` (single GPU per model). Larger models may require increasing this and adjusting `CUDA_VISIBLE_DEVICES`.
- Memory fraction: The reward server uses `--mem-fraction-static 0.85` to leave headroom for the embedding model co-located on the same GPU.
- outlines version: The specific version `outlines==0.0.44` is pinned in the installation instructions; other versions may cause compatibility issues with the sglang fork.
Related Pages
- Implementation:SqueezeAILab_ETS_Sglang_ETS_Installation
- Implementation:SqueezeAILab_ETS_Sglang_Launch_Policy_Server
- Implementation:SqueezeAILab_ETS_Sglang_Launch_Reward_Server
- Implementation:SqueezeAILab_ETS_Reward_Guided_Search
- Implementation:SqueezeAILab_ETS_Tree_Select_Softmax_Costmodel
- Implementation:SqueezeAILab_ETS_Tree_Expand
- Implementation:SqueezeAILab_ETS_Result_Serialization