Implementation: SqueezeAILab ETS SGLang Launch Policy Server
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Serving |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
Concrete tool for launching an SGLang HTTP server hosting the policy (generator) model on a dedicated GPU.
Description
This shell script launches the SGLang inference server for the policy model. It binds the server to a specific GPU using CUDA_VISIBLE_DEVICES, configures tensor parallelism, and starts serving on the specified port. The server provides an HTTP API that the tree search engine (rebase.py) connects to via RuntimeEndpoint.
Usage
Run this script in a separate terminal or background process before starting the ETS tree search. The server must be fully initialized (model weights loaded into GPU memory and the HTTP endpoint responding) before rebase.py is invoked.
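One way to confirm the server is ready before launching rebase.py is to poll its /get_model_info endpoint until it answers. The sketch below is illustrative (the helper name and timeout defaults are not part of the ETS scripts):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll the SGLang server's /get_model_info endpoint until it responds.

    Returns True once the endpoint answers, False if timeout_s elapses first.
    Helper name and defaults are illustrative, not part of run_policy.sh.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful response means the model is loaded and serving.
            with urllib.request.urlopen(base_url + "/get_model_info", timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(poll_s)  # model still loading; retry shortly
    return False


# Example: block until the policy server from run_policy.sh is up.
# wait_for_server("http://localhost:30000")
```

Large models can take several minutes to load, so a generous timeout is safer than a fixed sleep.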
Code Reference
Source Location
- Repository: ETS
- File: scripts/run_policy.sh
- Lines: 1-10
Signature
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
--model-path $MODEL_REPO \
--port $PORT \
--tp-size $tensor_parellel_size \
--trust-remote-code
Import
# Client-side connection from rebase.py:
from sglang import RuntimeEndpoint
policy_endpoint = RuntimeEndpoint("http://localhost:30000")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| MODEL_REPO | str | Yes | Path to policy model weights (HuggingFace model ID or local directory) |
| PORT | int | Yes | HTTP port to serve on (30000 in the examples below) |
| tensor_parellel_size | int | Yes | Number of GPUs for tensor parallelism (1 in the default example) |
| CUDA_VISIBLE_DEVICES | str | Yes | Comma-separated GPU device ID(s), e.g. "0" or "0,1" |
Outputs
| Name | Type | Description |
|---|---|---|
| HTTP Server | Service | Running SGLang server on specified port accepting generation requests |
| /get_model_info | HTTP endpoint | Returns model metadata including model_path (used by rebase.py to load AutoConfig) |
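The /get_model_info endpoint is how rebase.py recovers the served model's path for loading its AutoConfig. A hedged sketch of fetching that metadata over plain HTTP (the get_model_info helper below is illustrative, not code from the ETS repository; only the endpoint and its model_path field are documented above):

```python
import json
import urllib.request


def get_model_info(base_url: str) -> dict:
    """GET /get_model_info from a running SGLang server and return the JSON body.

    The returned dict includes a "model_path" field, which rebase.py passes to
    transformers.AutoConfig.from_pretrained to recover model configuration.
    Illustrative helper, not part of the ETS repository.
    """
    with urllib.request.urlopen(base_url + "/get_model_info", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (assumes the server from run_policy.sh is already up):
# info = get_model_info("http://localhost:30000")
# config = AutoConfig.from_pretrained(info["model_path"])
```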
Usage Examples
Default Configuration
# Launch policy model on GPU 0, port 30000
MODEL_REPO="path/to/llemma-7b"
PORT=30000
tensor_parellel_size=1
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
--model-path $MODEL_REPO \
--port $PORT \
--tp-size $tensor_parellel_size \
--trust-remote-code
Multi-GPU Configuration
# Launch larger model with tensor parallelism across 2 GPUs
MODEL_REPO="path/to/llama-3-70b"
PORT=30000
tensor_parellel_size=2
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
--model-path $MODEL_REPO \
--port $PORT \
--tp-size $tensor_parellel_size \
--trust-remote-code