Implementation: SqueezeAILab ETS SGLang Launch Policy Server
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Serving |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
Concrete tool for launching an SGLang HTTP server hosting the policy (generator) model on a dedicated GPU.
Description
This shell script launches the SGLang inference server for the policy model. It binds the server to a specific GPU using CUDA_VISIBLE_DEVICES, configures tensor parallelism, and starts serving on the specified port. The server provides an HTTP API that the tree search engine (rebase.py) connects to via RuntimeEndpoint.
Usage
Run this script in a separate terminal or background process before starting the ETS tree search. The server must be fully initialized (model weights loaded into GPU memory and the HTTP endpoint responding) before rebase.py is invoked.
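One way to confirm the server is ready before launching rebase.py is to poll its /get_model_info endpoint until it answers. The sketch below is illustrative (the helper name and timeout defaults are not part of the ETS scripts):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll the SGLang server's /get_model_info endpoint until it responds.

    Returns True once the endpoint answers, False if timeout_s elapses first.
    Helper name and defaults are illustrative, not part of run_policy.sh.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful response means the model is loaded and serving.
            with urllib.request.urlopen(base_url + "/get_model_info", timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(poll_s)  # model still loading; retry shortly
    return False


# Example: block until the policy server from run_policy.sh is up.
# wait_for_server("http://localhost:30000")
```

Large models can take several minutes to load, so a generous timeout is safer than a fixed sleep.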
Code Reference
Source Location
- Repository: ETS
- File: scripts/run_policy.sh
- Lines: 1-10
Signature
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
--model-path $MODEL_REPO \
--port $PORT \
--tp-size $tensor_parellel_size \
--trust-remote-code
Import
# Client-side connection from rebase.py:
from sglang import RuntimeEndpoint
policy_endpoint = RuntimeEndpoint("http://localhost:30000")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| MODEL_REPO | str | Yes | Path to policy model weights (HuggingFace model ID or local directory) |
| PORT | int | Yes | HTTP port to serve on (30000 in the examples below) |
| tensor_parellel_size | int | Yes | Number of GPUs for tensor parallelism (1 in the default example) |
| CUDA_VISIBLE_DEVICES | str | Yes | Comma-separated GPU device ID(s), e.g. "0" or "0,1" |
Outputs
| Name | Type | Description |
|---|---|---|
| HTTP Server | Service | Running SGLang server on specified port accepting generation requests |
| /get_model_info | HTTP endpoint | Returns model metadata including model_path (used by rebase.py to load AutoConfig) |
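The /get_model_info endpoint is how rebase.py recovers the served model's path for loading its AutoConfig. A hedged sketch of fetching that metadata over plain HTTP (the get_model_info helper below is illustrative, not code from the ETS repository; only the endpoint and its model_path field are documented above):

```python
import json
import urllib.request


def get_model_info(base_url: str) -> dict:
    """GET /get_model_info from a running SGLang server and return the JSON body.

    The returned dict includes a "model_path" field, which rebase.py passes to
    transformers.AutoConfig.from_pretrained to recover model configuration.
    Illustrative helper, not part of the ETS repository.
    """
    with urllib.request.urlopen(base_url + "/get_model_info", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (assumes the server from run_policy.sh is already up):
# info = get_model_info("http://localhost:30000")
# config = AutoConfig.from_pretrained(info["model_path"])
```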
Usage Examples
Default Configuration
# Launch policy model on GPU 0, port 30000
MODEL_REPO="path/to/llemma-7b"
PORT=30000
tensor_parellel_size=1
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
--model-path $MODEL_REPO \
--port $PORT \
--tp-size $tensor_parellel_size \
--trust-remote-code
Multi-GPU Configuration
# Launch larger model with tensor parallelism across 2 GPUs
MODEL_REPO="path/to/llama-3-70b"
PORT=30000
tensor_parellel_size=2
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
--model-path $MODEL_REPO \
--port $PORT \
--tp-size $tensor_parellel_size \
--trust-remote-code