
Implementation:SqueezeAILab ETS Sglang Launch Policy Server

From Leeroopedia
Knowledge Sources
Domains Inference, Model_Serving
Last Updated 2026-02-14 02:00 GMT

Overview

Concrete tool for launching an SGLang HTTP server hosting the policy (generator) model on a dedicated GPU.

Description

This shell script launches the SGLang inference server for the policy model. It binds the server to a specific GPU using CUDA_VISIBLE_DEVICES, configures tensor parallelism, and starts serving on the specified port. The server provides an HTTP API that the tree search engine (rebase.py) connects to via RuntimeEndpoint.

Usage

Run this script in a separate terminal or background process before starting the ETS tree search. The server must be fully initialized (model loaded into GPU memory) before rebase.py is invoked.
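One way to enforce the "fully initialized" requirement is to poll the server's /get_model_info endpoint (listed under Outputs below) until it responds. The sketch below is a minimal stdlib-only readiness check; the endpoint path comes from this page, but the polling wrapper itself is illustrative, not part of the ETS repository.

```python
import time
import urllib.error
import urllib.request

def server_ready(base_url: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll the SGLang server until /get_model_info answers, or time out.

    Model loading can take minutes for large checkpoints, so the default
    timeout is generous. Returns True once the server responds with 200.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/get_model_info", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (weights still loading into GPU memory)
        time.sleep(poll_s)
    return False

# Example: block until the policy server is ready, then invoke rebase.py.
# if server_ready("http://localhost:30000"):
#     subprocess.run(["python3", "rebase.py", ...])
```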

Code Reference

Source Location

  • Repository: ETS
  • File: scripts/run_policy.sh
  • Lines: 1-10

Signature

CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
    --model-path $MODEL_REPO \
    --port $PORT \
    --tp-size $tensor_parellel_size \
    --trust-remote-code

Import

# Client-side connection from rebase.py:
from sglang import RuntimeEndpoint
policy_endpoint = RuntimeEndpoint("http://localhost:30000")
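Besides the RuntimeEndpoint client used by rebase.py, the server can be exercised directly over HTTP via SGLang's native /generate endpoint. The sketch below only builds the request object; the payload schema ("text" plus "sampling_params") is assumed from SGLang's native API and should be checked against your installed version.

```python
import json
import urllib.request

def build_generate_request(base_url: str, prompt: str,
                           max_new_tokens: int = 128,
                           temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for SGLang's native /generate endpoint.

    Payload field names are assumptions based on SGLang's native HTTP API;
    verify them against the SGLang version you are running.
    """
    payload = {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Example (requires a running server):
# with urllib.request.urlopen(build_generate_request("http://localhost:30000", "2+2=")) as resp:
#     print(json.loads(resp.read()))
```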

I/O Contract

Inputs

Name | Type | Required | Description
MODEL_REPO | str | Yes | Path to policy model weights (HuggingFace model ID or local directory)
PORT | int | Yes | HTTP port to serve on (default: 30000)
tensor_parellel_size | int | Yes | Number of GPUs for tensor parallelism (default: 1)
CUDA_VISIBLE_DEVICES | str | Yes | Comma-separated GPU device ID(s) (default: "0")

Outputs

Name | Type | Description
HTTP server | Service | Running SGLang server on the specified port accepting generation requests
/get_model_info | HTTP endpoint | Returns model metadata including model_path (used by rebase.py to load AutoConfig)
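The /get_model_info output can be consumed as in the sketch below, which retrieves the served model's path so a client can load the matching tokenizer or config locally. Only the model_path field name is taken from the table above; the helper functions are illustrative, and the AutoConfig call is commented out because it requires the transformers package.

```python
import json
import urllib.request

def parse_model_info(raw: str) -> str:
    """Pure helper: extract model_path from a /get_model_info JSON body."""
    return json.loads(raw)["model_path"]

def fetch_model_path(base_url: str) -> str:
    """Ask a running SGLang server which model checkpoint it is serving."""
    with urllib.request.urlopen(f"{base_url}/get_model_info", timeout=10) as resp:
        return parse_model_info(resp.read().decode())

# With the path in hand, a client such as rebase.py can load the config locally:
# from transformers import AutoConfig
# config = AutoConfig.from_pretrained(fetch_model_path("http://localhost:30000"))
```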

Usage Examples

Default Configuration

# Launch policy model on GPU 0, port 30000
MODEL_REPO="path/to/llemma-7b"
PORT=30000
tensor_parellel_size=1

CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server \
    --model-path $MODEL_REPO \
    --port $PORT \
    --tp-size $tensor_parellel_size \
    --trust-remote-code

Multi-GPU Configuration

# Launch larger model with tensor parallelism across 2 GPUs
MODEL_REPO="path/to/llama-3-70b"
PORT=30000
tensor_parellel_size=2

CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
    --model-path $MODEL_REPO \
    --port $PORT \
    --tp-size $tensor_parellel_size \
    --trust-remote-code

Related Pages

Implements Principle

Requires Environment
