Implementation: SqueezeAILab ETS SGLang Launch Reward Server
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Serving, Reward_Modeling |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
Concrete tool for launching an SGLang HTTP server hosting the Process Reward Model (PRM) on a dedicated GPU.
Description
This shell script launches the SGLang inference server for the reward model. It differs from the policy server by using --mem-fraction-static 0.85, which reserves 15% of GPU memory for a collocated SentenceTransformer embedding model used for trajectory diversity scoring. The reward server is typically deployed on GPU 1 (separate from the policy model on GPU 0).
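To make the split concrete, here is a back-of-the-envelope sketch of the memory budget. The 0.85 fraction comes from the script; the 80 GB capacity is an illustrative assumption (e.g. an A100-80GB), so substitute your own card's memory.

```python
# Illustrative memory budget for --mem-fraction-static 0.85 on one GPU.
# GPU_MEMORY_GB is an assumption, not a value from the script.
GPU_MEMORY_GB = 80.0
MEM_FRACTION_STATIC = 0.85

# SGLang statically reserves this slice for the PRM weights + KV cache.
prm_budget_gb = GPU_MEMORY_GB * MEM_FRACTION_STATIC
# The remainder is headroom for the collocated SentenceTransformer.
embedding_budget_gb = GPU_MEMORY_GB - prm_budget_gb

print(f"PRM (SGLang static): {prm_budget_gb:.0f} GB")        # 68 GB
print(f"Embedding headroom:  {embedding_budget_gb:.0f} GB")  # 12 GB
```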
Usage
Run this script in a separate terminal or background process before starting the ETS tree search; it must remain running alongside the policy model server for the duration of the search.
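Because the reward server loads its weights asynchronously, it helps to block until the port answers before launching the search. The helper below is an illustrative sketch, not part of the repository, and the polled URL (the server root or a health route) is an assumption to verify against your SGLang version.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 120.0) -> bool:
    """Poll `url` until it responds with any HTTP reply, or give up.

    Illustrative readiness check: the ETS tree search should only start
    once the reward server on GPU 1 has finished loading the PRM.
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True  # server is up and answering
        except (urllib.error.URLError, OSError):
            time.sleep(2.0)  # not ready yet; retry until the deadline
    return False
```

For example, `wait_for_server("http://localhost:30020/")` would gate the search on the reward server from this script being reachable.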
Code Reference
Source Location
- Repository: ETS
- File: scripts/run_reward.sh
- Lines: 1-11
Signature
```sh
CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server \
  --model-path $MODEL_REPO \
  --port $PORT \
  --tp-size $tensor_parellel_size \
  --trust-remote-code \
  --mem-fraction-static 0.85
```
Import
```python
# Client-side connection from rebase.py:
from sglang import RuntimeEndpoint

reward_endpoint = RuntimeEndpoint("http://localhost:30020")
```
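For clients that talk to the server over raw HTTP rather than through `RuntimeEndpoint`, a scoring payload might be assembled as below. This is a hypothetical sketch: the field names follow SGLang's `/generate` request schema (`max_new_tokens: 0` with `return_logprob` to score text without generating), but the exact schema should be checked against the deployed SGLang version.

```python
import json


def build_reward_request(trajectory: str) -> bytes:
    """Assemble a hypothetical PRM scoring request body.

    Assumption: asking for zero new tokens plus log-probabilities makes the
    server score the supplied trajectory instead of continuing it. Verify
    field names against the SGLang version actually serving the PRM.
    """
    payload = {
        "text": trajectory,
        "sampling_params": {"max_new_tokens": 0},
        "return_logprob": True,
    }
    return json.dumps(payload).encode("utf-8")
```

The resulting bytes would be POSTed to the reward server's generate endpoint on port 30020.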
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| MODEL_REPO | str | Yes | Path to PRM weights (HuggingFace model ID or local directory) |
| PORT | int | Yes | HTTP port to serve on (default: 30020) |
| tensor_parellel_size | int | Yes | Number of GPUs for tensor parallelism (default: 1) |
| CUDA_VISIBLE_DEVICES | str | Yes | GPU device ID (default: "1") |
| --mem-fraction-static | float | Yes | Fraction of GPU memory reserved for the PRM (hardcoded to 0.85 in this script; the rest is left for the collocated embedding model) |
Outputs
| Name | Type | Description |
|---|---|---|
| HTTP Server | Service | Running SGLang server on specified port accepting scoring requests |
| Score backend | API | PRM scoring accessible via SGLang's set_score_backend mechanism |
Usage Examples
Default Configuration
```sh
# Launch reward model on GPU 1, port 30020
MODEL_REPO="path/to/prm-model"
PORT=30020
tensor_parellel_size=1

# Reserve 15% of GPU memory for the collocated embedding model
CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server \
  --model-path $MODEL_REPO \
  --port $PORT \
  --tp-size $tensor_parellel_size \
  --trust-remote-code \
  --mem-fraction-static 0.85
```