Heuristic: SqueezeAILab ETS Embedding Model GPU Collocation
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-14 02:30 GMT |
Overview
Collocate the SentenceTransformer embedding model on the same GPU as the reward model, and reduce the reward server's static memory fraction to 0.85 to leave VRAM headroom.
Description
The ETS system optionally uses a SentenceTransformer model (`math-similarity/Bert-MLM_arXiv-MP-class_zbMath`) for computing trajectory diversity via cosine-similarity clustering. This embedding model must be placed on a GPU for reasonable performance. Rather than requiring a third GPU, the system collocates this embedding model on the same GPU as the reward model (GPU 1 by default). To prevent CUDA out-of-memory errors, the reward server's `--mem-fraction-static` parameter is reduced from the default (typically 0.9) to 0.85, freeing approximately 5% of VRAM for the embedding model.
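The headroom math is easy to sanity-check. A minimal back-of-envelope sketch (the GPU size and model footprint below are illustrative assumptions, not measured values from the ETS repository):

```python
# Back-of-envelope: VRAM freed by lowering --mem-fraction-static.
GPU_VRAM_GB = 80.0        # e.g. an A100-80GB (assumption)
DEFAULT_FRACTION = 0.90   # sglang's typical default
REDUCED_FRACTION = 0.85   # value recommended by this heuristic

headroom_gb = GPU_VRAM_GB * (DEFAULT_FRACTION - REDUCED_FRACTION)
print(f"Extra headroom: {headroom_gb:.1f} GB")
# A BERT-based SentenceTransformer is well under 1 GB in fp32,
# so it fits comfortably in this freed budget.
```

On smaller cards (e.g. 24 GB), the same 5% cut frees proportionally less, so verify the embedding model's actual footprint before relying on this margin.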
Usage
Apply this heuristic when running ETS with diversity-aware selection enabled (`lambdas > 0` in the YAML config). If `lambdas` is 0, the embedding model is not loaded and this heuristic is not needed. The `--embed_device` CLI argument must match the GPU running the reward model.
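The diversity weight is read from the YAML sweep config via `paras.get("lambdas", 0)`. A minimal illustrative fragment (the value shown is an example, not a recommendation from the repository):

```yaml
# Diversity-aware selection: the SentenceTransformer embedding model
# is only loaded when lambdas > 0 (see rebase.py).
lambdas: 0.5
```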
The Insight (Rule of Thumb)
- Action: Set `--mem-fraction-static 0.85` on the reward model server and set `--embed_device` to the same GPU index as the reward model.
- Value: `0.85` memory fraction (vs default ~0.9); `embed_device=1` matches reward model GPU.
- Trade-off: The reward model has slightly less VRAM available (85% vs 90%), which may reduce its maximum batch size. The embedding model gets the remaining headroom.
Reasoning
The SentenceTransformer model is relatively small (BERT-based) compared to the LLM reward model, so it fits comfortably in the ~15% VRAM headroom. Placing it on a separate GPU would waste resources, while placing it on GPU 0 (policy model) would reduce the policy model's available memory — which is more critical since the policy model handles KV cache for all active tree branches.
The reward model GPU is preferred for collocation because:
- The reward model only does forward passes with `max_tokens=0` (scoring), requiring less KV cache than the generative policy model.
- The embedding model is only invoked once per select-and-expand step (not per-token), so GPU memory contention is intermittent.
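For intuition on what the embedding model is used for, here is a minimal pure-Python sketch of cosine-similarity-based redundancy scoring. Function names and the penalty definition (max similarity to already-selected embeddings) are assumptions for illustration; the actual ETS selection rule lives in `rebase.py` and uses real SentenceTransformer embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity_penalty(candidate, selected):
    """Max cosine similarity of `candidate` to already-selected
    embeddings; higher means more redundant. Returns 0.0 when
    nothing has been selected yet."""
    if not selected:
        return 0.0
    return max(cosine_similarity(candidate, s) for s in selected)

# Toy 2-D "embeddings": two near-duplicate directions and one orthogonal.
e1 = [1.0, 0.0]
e2 = [0.99, 0.14]  # almost parallel to e1 -> penalized as redundant
e3 = [0.0, 1.0]    # orthogonal to e1 -> no penalty

print(diversity_penalty(e2, [e1]))  # near 1.0
print(diversity_penalty(e3, [e1]))  # 0.0
```

In ETS the candidate vectors are trajectory embeddings from the collocated SentenceTransformer, and the penalty is weighted by `lambdas` during select-and-expand.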
Code Evidence
Memory fraction flag in reward server script from `scripts/run_reward.sh:10-11`:
```bash
# use --mem-fraction-static 0.85 if using collocated embedding model
CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path $MODEL_REPO --port $PORT --tp-size $tensor_parellel_size --trust-remote-code --mem-fraction-static 0.85
```
Embedding device argument from `scripts/ets_sweep_math500.sh:23`:
```bash
--embed_device 1 # assumes reward model is on GPU 1
```
Conditional embedding model initialization from `rebase.py:726-735`:
```python
if args.embed_device is not None:
    device = "cuda:" + str(args.embed_device)
else:
    device = "cpu"

# initialize sentence transformer
if paras.get("lambdas", 0) > 0:
    multimodel = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath', device=device)
else:
    multimodel = None
```
README guidance from `README.md:42-43`:
"This script assumes that the generator model is on GPU 0 and the reward model is on GPU 1. If the reward model is on another GPU, make sure the embed_device parameter in this script matches the GPU that the reward model is on in order to collocate the embedding model and reward model."