Implementation:FMInference FlexLLMGen Get Batches
| Knowledge Sources | |
|---|---|
| Domains | Benchmark_Integration, Batch_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by the FlexLLMGen HELM integration, for constructing padded evaluation batches from HELM scenario requests.
Description
get_batches() takes a ScenarioState (whose request_states hold the prompts and generation parameters), a tokenizer, a batch_size, and an optional pad_to_seq_len. It groups the requests into batches of batch_size, tokenizes all prompts with left-padding, and returns a list of batch dictionaries. Each dictionary contains input_ids, generation parameters (temperature, max_tokens, stop_sequences), and the original request_state references used to map results back to requests.
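The left-padding and grouping steps described above can be illustrated with a minimal sketch. This is not the FlexLLMGen source: the helper name left_pad_and_batch, the pad_id argument, and the use of pre-tokenized integer lists in place of a real tokenizer are all assumptions made for illustration.

```python
import numpy as np


def left_pad_and_batch(token_lists, batch_size, pad_id, pad_to_seq_len=None):
    """Hypothetical helper: left-pad token id lists and split them into batches."""
    # Auto-compute the padding length from the longest prompt when none is given.
    if pad_to_seq_len is None:
        seq_len = max(len(tokens) for tokens in token_lists)
    else:
        seq_len = pad_to_seq_len
    if any(len(tokens) > seq_len for tokens in token_lists):
        raise ValueError("pad_to_seq_len is shorter than the longest prompt")

    # Fill the array with pad_id, then write each prompt flush against the
    # right edge so that the padding ends up on the left.
    input_ids = np.full((len(token_lists), seq_len), pad_id, dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        if tokens:
            input_ids[row, seq_len - len(tokens):] = tokens

    # Slice into consecutive groups of batch_size; in this sketch the last
    # group is simply smaller when the count does not divide evenly.
    return [
        {"input_ids": input_ids[start:start + batch_size]}
        for start in range(0, len(token_lists), batch_size)
    ]


# Three toy "prompts" of different lengths, two per batch, pad id 1 (assumed).
batches = left_pad_and_batch([[5, 6, 7], [8], [9, 10]], batch_size=2, pad_id=1)
```

Left-padding keeps the last prompt token of every row in the final column, so generation can continue from the same position for all rows in a batch.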
Usage
Called within the HELM execute() function before the generation loop. The returned batches are then iterated over, and each one is fed to OptLM.generate().
Code Reference
- Source: flexllmgen/apps/helm_run.py, Lines: 131-163
- Signature:
def get_batches(scenario_state, tokenizer, batch_size, pad_to_seq_len=None):
    """Group HELM requests into batches for FlexLLMGen inference.

    Args:
        scenario_state: HELM ScenarioState with request_states
        tokenizer: OptTokenizer or compatible tokenizer
        batch_size: Number of requests per batch (gpu_batch_size * num_gpu_batches)
        pad_to_seq_len: Optional fixed padding length (auto-computed if None)

    Returns:
        List of batch dicts with input_ids, generation params, request_states
    """
- Import:
from flexllmgen.apps.helm_run import get_batches
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| scenario_state | ScenarioState | Yes | HELM scenario with request_states |
| tokenizer | OptTokenizer | Yes | Tokenizer for encoding prompts |
| batch_size | int | Yes | Requests per batch |
| pad_to_seq_len | int | No | Fixed padding length; auto-computed if None |
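As a worked example of the batch_size arithmetic, the sketch below derives batch_size from the two policy values named in the docstring and counts the resulting batches. The concrete numbers are made up, and the count assumes every request lands in exactly one batch; whether a final partial batch is padded with dummy rows or left short is not specified here.

```python
import math

# Illustrative values only; in practice these come from the FlexLLMGen policy.
gpu_batch_size = 4
num_gpu_batches = 3
num_requests = 50

# Requests per batch, as described for the batch_size parameter.
batch_size = gpu_batch_size * num_gpu_batches

# Each request goes into exactly one batch, so round up.
num_batches = math.ceil(num_requests / batch_size)

# Number of real requests that end up in the final batch.
requests_in_last_batch = num_requests - (num_batches - 1) * batch_size
```

With these numbers, batch_size is 12, the 50 requests form 5 batches, and the last batch holds 2 real requests.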
Outputs
- List[dict]: a list of batch dictionaries, each containing an input_ids NumPy array, generation parameters, and request_state references
Usage Examples
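from flexllmgen.apps.helm_run import get_batches, OptTokenizer

# policy, scenario_state, and model are assumed to have been created earlier
# (for example inside execute()); they are not defined in this snippet.
tokenizer = OptTokenizer("facebook/opt-30b")
batch_size = policy.gpu_batch_size * policy.num_gpu_batches

# Build the batches, padding every prompt to a fixed length of 512 tokens.
batches = get_batches(scenario_state, tokenizer, batch_size, pad_to_seq_len=512)

# Run generation one batch at a time.
for batch in batches:
    output_ids = model.generate(
        batch["input_ids"],
        do_sample=batch["do_sample"],
        temperature=batch["temperature"],
        max_new_tokens=batch["max_new_tokens"],
        stop=batch.get("stop"),
    )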