Implementation: FMInference FlexLLMGen OptLM Generate
Metadata
| Field | Value |
|---|---|
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Text_Generation
Overview
Concrete tool from the FlexLLMGen library for running autoregressive text generation with three-tier (GPU, CPU, disk) memory offloading.
Description
OptLM.generate() wraps the full token generation pipeline. It creates a Task from inputs, allocates output_ids and hidden state buffers, initializes KV caches for all layers and batches, runs the appropriate generation loop (normal, overlap_single_batch, or overlap_multi_batch based on policy settings), and returns the complete output_ids array containing both prompt and generated tokens.
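The choice among the three loops is driven by the policy's overlap setting and the number of GPU batches. The standalone sketch below paraphrases that dispatch as described here; it is an illustration, not the library's code, and the two flag names are assumptions modeled on the Policy fields.

def select_generation_loop(overlap: bool, num_gpu_batches: int) -> str:
    # No I/O-compute overlap: run the plain, easy-to-debug loop.
    if not overlap:
        return "normal"
    # Overlap enabled: choose by how many GPU batches share each layer.
    if num_gpu_batches == 1:
        return "overlap_single_batch"
    return "overlap_multi_batch"

print(select_generation_loop(overlap=True, num_gpu_batches=3))  # -> overlap_multi_batch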
Usage
Call after the model has been loaded. Input must be tokenized input_ids as a numpy array or a list of lists whose shape matches (gpu_batch_size * num_gpu_batches, prompt_len). Supports greedy decoding (do_sample=False) and sampling with temperature (do_sample=True).
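For example, with a policy built for gpu_batch_size=4 and num_gpu_batches=2 (hypothetical values), generate() expects 8 rows of equal length. A minimal shape check, using a dummy prompt in place of real token ids:

import numpy as np

gpu_batch_size, num_gpu_batches, prompt_len = 4, 2, 128  # hypothetical policy values
one_prompt = [0] * prompt_len  # stands in for a tokenized, left-padded prompt
# Repeat the prompt so the batch has exactly gpu_batch_size * num_gpu_batches rows.
input_ids = [one_prompt] * (gpu_batch_size * num_gpu_batches)
assert np.array(input_ids).shape == (gpu_batch_size * num_gpu_batches, prompt_len)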
Code Reference
| Field | Value |
|---|---|
| Source | flexllmgen/flex_opt.py, Lines: 825-910 |
| Import | from flexllmgen.flex_opt import OptLM |
Signature:
def generate(self,
             inputs: Union[np.array, List[List[int]]],
             max_new_tokens: int = 32,
             do_sample: bool = False,
             temperature: float = 1.0,
             stop: Optional[int] = None,
             debug_mode: Optional[str] = None,
             cut_gen_len: Optional[int] = None,
             verbose: int = 0):
    """
    Args:
        inputs: Tokenized input_ids (batch_size x prompt_len)
        max_new_tokens: Number of new tokens to generate
        do_sample: Use sampling (True) or greedy (False)
        temperature: Sampling temperature
        stop: Stop token id for early stopping
        debug_mode: Debug mode string (None, "fewer_batch", "breakdown")
        cut_gen_len: Limit generation length (for benchmarking)
        verbose: Verbosity level

    Returns:
        np.ndarray of shape (batch_size, prompt_len + max_new_tokens)
    """
I/O Contract
Inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | Union[np.array, List[List[int]]] | Yes | Tokenized input_ids batch |
| max_new_tokens | int | No | Number of new tokens to generate; default 32 |
| do_sample | bool | No | Use sampling if True, greedy decoding if False; default False |
| temperature | float | No | Sampling temperature; default 1.0 |
| stop | Optional[int] | No | Token id that triggers early stopping |
| debug_mode | Optional[str] | No | Debug mode: None, "fewer_batch", or "breakdown" |
| cut_gen_len | Optional[int] | No | Cap on generation length (for benchmarking) |
| verbose | int | No | Verbosity level; default 0 |
Outputs:
| Name | Type | Description |
|---|---|---|
| output_ids | np.ndarray shape (batch_size, prompt_len + gen_len) | Contains prompt + generated token ids |
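Because the returned array still contains the prompt, the newly generated ids are obtained by slicing off the first prompt_len columns. A small illustration with made-up shapes:

import numpy as np

prompt_len, max_new_tokens = 128, 32
# Stand-in for the array returned by model.generate(...).
output_ids = np.zeros((8, prompt_len + max_new_tokens), dtype=np.int64)

generated_only = output_ids[:, prompt_len:]  # keep only the new tokens
assert generated_only.shape == (8, max_new_tokens)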
Usage Examples
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
tokenizer.add_bos_token = False

prompts = ["Question: What is the capital of France?\nAnswer:"]
# Left-pad every prompt to a fixed length so each row has the same prompt_len.
inputs = tokenizer(prompts, padding="max_length", max_length=128)

# `model` is an already-loaded OptLM instance (see the construction sketch below).
output_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.7,
    stop=tokenizer("\n").input_ids[0],
)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
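The example assumes a `model` object has already been built. A minimal construction sketch is given below; it loosely follows the upstream completion example, and the Policy/ExecutionEnv arguments (placement percentages, compression configs, offload and weight paths) are assumptions to verify against the installed flex_opt.py rather than a definitive recipe.

from flexllmgen.flex_opt import Policy, OptLM, ExecutionEnv, CompressionConfig

# Offloading environment: GPU + CPU, plus a disk directory for spill-over.
env = ExecutionEnv.create("~/flexllmgen_offload")

# Keep weights, KV cache, and activations fully on GPU in this sketch
# (100/0 splits); one GPU batch of size one, so generate() expects a single row.
policy = Policy(1, 1,
                100, 0, 100, 0, 100, 0,
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False,
                comp_weight_config=CompressionConfig(num_bits=4, group_size=64,
                                                     group_dim=0, symmetric=False),
                compress_cache=False,
                comp_cache_config=CompressionConfig(num_bits=4, group_size=64,
                                                    group_dim=2, symmetric=False))

model = OptLM("facebook/opt-30b", env, "~/opt_weights", policy)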