Implementation: FMInference FlexLLMGen OptLM Generate
Metadata
| Field | Value |
|---|---|
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Text_Generation
Overview
Concrete tool from the FlexLLMGen library for running autoregressive text generation with three-tier (GPU, CPU, disk) memory offloading.
Description
OptLM.generate() wraps the full token generation pipeline. It creates a Task from inputs, allocates output_ids and hidden state buffers, initializes KV caches for all layers and batches, runs the appropriate generation loop (normal, overlap_single_batch, or overlap_multi_batch based on policy settings), and returns the complete output_ids array containing both prompt and generated tokens.
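The choice among the three loops is driven by the policy's overlap setting and the number of GPU batches. The standalone sketch below paraphrases that dispatch as described here; it is an illustration, not the library's code, and the two flag names are assumptions modeled on the Policy fields.

def select_generation_loop(overlap: bool, num_gpu_batches: int) -> str:
    # No I/O-compute overlap: run the plain, easy-to-debug loop.
    if not overlap:
        return "normal"
    # Overlap enabled: choose by how many GPU batches share each layer.
    if num_gpu_batches == 1:
        return "overlap_single_batch"
    return "overlap_multi_batch"

print(select_generation_loop(overlap=True, num_gpu_batches=3))  # -> overlap_multi_batch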
Usage
Call after the model has been loaded. Input must be tokenized input_ids as a numpy array or a list of lists whose shape matches (gpu_batch_size * num_gpu_batches, prompt_len). Supports greedy decoding (do_sample=False) and sampling with temperature (do_sample=True).
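For example, with a policy built for gpu_batch_size=4 and num_gpu_batches=2 (hypothetical values), generate() expects 8 rows of equal length. A minimal shape check, using a dummy prompt in place of real token ids:

import numpy as np

gpu_batch_size, num_gpu_batches, prompt_len = 4, 2, 128  # hypothetical policy values
one_prompt = [0] * prompt_len  # stands in for a tokenized, left-padded prompt
# Repeat the prompt so the batch has exactly gpu_batch_size * num_gpu_batches rows.
input_ids = [one_prompt] * (gpu_batch_size * num_gpu_batches)
assert np.array(input_ids).shape == (gpu_batch_size * num_gpu_batches, prompt_len)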
Code Reference
| Field | Value |
|---|---|
| Source | flexllmgen/flex_opt.py, Lines: 825-910 |
| Import | from flexllmgen.flex_opt import OptLM |
Signature:
def generate(self,
             inputs: Union[np.array, List[List[int]]],
             max_new_tokens: int = 32,
             do_sample: bool = False,
             temperature: float = 1.0,
             stop: Optional[int] = None,
             debug_mode: Optional[str] = None,
             cut_gen_len: Optional[int] = None,
             verbose: int = 0):
    """
    Args:
        inputs: Tokenized input_ids (batch_size x prompt_len)
        max_new_tokens: Number of new tokens to generate
        do_sample: Use sampling (True) or greedy (False)
        temperature: Sampling temperature
        stop: Stop token id for early stopping
        debug_mode: Debug mode string (None, "fewer_batch", "breakdown")
        cut_gen_len: Limit generation length (for benchmarking)
        verbose: Verbosity level

    Returns:
        np.ndarray of shape (batch_size, prompt_len + max_new_tokens)
    """
I/O Contract
Inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | Union[np.array, List[List[int]]] | Yes | Tokenized input_ids batch |
| max_new_tokens | int | No | Number of new tokens to generate; default 32 |
| do_sample | bool | No | Use sampling if True, greedy decoding if False; default False |
| temperature | float | No | Sampling temperature; default 1.0 |
| stop | Optional[int] | No | Token id that triggers early stopping |
| debug_mode | Optional[str] | No | Debug mode: None, "fewer_batch", or "breakdown" |
| cut_gen_len | Optional[int] | No | Cap on generation length (for benchmarking) |
| verbose | int | No | Verbosity level; default 0 |
Outputs:
| Name | Type | Description |
|---|---|---|
| output_ids | np.ndarray shape (batch_size, prompt_len + gen_len) | Contains prompt + generated token ids |
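Because the returned array still contains the prompt, the newly generated ids are obtained by slicing off the first prompt_len columns. A small illustration with made-up shapes:

import numpy as np

prompt_len, max_new_tokens = 128, 32
# Stand-in for the array returned by model.generate(...).
output_ids = np.zeros((8, prompt_len + max_new_tokens), dtype=np.int64)

generated_only = output_ids[:, prompt_len:]  # keep only the new tokens
assert generated_only.shape == (8, max_new_tokens)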
Usage Examples
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
tokenizer.add_bos_token = False

prompts = ["Question: What is the capital of France?\nAnswer:"]
# Left-pad every prompt to a fixed length so each row has the same prompt_len.
inputs = tokenizer(prompts, padding="max_length", max_length=128)

# `model` is an already-loaded OptLM instance (see the construction sketch below).
output_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.7,
    stop=tokenizer("\n").input_ids[0],
)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
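The example assumes a `model` object has already been built. A minimal construction sketch is given below; it loosely follows the upstream completion example, and the Policy/ExecutionEnv arguments (placement percentages, compression configs, offload and weight paths) are assumptions to verify against the installed flex_opt.py rather than a definitive recipe.

from flexllmgen.flex_opt import Policy, OptLM, ExecutionEnv, CompressionConfig

# Offloading environment: GPU + CPU, plus a disk directory for spill-over.
env = ExecutionEnv.create("~/flexllmgen_offload")

# Keep weights, KV cache, and activations fully on GPU in this sketch
# (100/0 splits); one GPU batch of size one, so generate() expects a single row.
policy = Policy(1, 1,
                100, 0, 100, 0, 100, 0,
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False,
                comp_weight_config=CompressionConfig(num_bits=4, group_size=64,
                                                     group_dim=0, symmetric=False),
                compress_cache=False,
                comp_cache_config=CompressionConfig(num_bits=4, group_size=64,
                                                    group_dim=2, symmetric=False))

model = OptLM("facebook/opt-30b", env, "~/opt_weights", policy)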