
Implementation:FMInference FlexLLMGen OptLM Generate

From Leeroopedia


Metadata

| Field | Value |
|-------|-------|
| Repo  | FlexLLMGen |

Domains

  • Inference_Optimization
  • Text_Generation

Overview

Concrete tool from the FlexLLMGen library for running autoregressive text generation with three-tier (GPU, CPU, disk) memory offloading.

Description

OptLM.generate() wraps the full token generation pipeline. It creates a Task from inputs, allocates output_ids and hidden state buffers, initializes KV caches for all layers and batches, runs the appropriate generation loop (normal, overlap_single_batch, or overlap_multi_batch based on policy settings), and returns the complete output_ids array containing both prompt and generated tokens.

Usage

Call this method after the model has been loaded. Inputs must be tokenized input_ids as a NumPy array or a list of lists, with shape (gpu_batch_size * num_gpu_batches, prompt_len). Both greedy decoding (do_sample=False) and temperature-based sampling (do_sample=True) are supported.
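The batch-shape requirement above can be checked with a small helper before calling generate(). This is a sketch, not part of FlexLLMGen's API; the gpu_batch_size and num_gpu_batches values shown are hypothetical policy settings:

```python
import numpy as np

def pad_batch(token_lists, pad_id, gpu_batch_size=4, num_gpu_batches=3):
    """Left-pad variable-length prompts to a uniform prompt_len and verify
    that the batch size equals gpu_batch_size * num_gpu_batches."""
    expected = gpu_batch_size * num_gpu_batches
    assert len(token_lists) == expected, (
        f"need {expected} prompts, got {len(token_lists)}")
    prompt_len = max(len(t) for t in token_lists)
    batch = np.full((expected, prompt_len), pad_id, dtype=np.int64)
    for i, toks in enumerate(token_lists):
        batch[i, prompt_len - len(toks):] = toks  # left padding
    return batch

batch = pad_batch([[5, 6], [5, 6, 7]] * 6, pad_id=1)
print(batch.shape)  # (12, 3)
```

Left padding is used here to match the left-padded tokenizer setup shown in the usage example below.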

Code Reference

| Field | Value |
|-------|-------|
| Source | flexllmgen/flex_opt.py, lines 825-910 |
| Import | from flexllmgen.flex_opt import OptLM |

Signature:

```python
def generate(self,
             inputs: Union[np.array, List[List[int]]],
             max_new_tokens: int = 32,
             do_sample: bool = False,
             temperature: float = 1.0,
             stop: Optional[int] = None,
             debug_mode: Optional[str] = None,
             cut_gen_len: Optional[int] = None,
             verbose: int = 0):
    """
    Args:
        inputs: Tokenized input_ids (batch_size x prompt_len)
        max_new_tokens: Number of new tokens to generate
        do_sample: Use sampling (True) or greedy (False)
        temperature: Sampling temperature
        stop: Stop token id for early stopping
        debug_mode: Debug mode string (None, "fewer_batch", "breakdown")
        cut_gen_len: Limit generation length (for benchmarking)
        verbose: Verbosity level
    Returns:
        np.ndarray of shape (batch_size, prompt_len + max_new_tokens)
    """
```

I/O Contract

Inputs:

| Name | Type | Required | Description |
|------|------|----------|-------------|
| inputs | Union[np.array, List[List[int]]] | Yes | Tokenized input_ids batch |
| max_new_tokens | int | No | Number of tokens to generate (default: 32) |
| do_sample | bool | No | Enable sampling (default: False, i.e. greedy) |
| temperature | float | No | Sampling temperature (default: 1.0) |
| stop | Optional[int] | No | Token id for early stopping |
| debug_mode | Optional[str] | No | Debug mode (None, "fewer_batch", or "breakdown") |
| cut_gen_len | Optional[int] | No | Cap on generation length (for benchmarking) |
| verbose | int | No | Verbosity level (default: 0) |

Outputs:

| Name | Type | Description |
|------|------|-------------|
| output_ids | np.ndarray, shape (batch_size, prompt_len + gen_len) | Prompt plus generated token ids |
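Since output_ids concatenates the echoed prompt with the newly generated ids, the generated portion can be recovered by slicing off the first prompt_len columns. A small sketch with hypothetical token values:

```python
import numpy as np

prompt_len = 4
# Hypothetical output: 2 sequences, 4 prompt tokens + 3 generated tokens each.
output_ids = np.array([[10, 11, 12, 13, 50, 51, 52],
                       [20, 21, 22, 23, 60, 61, 62]])
gen_ids = output_ids[:, prompt_len:]  # drop the echoed prompt
print(gen_ids.tolist())  # [[50, 51, 52], [60, 61, 62]]
```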

Usage Examples

```python
import numpy as np
from transformers import AutoTokenizer

# `model` is assumed to be an OptLM instance constructed earlier
# (e.g. for facebook/opt-30b with an offloading policy).
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
tokenizer.add_bos_token = False

prompts = ["Question: What is the capital of France?\nAnswer:"]
inputs = tokenizer(prompts, padding="max_length", max_length=128)

output_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.7,
    stop=tokenizer("\n").input_ids[0],  # stop early at the first newline token
)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```
