
Implementation:Huggingface Transformers Model Generate

From Leeroopedia
Knowledge Sources
Domains Model_Optimization, Quantization, Inference, Text_Generation
Last Updated 2026-02-13 00:00 GMT

Overview

The concrete Hugging Face Transformers API for generating text sequences from a language model, including quantized models.

Description

The generate() method is defined on GenerationMixin (in generation/utils.py, line 2266) and inherited by all model classes that support text generation, including AutoModelForCausalLM. It implements the full autoregressive decoding pipeline: input preparation, generation mode selection, logits processing, token sampling or beam search, and stopping criteria evaluation.

When called on a quantized model, generate() works transparently: the quantized layers handle dequantization internally during each forward() call. The method supports multiple generation modes dispatched via GenerationMode:

  • SAMPLE / GREEDY_SEARCH -- Standard autoregressive decoding (greedy or with sampling).
  • BEAM_SEARCH / BEAM_SAMPLE -- Beam-based decoding with optional sampling.
  • ASSISTED_GENERATION -- Speculative decoding with a draft model.

The method accepts generation parameters either through a GenerationConfig object or as keyword arguments that override the config. Key sampling parameters include temperature, top_k, top_p, do_sample, and max_new_tokens.
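A reusable GenerationConfig can be built once and passed to many generate() calls; any keyword argument passed directly to generate() overrides the field of the same name for that call only (a sketch at the config level, so it runs without loading a model).

```python
from transformers import GenerationConfig

# Reusable configuration object; fields mirror generate()'s keyword arguments.
config = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)

# model.generate(input_ids, generation_config=config) uses these values;
# model.generate(input_ids, generation_config=config, temperature=0.2)
# would override only temperature, and only for that call.
print(config.temperature, config.max_new_tokens)
```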

Usage

Use this API for any text generation task after loading a model (quantized or otherwise). It is the standard entry point for inference with transformer language models.

Code Reference

Source Location

  • Repository: transformers
  • File: src/transformers/generation/utils.py (line 2266)

Signature

class GenerationMixin:
    def generate(
        self,
        inputs: torch.Tensor | None = None,
        generation_config: GenerationConfig | None = None,
        logits_processor: LogitsProcessorList | None = None,
        stopping_criteria: StoppingCriteriaList | None = None,
        prefix_allowed_tokens_fn: Callable[[int, torch.Tensor], list[int]] | None = None,
        synced_gpus: bool | None = None,
        assistant_model: PreTrainedModel | None = None,
        streamer: BaseStreamer | None = None,
        negative_prompt_ids: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        custom_generate: str | Callable | None = None,
        **kwargs,
    ) -> GenerateOutput | torch.LongTensor: ...

Import

# generate() is a method on model instances, not imported directly
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(...)
output = model.generate(...)
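Because generate() is inherited rather than redefined per model, every concrete causal LM class resolved by AutoModelForCausalLM shares the GenerationMixin implementation. This can be verified from the class hierarchy alone, without downloading weights (GPT2LMHeadModel is used here purely as a representative class).

```python
from transformers import GenerationMixin, GPT2LMHeadModel

# The concrete model class participates in GenerationMixin and therefore
# exposes generate(); no weights are needed to check the hierarchy.
print(issubclass(GPT2LMHeadModel, GenerationMixin))  # True
print(hasattr(GPT2LMHeadModel, "generate"))          # True
```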

I/O Contract

Inputs

  • inputs (torch.Tensor, optional) -- Input token IDs (or encoder inputs). If None, initialized with bos_token_id.
  • generation_config (GenerationConfig, optional) -- Generation configuration object. If not provided, the model's default config is used.
  • max_new_tokens (int, optional, via kwargs) -- Maximum number of tokens to generate beyond the input.
  • do_sample (bool, optional, via kwargs) -- Whether to use sampling (True) or greedy decoding (False).
  • temperature (float, optional, via kwargs) -- Sampling temperature. Values > 1.0 increase randomness; values < 1.0 decrease it.
  • top_k (int, optional, via kwargs) -- Limits sampling to the top-k most probable tokens.
  • top_p (float, optional, via kwargs) -- Nucleus sampling: limits sampling to the smallest set of tokens with cumulative probability >= top_p.
  • num_beams (int, optional, via kwargs) -- Number of beams for beam search; 1 means no beam search.
  • streamer (BaseStreamer, optional) -- Streamer object for real-time token streaming.
  • assistant_model (PreTrainedModel, optional) -- Draft model for speculative/assisted decoding.
  • logits_processor (LogitsProcessorList, optional) -- Custom logits processors for advanced control.
  • stopping_criteria (StoppingCriteriaList, optional) -- Custom stopping criteria beyond max length and EOS token.
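As one illustration of the extension points above, a custom StoppingCriteria subclass can end generation when a chosen token appears. StopOnToken below is a hypothetical helper written for this page, not part of the library, and the direct call at the end exercises it without loading a model.

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnToken(StoppingCriteria):
    """Hypothetical criterion: stop a sequence once stop_id is generated."""

    def __init__(self, stop_id: int):
        self.stop_id = stop_id

    def __call__(self, input_ids: torch.LongTensor, scores, **kwargs):
        # One boolean per batch element, per the StoppingCriteria contract.
        return input_ids[:, -1] == self.stop_id

criteria = StoppingCriteriaList([StopOnToken(stop_id=50256)])
# Would be passed as: model.generate(input_ids, stopping_criteria=criteria)

# Direct check without a model: the last token of the row is 50256.
done = criteria(torch.tensor([[10, 20, 50256]]), scores=None)
print(done)
```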

Outputs

  • sequences (torch.LongTensor) -- Generated token ID sequences of shape (batch_size, sequence_length). Returned directly when return_dict_in_generate=False.
  • output (GenerateDecoderOnlyOutput or GenerateEncoderDecoderOutput) -- Rich output object containing sequences, scores, logits, attentions, and hidden states. Returned when return_dict_in_generate=True.

Usage Examples

Basic Quantized Inference

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Hello my name is", return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)

output = model.generate(input_ids=input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Sampling with Temperature and Top-p

output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)

Streaming Output

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_special_tokens=True)

model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
)

Greedy Decoding (Deterministic)

output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    do_sample=False,
)
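Reproducible Sampling

Sampling runs can also be made repeatable without falling back to greedy decoding: transformers provides set_seed(), which seeds the Python, NumPy, and torch RNGs in one call, so repeated generate() calls with do_sample=True produce identical output. The sketch below demonstrates the effect with raw torch draws so it runs without a model.

```python
import torch
from transformers import set_seed

# set_seed seeds random, numpy, and torch together.
set_seed(42)
first = torch.rand(3)

set_seed(42)
second = torch.rand(3)

print(torch.equal(first, second))  # True
```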

Related Pages

Implements Principle

Requires Environment
