Implementation: Hugging Face Transformers Model Generate
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Quantization, Inference, Text_Generation |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete API, provided by Hugging Face Transformers, for generating text sequences from a language model (including quantized models).
Description
The generate() method is defined on GenerationMixin (in generation/utils.py, line 2266) and inherited by all model classes that support text generation, including AutoModelForCausalLM. It implements the full autoregressive decoding pipeline: input preparation, generation mode selection, logits processing, token sampling or beam search, and stopping criteria evaluation.
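Conceptually, the core loop reduces to running a forward pass, picking the next token, and checking whether to stop. The sketch below is a simplified illustration of greedy decoding only (the real method also manages KV caching, batch expansion, mode dispatch, and the full processor/criteria stacks); model and input_ids are assumed to be set up as in the Usage Examples section below.
import torch
def naive_greedy_decode(model, input_ids, max_new_tokens=20, eos_token_id=None):
    # Simplified stand-in for what generate() does in greedy mode.
    generated = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(generated).logits  # forward pass; quantized layers dequantize internally
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: most probable next token
        generated = torch.cat([generated, next_token], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break  # stopping criterion: every sequence reached EOS
    return generated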
When called on a quantized model, generate() works transparently: the quantized layers handle dequantization internally during each forward() call. The method supports multiple generation modes dispatched via GenerationMode:
- SAMPLE / GREEDY_SEARCH -- Standard autoregressive decoding (greedy or with sampling).
- BEAM_SEARCH / BEAM_SAMPLE -- Beam-based decoding with optional sampling.
- ASSISTED_GENERATION -- Speculative decoding with a draft model.
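In practice the mode is selected implicitly from the parameters passed. A minimal sketch, assuming model and input_ids are set up as in the Usage Examples below and draft_model is a placeholder for any smaller compatible checkpoint:
# Greedy search: the default with do_sample=False and num_beams=1.
out_greedy = model.generate(input_ids=input_ids, do_sample=False, max_new_tokens=50)
# Multinomial sampling: do_sample=True (with num_beams=1).
out_sample = model.generate(input_ids=input_ids, do_sample=True, temperature=0.8, max_new_tokens=50)
# Beam search: num_beams > 1 (combine with do_sample=True for beam sampling).
out_beam = model.generate(input_ids=input_ids, num_beams=4, do_sample=False, max_new_tokens=50)
# Assisted (speculative) decoding: pass a smaller draft model as assistant_model.
out_assisted = model.generate(input_ids=input_ids, assistant_model=draft_model, max_new_tokens=50)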
The method accepts generation parameters either through a GenerationConfig object or as keyword arguments that override the config. Key sampling parameters include temperature, top_k, top_p, do_sample, and max_new_tokens.
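A short sketch of both styles (same model and inputs as in the Usage Examples below); keyword arguments take precedence over the config values:
from transformers import GenerationConfig
# Bundle the parameters in a GenerationConfig object...
gen_config = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
output = model.generate(input_ids=input_ids, generation_config=gen_config)
# ...or override individual values ad hoc via keyword arguments.
output = model.generate(input_ids=input_ids, generation_config=gen_config, temperature=1.0)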
Usage
Use this API for any text generation task after loading a model (quantized or otherwise). It is the standard entry point for inference with transformer language models.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/generation/utils.py (line 2266)
Signature
class GenerationMixin:
    def generate(
        self,
        inputs: torch.Tensor | None = None,
        generation_config: GenerationConfig | None = None,
        logits_processor: LogitsProcessorList | None = None,
        stopping_criteria: StoppingCriteriaList | None = None,
        prefix_allowed_tokens_fn: Callable[[int, torch.Tensor], list[int]] | None = None,
        synced_gpus: bool | None = None,
        assistant_model: PreTrainedModel | None = None,
        streamer: BaseStreamer | None = None,
        negative_prompt_ids: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        custom_generate: str | Callable | None = None,
        **kwargs,
    ) -> GenerateOutput | torch.LongTensor: ...
Import
# generate() is a method on model instances, not imported directly
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(...)
output = model.generate(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | torch.Tensor | No | Input token IDs (or encoder inputs). If None, initialized with bos_token_id. |
| generation_config | GenerationConfig | No | Generation configuration object. If not provided, the model's default config is used. |
| max_new_tokens | int | No (via kwargs) | Maximum number of tokens to generate beyond the input. |
| do_sample | bool | No (via kwargs) | Whether to use sampling (True) or greedy decoding (False). |
| temperature | float | No (via kwargs) | Sampling temperature. Values > 1.0 increase randomness; values < 1.0 decrease it. |
| top_k | int | No (via kwargs) | Limits sampling to the top-k most probable tokens. |
| top_p | float | No (via kwargs) | Nucleus sampling: limits sampling to the smallest set of tokens with cumulative probability >= top_p. |
| num_beams | int | No (via kwargs) | Number of beams for beam search. 1 means no beam search. |
| streamer | BaseStreamer | No | Streamer object for real-time token streaming. |
| assistant_model | PreTrainedModel | No | Draft model for speculative/assisted decoding. |
| logits_processor | LogitsProcessorList | No | Custom logits processors for advanced control. |
| stopping_criteria | StoppingCriteriaList | No | Custom stopping criteria beyond max length and EOS token. |
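The last two rows are easiest to see in code. A minimal sketch using the built-in NoRepeatNGramLogitsProcessor and MaxTimeCriteria (model and input_ids as in the Usage Examples below):
from transformers import (
    LogitsProcessorList,
    NoRepeatNGramLogitsProcessor,
    StoppingCriteriaList,
    MaxTimeCriteria,
)
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=200,
    logits_processor=LogitsProcessorList([NoRepeatNGramLogitsProcessor(3)]),   # block repeated 3-grams
    stopping_criteria=StoppingCriteriaList([MaxTimeCriteria(max_time=10.0)]),  # stop after ~10 s of wall-clock time
)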
Outputs
| Name | Type | Description |
|---|---|---|
| sequences | torch.LongTensor | Generated token ID sequences of shape (batch_size, sequence_length). Returned directly when return_dict_in_generate=False. |
| output | GenerateDecoderOnlyOutput or GenerateEncoderDecoderOutput | Rich output object containing sequences, scores, logits, attentions, and hidden states. Returned when return_dict_in_generate=True. |
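A short sketch of the structured return form (same setup as the examples below):
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20,
    return_dict_in_generate=True,
    output_scores=True,
)
print(output.sequences.shape)   # (batch_size, sequence_length)
print(len(output.scores))       # one score tensor per generated token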
Usage Examples
Basic Quantized Inference
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello my name is", return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
output = model.generate(input_ids=input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Sampling with Temperature and Top-p
output = model.generate(
input_ids=input_ids,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95,
)
Streaming Output
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
model.generate(
input_ids=input_ids,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
streamer=streamer,
)
Greedy Decoding (Deterministic)
output = model.generate(
input_ids=input_ids,
max_new_tokens=100,
do_sample=False,
)