Implementation: Hugging Face Transformers Model Generate
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Quantization, Inference, Text_Generation |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete API, provided by Hugging Face Transformers, for generating text sequences from a language model (including quantized models).
Description
The generate() method is defined on GenerationMixin (in generation/utils.py, line 2266) and inherited by all model classes that support text generation, including AutoModelForCausalLM. It implements the full autoregressive decoding pipeline: input preparation, generation mode selection, logits processing, token sampling or beam search, and stopping criteria evaluation.
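Conceptually, the core loop reduces to running a forward pass, picking the next token, and checking whether to stop. The sketch below is a simplified illustration of greedy decoding only (the real method also manages KV caching, batch expansion, mode dispatch, and the full processor/criteria stacks); model and input_ids are assumed to be set up as in the Usage Examples section below.
import torch
def naive_greedy_decode(model, input_ids, max_new_tokens=20, eos_token_id=None):
    # Simplified stand-in for what generate() does in greedy mode.
    generated = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(generated).logits  # forward pass; quantized layers dequantize internally
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: most probable next token
        generated = torch.cat([generated, next_token], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break  # stopping criterion: every sequence reached EOS
    return generated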
When called on a quantized model, generate() works transparently: the quantized layers handle dequantization internally during each forward() call. The method supports multiple generation modes dispatched via GenerationMode:
- SAMPLE / GREEDY_SEARCH -- Standard autoregressive decoding (greedy or with sampling).
- BEAM_SEARCH / BEAM_SAMPLE -- Beam-based decoding with optional sampling.
- ASSISTED_GENERATION -- Speculative decoding with a draft model.
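In practice the mode is selected implicitly from the parameters passed. A minimal sketch, assuming model and input_ids are set up as in the Usage Examples below and draft_model is a placeholder for any smaller compatible checkpoint:
# Greedy search: the default with do_sample=False and num_beams=1.
out_greedy = model.generate(input_ids=input_ids, do_sample=False, max_new_tokens=50)
# Multinomial sampling: do_sample=True (with num_beams=1).
out_sample = model.generate(input_ids=input_ids, do_sample=True, temperature=0.8, max_new_tokens=50)
# Beam search: num_beams > 1 (combine with do_sample=True for beam sampling).
out_beam = model.generate(input_ids=input_ids, num_beams=4, do_sample=False, max_new_tokens=50)
# Assisted (speculative) decoding: pass a smaller draft model as assistant_model.
out_assisted = model.generate(input_ids=input_ids, assistant_model=draft_model, max_new_tokens=50)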
The method accepts generation parameters either through a GenerationConfig object or as keyword arguments that override the config. Key sampling parameters include temperature, top_k, top_p, do_sample, and max_new_tokens.
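A short sketch of both styles (same model and inputs as in the Usage Examples below); keyword arguments take precedence over the config values:
from transformers import GenerationConfig
# Bundle the parameters in a GenerationConfig object...
gen_config = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
output = model.generate(input_ids=input_ids, generation_config=gen_config)
# ...or override individual values ad hoc via keyword arguments.
output = model.generate(input_ids=input_ids, generation_config=gen_config, temperature=1.0)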
Usage
Use this API for any text generation task after loading a model (quantized or otherwise). It is the standard entry point for inference with transformer language models.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/generation/utils.py (line 2266)
Signature
class GenerationMixin:
    def generate(
        self,
        inputs: torch.Tensor | None = None,
        generation_config: GenerationConfig | None = None,
        logits_processor: LogitsProcessorList | None = None,
        stopping_criteria: StoppingCriteriaList | None = None,
        prefix_allowed_tokens_fn: Callable[[int, torch.Tensor], list[int]] | None = None,
        synced_gpus: bool | None = None,
        assistant_model: PreTrainedModel | None = None,
        streamer: BaseStreamer | None = None,
        negative_prompt_ids: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        custom_generate: str | Callable | None = None,
        **kwargs,
    ) -> GenerateOutput | torch.LongTensor: ...
Import
# generate() is a method on model instances, not imported directly
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(...)
output = model.generate(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | torch.Tensor | No | Input token IDs (or encoder inputs). If None, initialized with bos_token_id. |
| generation_config | GenerationConfig | No | Generation configuration object. If not provided, the model's default config is used. |
| max_new_tokens | int | No (via kwargs) | Maximum number of tokens to generate beyond the input. |
| do_sample | bool | No (via kwargs) | Whether to use sampling (True) or greedy decoding (False). |
| temperature | float | No (via kwargs) | Sampling temperature. Values > 1.0 increase randomness; values < 1.0 decrease it. |
| top_k | int | No (via kwargs) | Limits sampling to the top-k most probable tokens. |
| top_p | float | No (via kwargs) | Nucleus sampling: limits sampling to the smallest set of tokens with cumulative probability >= top_p. |
| num_beams | int | No (via kwargs) | Number of beams for beam search. 1 means no beam search. |
| streamer | BaseStreamer | No | Streamer object for real-time token streaming. |
| assistant_model | PreTrainedModel | No | Draft model for speculative/assisted decoding. |
| logits_processor | LogitsProcessorList | No | Custom logits processors for advanced control. |
| stopping_criteria | StoppingCriteriaList | No | Custom stopping criteria beyond max length and EOS token. |
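The last two rows are easiest to see in code. A minimal sketch using the built-in NoRepeatNGramLogitsProcessor and MaxTimeCriteria (model and input_ids as in the Usage Examples below):
from transformers import (
    LogitsProcessorList,
    NoRepeatNGramLogitsProcessor,
    StoppingCriteriaList,
    MaxTimeCriteria,
)
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=200,
    logits_processor=LogitsProcessorList([NoRepeatNGramLogitsProcessor(3)]),   # block repeated 3-grams
    stopping_criteria=StoppingCriteriaList([MaxTimeCriteria(max_time=10.0)]),  # stop after ~10 s of wall-clock time
)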
Outputs
| Name | Type | Description |
|---|---|---|
| sequences | torch.LongTensor | Generated token ID sequences of shape (batch_size, sequence_length). Returned directly when return_dict_in_generate=False. |
| output | GenerateDecoderOnlyOutput or GenerateEncoderDecoderOutput | Rich output object containing sequences, scores, logits, attentions, and hidden states. Returned when return_dict_in_generate=True. |
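A short sketch of the structured return form (same setup as the examples below):
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20,
    return_dict_in_generate=True,
    output_scores=True,
)
print(output.sequences.shape)   # (batch_size, sequence_length)
print(len(output.scores))       # one score tensor per generated token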
Usage Examples
Basic Quantized Inference
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello my name is", return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
output = model.generate(input_ids=input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Sampling with Temperature and Top-p
output = model.generate(
input_ids=input_ids,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95,
)
Streaming Output
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
model.generate(
input_ids=input_ids,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
streamer=streamer,
)
Greedy Decoding (Deterministic)
output = model.generate(
input_ids=input_ids,
max_new_tokens=100,
do_sample=False,
)