Principle: turboderp-org/exllamav2 Dynamic Text Generation
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, NLP, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Autoregressive text generation produces tokens one at a time by sampling from the model's output probability distribution, conditioned on all previous tokens including the input prompt.
Description
The generate() method of the Dynamic Generator (ExLlamaV2DynamicGenerator) provides a high-level, blocking API for text generation that abstracts away the complexities of the underlying paged-attention batching system. It handles the full generation pipeline:
- Prompt encoding: The text prompt is tokenized into input IDs using the tokenizer.
- Job creation: One or more generation jobs are created and enqueued in the dynamic generator's job queue.
- Prefill: The prompt tokens are processed through the model to populate the KV cache (possibly in chunks for long prompts).
- Iterative decoding: Tokens are generated one at a time (or in speculative batches), with each new token sampled from the distribution defined by the model's output logits after the configured sampling settings are applied.
- Stop condition checking: After each token, the generator checks for end-of-sequence tokens, stop strings, maximum token limits, and other termination conditions.
- Result collection: Generated text is decoded from token IDs and returned.
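As a concrete illustration of this pipeline, here is a minimal sketch following the loading and generation calls shown in the exllamav2 examples; the model path, cache size, and token limit are placeholders, and parameter names should be verified against your installed version:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")                  # placeholder model directory
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)  # paged KV cache
model.load_autosplit(cache, progress=True)                  # split weights across GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# One blocking call covers prompt encoding, job creation, prefill,
# iterative decoding, stop-condition checks, and result collection:
output = generator.generate(prompt="Once upon a time,", max_new_tokens=200)
print(output)
```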
The method supports several advanced features:
- Batched generation: Passing a list of prompts generates completions for all of them concurrently using the dynamic batching system.
- Token healing: Rewinds the last token of the prompt and re-generates it to avoid tokenization artifacts at the prompt-completion boundary.
- Banned strings: Prevents specific text patterns from appearing in the output by backtracking when a banned string is about to be generated.
- Constrained decoding: Applies filters (grammar constraints, JSON schema, regex) to restrict which tokens can be sampled at each step.
- Classifier-free guidance (CFG): Generates with both conditioned (prompted) and unconditioned (empty prompt) forward passes, then interpolates their logits to strengthen adherence to the prompt.
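A sketch combining several of these features in one call is shown below. The gen_settings and stop_conditions arguments mirror the project's examples; passing banned_strings and completion_only directly to generate() is an assumption to check against your installed version:

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_k = 50
settings.top_p = 0.9

outputs = generator.generate(
    prompt=["The capital of France is", "The capital of Japan is"],  # list = batched
    max_new_tokens=50,
    gen_settings=settings,
    stop_conditions=[tokenizer.eos_token_id],
    banned_strings=["As an AI"],   # assumed kwarg: backtracks before emitting this
    completion_only=True,          # assumed kwarg: return only the continuations
)
print(outputs)  # one completion per prompt, in prompt order
```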
Usage
Use generate() when you need a simple, blocking API for text generation. This is the recommended entry point for:
- Single-shot completions
- Batch generation of multiple prompts
- Scripts and notebooks where streaming is not needed
- Any scenario where you want results returned as complete strings
For streaming (token-by-token) output, use the job-based API or the Streaming Generator instead.
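For comparison, a minimal streaming sketch using the job-based API, following the pattern in the project's dynamic generator examples:

```python
from exllamav2.generator import ExLlamaV2DynamicJob

job = ExLlamaV2DynamicJob(
    input_ids=tokenizer.encode("Once upon a time,", add_bos=True),
    max_new_tokens=200,
)
generator.enqueue(job)

# iterate() advances all active jobs one step and returns incremental results
while generator.num_remaining_jobs():
    for result in generator.iterate():
        print(result.get("text", ""), end="", flush=True)
```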
Theoretical Basis
Autoregressive Generation
```python
# Given prompt tokens x_1 ... x_n, generate continuation x_{n+1} ... x_{n+m}.
import torch

def generate(model, prompt_ids, max_new_tokens, eos_token_id, temperature=1.0):
    # "model" maps a token sequence to per-position logits over the vocabulary
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Forward pass: logits for the next position, conditioned on the
        # prompt and every token generated so far
        logits = model(torch.tensor(tokens))[-1]   # shape: (vocab_size,)
        # Apply the sampling strategy (temperature shown; top-k/top-p analogous)
        probs = torch.softmax(logits / temperature, dim=-1)
        # Sample the next token from the adjusted distribution
        next_token = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_token)
        # Check stop conditions (EOS token, stop strings, token limits, ...)
        if next_token == eos_token_id:
            break
    return tokens
```
Token Healing
```python
# Problem: tokenization artifacts at the prompt/completion boundary.
# "Hello " might tokenize as ["Hello", " "], while "Hello world"
# tokenizes as ["Hello", " world"]. Without healing, generation starts
# after the lone " " token and may produce awkward text.
#
# Solution: remove the last prompt token and let the model re-generate it.
prompt_ids = tokenize("Hello ")   # ["Hello", " "]   (tokenize is illustrative)
healed_ids = prompt_ids[:-1]      # ["Hello"]
# Generation resumes from ["Hello"], free to produce " world" naturally.
```
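In the Dynamic Generator itself, healing is requested per call; a short sketch, assuming the token_healing keyword accepted by generate() and by ExLlamaV2DynamicJob:

```python
# The trailing " " token is rewound and re-sampled together with the
# first completion token, avoiding the boundary artifact above.
output = generator.generate(
    prompt="Hello ",
    max_new_tokens=20,
    token_healing=True,
)
```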
Classifier-Free Guidance
```python
# CFG interpolates between conditioned and unconditioned logits:
#
#   logits_final = logits_uncond + cfg_scale * (logits_cond - logits_uncond)
#
# cfg_scale > 1.0 strengthens the prompt's influence;
# cfg_scale = 1.0 is equivalent to normal generation.
```
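The interpolation itself is a single line of tensor arithmetic; a self-contained sketch of the formula above (the function name apply_cfg is illustrative, not part of the library):

```python
import torch

def apply_cfg(logits_cond: torch.Tensor,
              logits_uncond: torch.Tensor,
              cfg_scale: float) -> torch.Tensor:
    # cfg_scale = 1.0 returns logits_cond unchanged (normal generation);
    # cfg_scale > 1.0 pushes the distribution further toward the prompt.
    return logits_uncond + cfg_scale * (logits_cond - logits_uncond)
```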