Implementation:PacktPublishing LLM Engineers Handbook FastLanguageModel For Inference
| Field | Value |
|---|---|
| Implementation Name | FastLanguageModel For Inference |
| Type | Wrapper Doc (Unsloth) |
| Source File | llm_engineering/model/finetuning/finetune.py:L204-215 |
| Workflow | LLM_Finetuning |
| Repo | PacktPublishing/LLM-Engineers-Handbook |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Post_Training_Inference_Validation |
Function Signatures
```python
# Switch model to inference mode
FastLanguageModel.for_inference(model) -> model

# Generate text with streaming
model.generate(
    **inputs,
    streamer: TextStreamer,
    max_new_tokens: int,
    use_cache: bool,
) -> torch.Tensor
```
Imports
```python
from unsloth import FastLanguageModel
from transformers import TextStreamer
```
Description
This implementation performs a quick post-training inference validation by switching the fine-tuned model to inference mode and generating a sample response. FastLanguageModel.for_inference() applies inference-specific optimizations (KV-cache, disabled gradients, fused kernels), and model.generate() produces text token-by-token with real-time streaming output via TextStreamer.
Key Code in Repository
```python
# From llm_engineering/model/finetuning/finetune.py
FastLanguageModel.for_inference(model)

prompt = "Write a paragraph to introduce supervised fine-tuning."
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
```
Step-by-Step Breakdown
Step 1: Switch to Inference Mode
```python
FastLanguageModel.for_inference(model)
```
This method:
- Disables gradient computation for all parameters.
- Enables KV-cache for efficient autoregressive generation.
- Activates Unsloth's optimized inference kernels.
- Modifies the model in-place (returns the same model object).
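The sketch below (not from the repository) illustrates the in-place behavior described above; the `for_training()` counterpart mentioned in the comments is an assumption about the Unsloth API rather than something the repository uses.

```python
from unsloth import FastLanguageModel

# Sketch: `model` is the fine-tuned model loaded earlier in finetune.py.
# for_inference() modifies it in place and returns the same object,
# so the return value can safely be ignored.
returned_model = FastLanguageModel.for_inference(model)
assert returned_model is model

# If further fine-tuning were needed afterwards, Unsloth also exposes
# FastLanguageModel.for_training(model) to switch back (assumption).
```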
Step 2: Tokenize the Prompt
prompt = "Write a paragraph to introduce supervised fine-tuning."
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
- The prompt is wrapped in a list to create a batch dimension.
return_tensors="pt"returns PyTorch tensors..to("cuda")moves input tensors to the GPU.
Step 3: Generate with Streaming
```python
text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
```
- `TextStreamer`: prints each generated token to stdout in real time as it is decoded.
- `max_new_tokens=256`: limits generation to 256 new tokens (approximately 1-2 paragraphs).
- `use_cache=True`: enables the KV-cache for efficient autoregressive generation.
- The return value (generated token IDs) is discarded (`_ =`) since the streaming output is the primary purpose.
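`TextStreamer` also accepts a `skip_prompt` flag and decoding keyword arguments such as `skip_special_tokens` (standard `transformers` options; the repository code uses the defaults). A hedged variant for cleaner console output:

```python
from transformers import TextStreamer

# Variant (not in the repository): stream only the newly generated text,
# omitting the prompt echo and special tokens from the console output.
text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
```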
Parameters
| Parameter | Type | Value | Description |
|---|---|---|---|
| `model` | Model | — | The fine-tuned model with LoRA adapters. |
| `prompt` | `str` | `"Write a paragraph to introduce supervised fine-tuning."` | The validation prompt. |
| `max_new_tokens` | `int` | `256` | Maximum number of tokens to generate. |
| `use_cache` | `bool` | `True` | Enable KV-cache for efficient generation. |
| `streamer` | `TextStreamer` | — | Streams decoded tokens to stdout in real time. |
Outputs
- Primary output: Generated text streamed to stdout via `TextStreamer`.
- Return value: `torch.Tensor` of generated token IDs (discarded in this usage).
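If the generated text needs to be captured rather than only streamed, the return value can be decoded with the tokenizer. A minimal sketch using standard `transformers` calls (not part of the repository code):

```python
# Keep the generated token IDs instead of discarding them.
output_ids = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
# batch_decode returns one string per sequence in the batch; the decoded
# text includes the prompt followed by the newly generated continuation.
generated_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(generated_text)
```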
External Dependencies
| Package | Purpose |
|---|---|
| `unsloth` | Optimized inference mode via `FastLanguageModel.for_inference()` |
| `transformers` | `TextStreamer` for real-time token-by-token output |