
Implementation:PacktPublishing LLM Engineers Handbook FastLanguageModel For Inference

From Leeroopedia


Implementation Name: FastLanguageModel For Inference
Type: Wrapper Doc (Unsloth)
Source File: llm_engineering/model/finetuning/finetune.py:L204-215
Workflow: LLM_Finetuning
Repo: PacktPublishing/LLM-Engineers-Handbook
Implements: Principle:PacktPublishing_LLM_Engineers_Handbook_Post_Training_Inference_Validation

Function Signatures

# Switch model to inference mode
FastLanguageModel.for_inference(model) -> model

# Generate text with streaming
model.generate(
    **inputs,
    streamer: TextStreamer,
    max_new_tokens: int,
    use_cache: bool,
) -> torch.Tensor

Imports

from unsloth import FastLanguageModel
from transformers import TextStreamer

Description

This implementation performs a quick post-training inference validation by switching the fine-tuned model to inference mode and generating a sample response. FastLanguageModel.for_inference() applies inference-specific optimizations (KV-cache, disabled gradients, fused kernels), and model.generate() produces text token-by-token with real-time streaming output via TextStreamer.

Key Code in Repository

# From llm_engineering/model/finetuning/finetune.py

FastLanguageModel.for_inference(model)

prompt = "Write a paragraph to introduce supervised fine-tuning."
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)

Step-by-Step Breakdown

Step 1: Switch to Inference Mode

FastLanguageModel.for_inference(model)

This method:

  • Disables gradient computation for all parameters.
  • Enables KV-cache for efficient autoregressive generation.
  • Activates Unsloth's optimized inference kernels.
  • Modifies the model in-place (returns the same model object).
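The modify-in-place-and-return pattern can be illustrated with a toy stand-in (this is a conceptual sketch, not Unsloth's actual internals; the class and flag names are invented for illustration):

```python
class ToyModel:
    """Minimal stand-in for a model object; not Unsloth's real class."""
    def __init__(self):
        self.requires_grad = True   # stands in for per-parameter gradient flags
        self.use_cache = False      # stands in for the KV-cache switch

def for_inference(model):
    # Mutate the model in place, then return the same object,
    # mirroring the documented behavior of FastLanguageModel.for_inference().
    model.requires_grad = False
    model.use_cache = True
    return model

m = ToyModel()
returned = for_inference(m)
assert returned is m                          # same object, modified in place
assert m.use_cache and not m.requires_grad    # inference-mode flags set
```

Because the model is modified in place, capturing the return value is optional; calling `FastLanguageModel.for_inference(model)` as a bare statement, as the repository code does, has the same effect.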

Step 2: Tokenize the Prompt

prompt = "Write a paragraph to introduce supervised fine-tuning."
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
  • The prompt is wrapped in a list to create a batch dimension.
  • return_tensors="pt" returns PyTorch tensors.
  • .to("cuda") moves input tensors to the GPU.
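Why the prompt is wrapped in a list can be shown with a toy whitespace "tokenizer" (illustration only; a real tokenizer produces subword IDs and attention masks): the outer list becomes the batch dimension, so one prompt yields a batch of size 1.

```python
def toy_tokenize(texts):
    """Toy whitespace 'tokenizer': each inner list of ids is one sequence."""
    vocab = {}
    batch = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
        batch.append(ids)
    return batch

inputs = toy_tokenize(["Write a paragraph to introduce supervised fine-tuning."])
# Shape is (batch=1, seq_len): one prompt, several token ids.
assert len(inputs) == 1
assert len(inputs[0]) == 7
```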

Step 3: Generate with Streaming

text_streamer = TextStreamer(tokenizer)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)
  • TextStreamer: Prints each generated token to stdout in real-time as it is decoded.
  • max_new_tokens=256: Limits generation to 256 new tokens (approximately 1-2 paragraphs).
  • use_cache=True: Enables KV-cache for efficient autoregressive generation.
  • The return value (generated token IDs) is discarded (_ =) since the streaming output is the primary purpose.
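The interplay of streaming and caching can be sketched with a toy autoregressive loop (pure-Python illustration; the "next token" rule and the list-based cache are invented stand-ins, not real decoding): each step processes only the newest token, reusing prior state, and emits the token immediately rather than waiting for generation to finish.

```python
def toy_generate(prompt_ids, max_new_tokens, stream=print):
    """Toy autoregressive loop: the next 'token' is the last id + 1.
    The list stands in for the KV-cache: each step reuses prior state
    instead of reprocessing the whole sequence from scratch."""
    cache = list(prompt_ids)       # state carried across decoding steps
    for _ in range(max_new_tokens):
        next_id = cache[-1] + 1    # with a cache, only the new token is processed
        cache.append(next_id)
        stream(next_id)            # emit each token as produced, like TextStreamer
    return cache

out = toy_generate([3, 5, 7], max_new_tokens=4, stream=lambda t: None)
assert out == [3, 5, 7, 8, 9, 10, 11]
```

Without a cache, each of the `max_new_tokens` steps would recompute attention over the entire growing sequence, which is what `use_cache=True` avoids.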

Parameters

model (Model): the fine-tuned model with LoRA adapters.
prompt (str): "Write a paragraph to introduce supervised fine-tuning.", the validation prompt.
max_new_tokens (int): 256, the maximum number of new tokens to generate.
use_cache (bool): True, enables the KV-cache for efficient generation.
streamer (TextStreamer): TextStreamer(tokenizer), streams decoded tokens to stdout in real time.

Outputs

  • Primary output: Generated text streamed to stdout via TextStreamer.
  • Return value: torch.Tensor of generated token IDs (discarded in this usage).
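If the return value were kept rather than discarded, the token IDs could be decoded back to text (with the real tokenizer this would be `tokenizer.batch_decode`); a toy stand-in for that round trip, with an invented three-entry vocabulary:

```python
# Toy id-to-token table, invented for illustration.
id_to_token = {0: "Write", 1: "a", 2: "paragraph"}

def toy_batch_decode(batch_ids):
    """Toy analogue of tokenizer.batch_decode: ids -> whitespace-joined strings."""
    return [" ".join(id_to_token[i] for i in ids) for ids in batch_ids]

texts = toy_batch_decode([[0, 1, 2]])
assert texts == ["Write a paragraph"]
```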

External Dependencies

unsloth: optimized inference mode via FastLanguageModel.for_inference()
transformers: TextStreamer for real-time token-by-token output
