Workflow:Hiyouga LLaMA Factory Model Inference and Serving
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, API_Serving, Deployment |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
End-to-end process for loading a fine-tuned language model and serving it through interactive chat, web interface, or OpenAI-compatible API endpoints.
Description
This workflow covers deploying fine-tuned models for inference using LLaMA-Factory's unified inference system. The framework provides four inference backends (HuggingFace Transformers, vLLM, SGLang, and KTransformers) behind a common BaseEngine interface, enabling consistent behavior across deployment targets. Models can be served through three interfaces: CLI chat for interactive testing, a Gradio web chat interface, or a FastAPI-based OpenAI-compatible API server. The workflow supports loading both full fine-tuned models and base models with LoRA adapters, applying the correct chat template automatically.
Usage
Execute this workflow after training is complete and you need to use the model for inference. Choose CLI chat for quick testing, web chat for demonstration, or the API server for production integration. The API server provides drop-in compatibility with OpenAI's chat completions API, allowing existing client code to work without modification.
Execution Steps
Step 1: Configuration
Define the inference configuration specifying the model path, optional adapter path, inference backend, and generation parameters. The configuration is minimal compared to training, typically requiring only the model name and adapter checkpoint path.
Key considerations:
- Set `model_name_or_path` to the base model or merged model path
- For LoRA models, set `adapter_name_or_path` to the adapter checkpoint
- Choose the inference backend: default (HuggingFace), vllm, sglang, or ktransformers
- Generation parameters (`temperature`, `top_p`, `max_new_tokens`) can be set in the config or at request time
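A minimal inference config might look like the following YAML, passed to a command such as `llamafactory-cli chat config.yaml`; the model and adapter paths here are illustrative, and the generation parameters are optional:

```yaml
# Illustrative inference config (paths are examples, not defaults)
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft   # omit for a merged/full model
template: llama3
infer_backend: huggingface   # or: vllm, sglang, ktransformers

# Optional generation defaults, overridable per request
temperature: 0.95
top_p: 0.7
max_new_tokens: 512
```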
Step 2: Model and Tokenizer Loading
Load the model, tokenizer, and processor with the appropriate configuration for inference. The model is loaded in evaluation mode with no gradient computation. If an adapter path is specified, the LoRA weights are loaded and merged or applied on-the-fly.
What happens:
- The tokenizer is loaded with the model's chat template configuration
- The model is loaded with inference-optimized settings (no gradient checkpointing, eval mode)
- For LoRA models: adapter weights are loaded and optionally merged into the base model
- Quantization can be applied for memory-efficient inference (4-bit, 8-bit)
- KV cache type is configured for optimized generation
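The LoRA merge in the list above folds the low-rank update into the base weight using the standard formula W' = W + (α/r)·BA. A toy numeric sketch with NumPy (shapes and values are illustrative; real models apply this per target layer):

```python
# Toy illustration of merging a LoRA adapter into a base weight:
# W_merged = W + (alpha / r) * (B @ A)
import numpy as np

d_out, d_in, r, alpha = 4, 4, 2, 4.0
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # LoRA down-projection
B = rng.standard_normal((d_out, r))      # LoRA up-projection

W_merged = W + (alpha / r) * (B @ A)
# After merging, the forward pass needs no extra adapter matmuls,
# so the merged matrix can be served like an ordinary model weight.
```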
Step 3: Engine Initialization
Initialize the selected inference engine that wraps the model with generation capabilities. The engine provides both synchronous and asynchronous interfaces, handles chat template formatting, manages generation parameters, and supports streaming output.
What happens:
- The ChatModel facade selects the appropriate engine based on configuration:
- HuggingFace engine: Direct model.generate() with TextIteratorStreamer for streaming
- vLLM engine: AsyncLLMEngine with continuous batching for high throughput
- SGLang engine: HTTP-based communication with auto-managed server process
- KTransformers engine: CPU-offloaded inference for large MoE models
- The engine registers the chat template for proper message formatting
- Tool/function calling support is configured if the model supports it
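The common interface behind these backends can be pictured as a small abstract base class. The sketch below is illustrative only: the method names echo the chat/streaming split described above, but the toy `EchoEngine` backend is invented here purely to show how an engine plugs in:

```python
# Illustrative sketch of a common async engine interface; not
# LLaMA-Factory's actual BaseEngine.
import asyncio
from abc import ABC, abstractmethod

class BaseEngine(ABC):
    @abstractmethod
    async def chat(self, messages, **gen_kwargs):
        """Return the full response for a conversation."""

    @abstractmethod
    async def stream_chat(self, messages, **gen_kwargs):
        """Yield response tokens as they are generated."""

class EchoEngine(BaseEngine):
    """Toy backend: echoes the last user message."""
    async def chat(self, messages, **gen_kwargs):
        return messages[-1]["content"]

    async def stream_chat(self, messages, **gen_kwargs):
        for tok in messages[-1]["content"].split():
            yield tok

resp = asyncio.run(EchoEngine().chat([{"role": "user", "content": "hello world"}]))
print(resp)  # hello world
```

Because every backend satisfies the same interface, the chat CLI, web UI, and API server can call `chat`/`stream_chat` without knowing which engine is underneath.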
Step 4: Interface Launch
Launch the selected user interface for interacting with the model. Each interface mode provides different capabilities suited to different use cases.
Interface modes:
- CLI Chat (`llamafactory-cli chat`): interactive terminal chat for testing; supports multi-turn conversation and system prompts
- Web Chat (`llamafactory-cli webchat`): Gradio-based web interface with multimodal input support (images, video, audio)
- API Server (`llamafactory-cli api`): FastAPI server exposing OpenAI-compatible endpoints:
  - `POST /v1/chat/completions` for chat inference (streaming and non-streaming)
  - `POST /v1/completions` for text completion
  - `GET /v1/models` for model listing
  - `POST /v1/score/evaluation` for reward model scoring
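A client talks to the API server exactly as it would to OpenAI. The sketch below only builds the request body; the `build_chat_request` helper is invented for illustration, and the field names follow the OpenAI chat completions schema:

```python
# Hypothetical helper that builds an OpenAI-compatible request body
# for POST /v1/chat/completions.
import json

def build_chat_request(messages, model="local-model", stream=False,
                       temperature=0.95, top_p=0.7):
    return {
        "model": model,
        "messages": messages,
        "stream": stream,
        "temperature": temperature,
        "top_p": top_p,
    }

body = build_chat_request([{"role": "user", "content": "Hello!"}])
print(json.dumps(body, indent=2))
```

Send the body with any HTTP client, e.g. `requests.post("http://localhost:8000/v1/chat/completions", json=body)` (assuming the server listens on localhost:8000); existing OpenAI SDK clients work by pointing their `base_url` at the server.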
Step 5: Request Processing
For each incoming request, the engine formats the conversation using the model's chat template, generates a response with the specified parameters, and returns the output. Streaming responses are supported for real-time output.
What happens:
- Input messages are formatted using the model-specific chat template
- Tool definitions are injected if function calling is requested
- The engine runs generation with configured parameters (temperature, top_p, repetition_penalty)
- For streaming: tokens are yielded as they are generated
- For API mode: responses are formatted as OpenAI-compatible JSON
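The first step, chat-template formatting, can be illustrated with a ChatML-style template. The exact template is model-specific and applied automatically by the engine; this function is a simplified stand-in, not LLaMA-Factory's implementation:

```python
# Simplified ChatML-style formatter: wraps each message in role
# markers and appends the assistant header so generation continues
# from there. Real chat templates vary by model.
def format_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi there"},
])
print(prompt)
```

The trailing assistant header is what makes the model produce the next assistant turn rather than continuing the user's text.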