
Workflow:Hiyouga LLaMA Factory Model Inference and Serving

From Leeroopedia


Knowledge Sources
Domains: LLMs, Inference, API_Serving, Deployment
Last Updated: 2026-02-06 19:00 GMT

Overview

End-to-end process for loading a fine-tuned language model and serving it through interactive chat, web interface, or OpenAI-compatible API endpoints.

Description

This workflow covers deploying fine-tuned models for inference using LLaMA-Factory's unified inference system. The framework provides four inference backends (HuggingFace Transformers, vLLM, SGLang, and KTransformers) behind a common BaseEngine interface, enabling consistent behavior across deployment targets. Models can be served through three interfaces: CLI chat for interactive testing, a Gradio web chat interface, or a FastAPI-based OpenAI-compatible API server. The workflow supports loading both full fine-tuned models and base models with LoRA adapters, applying the correct chat template automatically.

Usage

Execute this workflow after training is complete and you need to use the model for inference. Choose CLI chat for quick testing, web chat for demonstration, or the API server for production integration. The API server provides drop-in compatibility with OpenAI's chat completions API, allowing existing client code to work without modification.

Execution Steps

Step 1: Configuration

Define the inference configuration specifying the model path, optional adapter path, inference backend, and generation parameters. The configuration is minimal compared to training, typically requiring only the model name and adapter checkpoint path.

Key considerations:

  • Set model_name_or_path to the base model or merged model path
  • For LoRA models, set adapter_name_or_path to the adapter checkpoint
  • Choose the inference backend via infer_backend: huggingface (the default), vllm, sglang, or ktransformers
  • Generation parameters (temperature, top_p, max_new_tokens) can be set in config or at request time
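A minimal inference configuration might look like the following. The key names follow the example configs shipped with LLaMA-Factory; the model and adapter paths are placeholders to replace with your own:

```yaml
# Illustrative inference config (paths are placeholders)
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft   # omit for full/merged models
template: llama3
infer_backend: huggingface   # or: vllm, sglang, ktransformers

# Optional generation defaults (can also be overridden per request)
temperature: 0.7
top_p: 0.9
max_new_tokens: 512
```

Such a file is then passed to the chosen interface command, e.g. `llamafactory-cli chat <config.yaml>`.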

Step 2: Model and Tokenizer Loading

Load the model, tokenizer, and processor with the appropriate configuration for inference. The model is loaded in evaluation mode with no gradient computation. If an adapter path is specified, the LoRA weights are loaded and merged or applied on-the-fly.

What happens:

  • The tokenizer is loaded with the model's chat template configuration
  • The model is loaded with inference-optimized settings (no gradient checkpointing, eval mode)
  • For LoRA models: adapter weights are loaded and optionally merged into the base model
  • Quantization can be applied for memory-efficient inference (4-bit, 8-bit)
  • KV cache type is configured for optimized generation
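The loading decisions above can be sketched as a small helper that assembles keyword arguments for a `from_pretrained`-style call. This is an illustrative simplification, not LLaMA-Factory's actual loader; the function and key names are stand-ins chosen for explanation:

```python
# Simplified sketch of the loader's decision logic: inference-oriented
# dtype, KV cache enabled, and optional 4-/8-bit quantization flags.
def build_load_kwargs(quantization_bit=None, use_cache=True):
    """Assemble from_pretrained-style keyword arguments for inference."""
    kwargs = {
        "torch_dtype": "bfloat16",  # inference default; no gradients needed
        "use_cache": use_cache,     # enable the KV cache for generation
    }
    if quantization_bit == 4:
        kwargs["load_in_4bit"] = True   # memory-efficient 4-bit inference
    elif quantization_bit == 8:
        kwargs["load_in_8bit"] = True
    return kwargs

print(build_load_kwargs(quantization_bit=4))
```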

Step 3: Engine Initialization

Initialize the selected inference engine that wraps the model with generation capabilities. The engine provides both synchronous and asynchronous interfaces, handles chat template formatting, manages generation parameters, and supports streaming output.

What happens:

  • The ChatModel facade selects the appropriate engine based on configuration:
    • HuggingFace engine: Direct model.generate() with TextIteratorStreamer for streaming
    • vLLM engine: AsyncLLMEngine with continuous batching for high throughput
    • SGLang engine: HTTP-based communication with auto-managed server process
    • KTransformers engine: CPU-offloaded inference for large MoE models
  • The engine registers the chat template for proper message formatting
  • Tool/function calling support is configured if the model supports it
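The facade's backend selection can be sketched as a simple registry lookup. The class names below are simplified stand-ins for the real engine classes, shown only to illustrate the dispatch pattern:

```python
# Hedged sketch of the ChatModel facade's backend dispatch: one common
# interface, with the concrete engine chosen from the config value.
class HuggingfaceEngine: ...
class VllmEngine: ...
class SGLangEngine: ...
class KTransformersEngine: ...

ENGINES = {
    "huggingface": HuggingfaceEngine,
    "vllm": VllmEngine,
    "sglang": SGLangEngine,
    "ktransformers": KTransformersEngine,
}

def create_engine(infer_backend: str = "huggingface"):
    """Instantiate the engine registered for the requested backend."""
    try:
        return ENGINES[infer_backend]()
    except KeyError:
        raise ValueError(f"Unknown inference backend: {infer_backend}")
```

Because every engine sits behind the same interface, callers interact with chat, web, and API modes identically regardless of which backend is active.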

Step 4: Interface Launch

Launch the selected user interface for interacting with the model. Each interface mode provides different capabilities suited to different use cases.

Interface modes:

  • CLI Chat (llamafactory-cli chat): Interactive terminal chat for testing, supports multi-turn conversation and system prompts
  • Web Chat (llamafactory-cli webchat): Gradio-based web interface with multimodal input support (images, video, audio)
  • API Server (llamafactory-cli api): FastAPI server exposing OpenAI-compatible endpoints:
    • POST /v1/chat/completions for chat inference (streaming and non-streaming)
    • POST /v1/completions for text completion
    • GET /v1/models for model listing
    • POST /v1/score/evaluation for reward model scoring
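A request to the chat completions endpoint follows the OpenAI schema; the model name below is illustrative:

```json
{
  "model": "llama3-8b-sft",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize LoRA in one sentence."}
  ],
  "temperature": 0.7,
  "stream": false
}
```

Setting `"stream": true` switches the response to server-sent events, matching OpenAI's streaming format.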

Step 5: Request Processing

For each incoming request, the engine formats the conversation using the model's chat template, generates a response with the specified parameters, and returns the output. Streaming responses are supported for real-time output.

What happens:

  • Input messages are formatted using the model-specific chat template
  • Tool definitions are injected if function calling is requested
  • The engine runs generation with configured parameters (temperature, top_p, repetition_penalty)
  • For streaming: tokens are yielded as they are generated
  • For API mode: responses are formatted as OpenAI-compatible JSON
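The first step above, turning a message list into prompt text, can be illustrated with a ChatML-style template. The actual template is model-specific and applied automatically by the engine; this sketch only shows the shape of the transformation:

```python
# Illustrative ChatML-style chat formatting (not the engine's actual
# template): wrap each message in role markers, then cue the assistant.
def format_chat(messages, system=None):
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>")
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # generation starts here
    return "\n".join(parts)

prompt = format_chat([{"role": "user", "content": "Hi"}], system="Be brief.")
print(prompt)
```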

Execution Diagram
