Implementation:Hiyouga LLaMA Factory KT Engine
| Knowledge Sources | |
|---|---|
| Domains | Inference, CPU Offloading |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
KT Engine is the KTransformers inference engine implementation that enables CPU-offloaded large model inference with optional FlashInfer MLA acceleration and CUDA graph support.
Description
The KTransformersEngine class loads models with LLaMA Factory's standard model loader, then calls KTransformers' prefill_and_generate_capture function to generate tokens. The engine supports several advanced features:
- FlashInfer MLA (Multi-head Latent Attention) acceleration for DeepSeek V2/V3 architectures on NVIDIA GPUs with compute capability 8 or higher
- CUDA graph capture, which reduces kernel launch overhead
- Chunked prefill for long sequences
- A force_think mode that prepends the "<think>" token for reasoning models
The async _generate method runs generation in a background thread and streams tokens to the caller through an asyncio.Queue. Concurrency is limited by an asyncio.Semaphore whose size is read from the MAX_CONCURRENT environment variable.
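The thread-to-queue streaming pattern described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the engine's actual code: fake_token_source, stream_tokens, and handle_request are hypothetical names, and the blocking generator stands in for the KTransformers generation call.

```python
import asyncio
import os
import threading
from collections.abc import AsyncGenerator, Iterable, Iterator

_SENTINEL = object()  # marks the end of the token stream


def fake_token_source(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a blocking, token-by-token generator."""
    for word in prompt.split():
        yield word + " "


async def stream_tokens(blocking_tokens: Iterable[str]) -> AsyncGenerator[str, None]:
    """Run a blocking iterator in a thread and yield its items asynchronously."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        # asyncio.Queue is not thread-safe, so every put is scheduled
        # on the event loop from the worker thread.
        try:
            for token in blocking_tokens:
                loop.call_soon_threadsafe(queue.put_nowait, token)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = await queue.get()
        if item is _SENTINEL:
            break
        yield item


async def handle_request(semaphore: asyncio.Semaphore, prompt: str) -> str:
    """Hold one semaphore slot for the whole generation, then join the tokens."""
    async with semaphore:
        return "".join([tok async for tok in stream_tokens(fake_token_source(prompt))])


async def main() -> list[str]:
    # Same default as the documented MAX_CONCURRENT env var: one request at a time.
    semaphore = asyncio.Semaphore(int(os.getenv("MAX_CONCURRENT", "1")))
    return await asyncio.gather(
        handle_request(semaphore, "hello world"),
        handle_request(semaphore, "a b c"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With the default limit of 1, the second request waits until the first has released its slot, so the two generations never overlap.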
Usage
Use this engine to deploy very large models (e.g., DeepSeek V2/V3 with 236B+ parameters) on consumer hardware. Select it by setting --infer_backend ktransformers. The engine relies on KTransformers' hybrid CPU/GPU execution strategy to run models that exceed the memory of a single GPU.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/chat/kt_engine.py
- Lines: 1-284
Signature
class KTransformersEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    @staticmethod
    @torch.inference_mode()
    def _get_scores(
        model: "PreTrainedModelWrapper",
        tokenizer: "PreTrainedTokenizer",
        batch_input: list[str],
        input_kwargs: Optional[dict[str, Any]] = {},
    ) -> list[float]: ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncGenerator[str, None]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...

    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...

    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...
Import
from llamafactory.chat.kt_engine import KTransformersEngine
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelArguments | Yes | Model configuration including kt_maxlen, kt_use_cuda_graph, kt_mode, kt_force_think, chunk_size |
| data_args | DataArguments | Yes | Data configuration for template setup |
| finetuning_args | FinetuningArguments | Yes | Finetuning stage ("sft" enables generation) |
| generating_args | GeneratingArguments | Yes | Default generation parameters |
| messages | list[dict[str, str]] | Yes | Chat messages for generation |
| MAX_CONCURRENT | str (env var) | No | Maximum concurrent requests (default: "1") |
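The chunk_size setting listed under model_args relates to chunked prefill: a long prompt is processed in pieces rather than in one pass. The sketch below shows only the slicing step, assuming contiguous, non-overlapping chunks. The function name iter_prefill_chunks is hypothetical, and a real engine would feed each slice (with its cache positions) to the model instead of collecting it in a list.

```python
from collections.abc import Iterator


def iter_prefill_chunks(token_ids: list[int], chunk_size: int) -> Iterator[tuple[int, list[int]]]:
    """Yield (start_offset, chunk) pairs covering token_ids in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(token_ids), chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield start, token_ids[start:start + chunk_size]


# A 10-token prompt with chunk_size=4 is split into chunks of 4, 4, and 2 tokens.
prompt_ids = list(range(10))
chunks = list(iter_prefill_chunks(prompt_ids, chunk_size=4))
```

The start offset is returned alongside each chunk because a prefill step generally needs to know where the chunk sits in the full sequence.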
Outputs
| Returned by | Type | Description |
|---|---|---|
| chat | list[Response] | Generated responses, each with response text, token counts, and finish reason |
| stream_chat | AsyncGenerator[str, None] | Token-by-token streaming output |
| get_scores | list[float] | Reward model scores (only for non-generative models) |
Usage Examples
from llamafactory.chat import ChatModel
# Use KTransformers backend for large model inference
chat_model = ChatModel(args={
    "model_name_or_path": "deepseek-ai/DeepSeek-V2-Chat",
    "template": "deepseek2",
    "infer_backend": "ktransformers",
    "kt_maxlen": 4096,
    "kt_use_cuda_graph": True,
    "kt_mode": "normal",
})
messages = [{"role": "user", "content": "Explain quantum computing."}]
responses = chat_model.chat(messages)
print(responses[0].response_text)
Related Pages
- Hiyouga_LLaMA_Factory_Base_Engine - Abstract base class this engine implements
- Hiyouga_LLaMA_Factory_Chat_Model - Facade that selects and delegates to this engine
- Hiyouga_LLaMA_Factory_VLLM_Engine - Alternative vLLM-based engine
- Hiyouga_LLaMA_Factory_SGLang_Engine - Alternative SGLang-based engine