
Implementation:Hiyouga LLaMA Factory KT Engine

From Leeroopedia


Knowledge Sources
Domains Inference, CPU Offloading
Last Updated 2026-02-06 19:00 GMT

Overview

KT Engine is the KTransformers inference engine implementation that enables CPU-offloaded large model inference with optional FlashInfer MLA acceleration and CUDA graph support.

Description

The KTransformersEngine class loads models with LLaMA Factory's standard model loader, then generates tokens with KTransformers' prefill_and_generate_capture function. The engine supports several advanced features: FlashInfer MLA (Multi-head Latent Attention) acceleration for DeepSeek V2/V3 architectures on NVIDIA GPUs with compute capability >= 8, CUDA graph capture to reduce kernel launch overhead, chunked prefill for long sequences, and a force_think mode that prepends "<think>" tokens for reasoning models. The async _generate method runs generation in a background thread and streams tokens through an asyncio.Queue. Concurrency is limited by an asyncio.Semaphore whose size is set by the MAX_CONCURRENT environment variable.
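The background-thread streaming pattern described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: fake_blocking_generate, stream_tokens, and collect are hypothetical names, and the blocking generator stands in for the real KTransformers call. It assumes tokens are handed from the worker thread to the event loop through an asyncio.Queue and that a semaphore sized from MAX_CONCURRENT gates each request.

```python
import asyncio
import os
import threading
from collections.abc import AsyncGenerator, Iterable

# Assumed default of "1", matching the MAX_CONCURRENT default listed for this engine.
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT", "1"))
_DONE = object()  # sentinel marking the end of the stream


def fake_blocking_generate(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a blocking, token-by-token generation call."""
    for token in prompt.split():
        yield token + " "


async def stream_tokens(
    prompt: str, semaphore: asyncio.Semaphore
) -> AsyncGenerator[str, None]:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        # Runs in a worker thread. asyncio.Queue is not thread-safe, so every
        # put is scheduled on the event loop with call_soon_threadsafe.
        try:
            for token in fake_blocking_generate(prompt):
                loop.call_soon_threadsafe(queue.put_nowait, token)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    async with semaphore:  # at most MAX_CONCURRENT generations at once
        threading.Thread(target=producer, daemon=True).start()
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            yield item


async def collect(prompt: str) -> str:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return "".join([token async for token in stream_tokens(prompt, semaphore)])


if __name__ == "__main__":
    print(asyncio.run(collect("tokens arrive one by one")))
```

Handing each token back with call_soon_threadsafe keeps the event loop responsive while the blocking generation runs elsewhere; the sentinel lets the consumer stop without polling the thread.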

Usage

Use this engine to deploy very large models (e.g., DeepSeek V2/V3 with 236B+ parameters) on consumer hardware. Select it by setting --infer_backend ktransformers. It relies on KTransformers' CPU/GPU hybrid execution strategy to fit models that exceed single-GPU memory.

Code Reference

Source Location

Signature

class KTransformersEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    @staticmethod
    @torch.inference_mode()
    def _get_scores(
        model: "PreTrainedModelWrapper",
        tokenizer: "PreTrainedTokenizer",
        batch_input: list[str],
        input_kwargs: Optional[dict[str, Any]] = {},
    ) -> list[float]: ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncGenerator[str, None]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...
    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...
    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...

Import

from llamafactory.chat.kt_engine import KTransformersEngine

I/O Contract

Inputs

Name | Type | Required | Description
model_args | ModelArguments | Yes | Model configuration including kt_maxlen, kt_use_cuda_graph, kt_mode, kt_force_think, chunk_size
data_args | DataArguments | Yes | Data configuration for template setup
finetuning_args | FinetuningArguments | Yes | Finetuning stage ("sft" enables generation)
generating_args | GeneratingArguments | Yes | Default generation parameters
messages | list[dict[str, str]] | Yes | Chat messages for generation
MAX_CONCURRENT | str (env var) | No | Maximum concurrent requests (default: "1")
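The chunk_size field relates to the chunked prefill feature. As a rough illustration of the idea only, the sketch below splits a prompt's token IDs into consecutive slices of at most chunk_size tokens. The helper name iter_prefill_chunks is hypothetical, and how KTransformers actually feeds chunks to the model is not shown on this page.

```python
from collections.abc import Iterator


def iter_prefill_chunks(input_ids: list[int], chunk_size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of at most chunk_size prompt tokens.

    Hypothetical helper: it only illustrates the splitting step of chunked
    prefill, not the model forward pass or KV-cache handling.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(input_ids), chunk_size):
        yield input_ids[start : start + chunk_size]


if __name__ == "__main__":
    prompt_ids = list(range(10))
    for chunk in iter_prefill_chunks(prompt_ids, chunk_size=4):
        print(chunk)
```

Processing the prompt in bounded slices caps the size of each prefill step, which is why the technique suits long sequences.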

Outputs

Method | Return type | Description
chat | list[Response] | Generated response with text, token counts, and finish reason
stream_chat | AsyncGenerator[str, None] | Token-by-token streaming output
get_scores | list[float] | Reward model scores (only for non-generative models)

Usage Examples

from llamafactory.chat import ChatModel

# Use KTransformers backend for large model inference
chat_model = ChatModel(args={
    "model_name_or_path": "deepseek-ai/DeepSeek-V2-Chat",
    "template": "deepseek2",
    "infer_backend": "ktransformers",
    "kt_maxlen": 4096,
    "kt_use_cuda_graph": True,
    "kt_mode": "normal",
})

messages = [{"role": "user", "content": "Explain quantum computing."}]
responses = chat_model.chat(messages)
print(responses[0].response_text)
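The example above calls chat synchronously, while the engine's own chat method is async. One common way to bridge the two is to run an event loop in a background thread and submit coroutines to it. The sketch below shows that pattern under stated assumptions: AsyncEngineStub and SyncChatFacade are hypothetical classes for illustration, and this page does not confirm that ChatModel is implemented this way.

```python
import asyncio
import threading


class AsyncEngineStub:
    """Hypothetical engine exposing an async chat method."""

    async def chat(self, messages: list[dict[str, str]]) -> list[str]:
        await asyncio.sleep(0)  # stand-in for awaiting real generation
        return ["echo: " + messages[-1]["content"]]


class SyncChatFacade:
    """Hypothetical synchronous wrapper; not the real ChatModel."""

    def __init__(self, engine: AsyncEngineStub) -> None:
        self._engine = engine
        self._loop = asyncio.new_event_loop()
        # The loop runs in a daemon thread so synchronous callers can submit work to it.
        threading.Thread(target=self._loop.run_forever, daemon=True).start()

    def chat(self, messages: list[dict[str, str]]) -> list[str]:
        # Schedule the coroutine on the background loop and block for its result.
        future = asyncio.run_coroutine_threadsafe(
            self._engine.chat(messages), self._loop
        )
        return future.result()


if __name__ == "__main__":
    facade = SyncChatFacade(AsyncEngineStub())
    print(facade.chat([{"role": "user", "content": "Explain quantum computing."}]))
```

Keeping one long-lived loop avoids creating a new event loop per call and lets several synchronous callers share the same async engine.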
