Workflow:Mlc ai Mlc llm Disaggregated Serving

From Leeroopedia


Knowledge Sources
Domains: LLMs, Distributed_Serving, Microserving, KV_Cache_Management
Last Updated: 2026-02-09 20:00 GMT

Overview

End-to-end process for deploying MLC-LLM with disaggregated prefill-decode serving across multiple engine instances using the microserving API and a custom router for cross-engine KV cache orchestration.

Description

This workflow implements disaggregated serving, an advanced deployment pattern that places the compute-intensive prefill phase and the memory-bound decode phase on separate engine instances (potentially on different GPUs). A custom router orchestrates a three-phase handoff: it prepares the decode engine to receive KV cache data, has the prefill engine send the prefilled KV cache to the decode engine, and then starts token generation on the decode engine. Letting each engine specialize in one phase can improve overall throughput, and the design supports preemption-based rescheduling when decode capacity is constrained.

Key outputs:

  • Multi-engine serving deployment with specialized prefill and decode instances
  • Custom router handling request orchestration across engines
  • Cross-engine KV cache transfer via microserving protocol
  • Streaming responses to clients with preemption handling

Usage

Execute this workflow when you need to maximize inference throughput for high-concurrency LLM serving scenarios, particularly when prefill latency for long prompts is a bottleneck. Disaggregated serving is appropriate for production deployments with multiple GPUs where separating prefill and decode workloads can improve overall system utilization. This is an advanced deployment pattern that builds upon the standard REST API serving workflow.

Execution Steps

Step 1: Prepare multi-engine infrastructure

Set up the hardware and software environment for running multiple MLC-LLM engine instances. Each engine instance requires its own GPU allocation. The prefill engine handles prompt processing and KV cache generation, while the decode engine handles autoregressive token generation. Both engines must load the same model with compatible configurations, and each can use tensor parallelism across multiple GPUs within its instance.

Key considerations:

  • Each engine instance is assigned a dedicated set of GPUs via device_id_starts
  • The prefill engine and decode engine can have different tensor parallelism configurations
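  • The router must be able to reach every engine over HTTP, since it issues the microserving calls through aiohttp; the engines also need a network path between them for the KV cache transfer itself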
  • Both engines must load identical model weights and use the same model library
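
The GPU allocation described above can be illustrated with a small helper. This is a minimal sketch that assumes engines are packed onto consecutive GPU indices, so each engine's starting device ID is the running total of the GPUs claimed by the engines before it. The function name compute_device_id_starts is hypothetical and is not part of MLC-LLM.

```python
from itertools import accumulate
from typing import List


def compute_device_id_starts(endpoint_num_gpus: List[int]) -> List[int]:
    """Return the first GPU index of each engine, assuming contiguous packing."""
    if any(n <= 0 for n in endpoint_num_gpus):
        raise ValueError("every engine needs at least one GPU")
    # Exclusive prefix sum: engine i starts where engines 0..i-1 end.
    return list(accumulate(endpoint_num_gpus, initial=0))[:-1]


# Example: a 2-GPU prefill engine followed by a 2-GPU decode engine.
starts = compute_device_id_starts([2, 2])
print(starts)
```

With two 2-GPU engines, the prefill engine would own GPUs 0 and 1 and the decode engine GPUs 2 and 3 under this packing assumption.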

Step 2: Implement the custom router

Create a custom router class extending the base Router that implements the translate_request method. This method defines the orchestration logic for disaggregated serving: it receives incoming chat completion requests and coordinates the three-phase handoff between prefill and decode engines. The router manages request lifecycle, handles preemption signals from the decode engine, and yields streaming responses back to the client.

Key considerations:

  • The router subclasses mlc_llm.router.Router and overrides translate_request
  • translate_request is an async generator that yields completion responses
  • Preemption handling requires yielding None when the decode engine returns finish_reason "preempt"
  • The router has access to server_urls and device_id_starts for directing requests to the correct engines
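
The shape of such a router can be sketched without the real library. In the example below, RouterStub is a hypothetical stand-in for mlc_llm.router.Router, fake_decode_stream is a hypothetical replacement for the decode engine's HTTP stream, and the dictionary-based request and response objects are assumptions made to keep the sketch self-contained. It shows only the two properties listed above: translate_request is an async generator, and it yields None when the stream reports finish_reason "preempt".

```python
import asyncio
from typing import Any, AsyncGenerator, Dict, List, Optional

Chunk = Optional[Dict[str, Any]]


class RouterStub:
    """Hypothetical stand-in for mlc_llm.router.Router (not the real class)."""

    def __init__(self, server_urls: List[str], device_id_starts: List[int]) -> None:
        self.server_urls = server_urls
        self.device_id_starts = device_id_starts


async def fake_decode_stream(preempt: bool) -> AsyncGenerator[Dict[str, Any], None]:
    """Hypothetical decode-engine stream; a real one would arrive over HTTP."""
    yield {"text": "Hello", "finish_reason": None}
    yield {"text": "", "finish_reason": "preempt" if preempt else "stop"}


class SketchRouter(RouterStub):
    def __init__(self, server_urls, device_id_starts, preempt=False):
        super().__init__(server_urls, device_id_starts)
        self.preempt = preempt  # demo hook: force the fake engine to preempt

    async def translate_request(
        self, request: Dict[str, Any], request_id: str
    ) -> AsyncGenerator[Chunk, None]:
        # The prefill-decode handoff would run here before streaming begins.
        async for chunk in fake_decode_stream(self.preempt):
            if chunk["finish_reason"] == "preempt":
                yield None  # signal that the request should be retried
                return
            yield {**chunk, "id": request_id}


async def collect(router: SketchRouter, prompt: str, request_id: str) -> List[Chunk]:
    """Drain translate_request into a list so the chunks can be inspected."""
    return [c async for c in router.translate_request({"prompt": prompt}, request_id)]
```

Calling asyncio.run(collect(SketchRouter(urls, [0, 2]), "hi", "req-1")) returns the streamed chunks in order; constructing the router with preempt=True makes the last element None instead of a "stop" chunk.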

Step 3: Orchestrate prefill-decode handoff

Implement the three-phase KV cache handoff within the custom router. Phase one calls prep_recv on the decode engine to reserve KV cache locations and obtain address information. Phase two calls remote_send on the prefill engine, passing the KV address information so it can compute and transfer the prefilled KV cache to the decode engine. Phase three calls start_generate on the decode engine to begin autoregressive token generation using the transferred KV cache.

Key considerations:

  • The prep_recv call returns kv_addr_info that must be forwarded to remote_send
  • The recv_rank parameter in remote_send identifies the target decode engine by its device ID start index
  • The begin and end parameters define the token range boundaries for the KV cache transfer
  • All three phases use the microserving protocol endpoints (PrepRecvRequest, RemoteSendRequest, StartGenerateRequest)
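
The ordering and data flow of the three phases can be sketched with in-memory stand-ins. FakeDecodeEngine and FakePrefillEngine are hypothetical classes, not MLC-LLM types, and their method signatures are simplified assumptions that keep only the names used above (prep_recv, remote_send, start_generate, kv_addr_info, recv_rank, begin, end). The choice to transfer every prompt token except the last, leaving that token for the decode engine to process, is also an assumption of this sketch.

```python
import asyncio
from typing import Any, AsyncGenerator, Dict, List, Optional


class FakeDecodeEngine:
    """Hypothetical decode engine that stores received KV entries per request."""

    def __init__(self) -> None:
        self.kv: Dict[str, List[Optional[str]]] = {}

    async def prep_recv(self, request_id: str, end: int) -> Dict[str, Any]:
        # Phase 1: reserve slots for tokens [0, end) and describe where they live.
        self.kv[request_id] = [None] * end
        return {"request_id": request_id, "num_slots": end}

    def write_kv(self, kv_addr_info: Dict[str, Any], begin: int, entries: List[str]) -> None:
        slots = self.kv[kv_addr_info["request_id"]]
        slots[begin : begin + len(entries)] = entries

    async def start_generate(
        self, request_id: str, begin: int, max_tokens: int
    ) -> AsyncGenerator[str, None]:
        # Phase 3: refuse to decode unless tokens [0, begin) have KV entries.
        if any(slot is None for slot in self.kv[request_id][:begin]):
            raise RuntimeError("KV cache not fully transferred")
        for i in range(max_tokens):
            yield f"tok{i}"  # placeholder for real sampled tokens


class FakePrefillEngine:
    """Hypothetical prefill engine that pushes KV entries to a peer by rank."""

    def __init__(self, peers: Dict[int, FakeDecodeEngine]) -> None:
        self.peers = peers  # maps recv_rank to the decode engine

    async def remote_send(self, prompt, kv_addr_info, recv_rank, begin, end) -> None:
        # Phase 2: "compute" KV entries for prompt[begin:end] and send them.
        entries = [f"kv({tok})" for tok in prompt[begin:end]]
        self.peers[recv_rank].write_kv(kv_addr_info, begin, entries)


async def handoff(prompt, request_id, prefill, decode, recv_rank, max_tokens=2):
    end = len(prompt) - 1  # assumption: the decode engine handles the last token
    kv_addr_info = await decode.prep_recv(request_id, end)
    await prefill.remote_send(prompt, kv_addr_info, recv_rank, begin=0, end=end)
    stream = decode.start_generate(request_id, begin=end, max_tokens=max_tokens)
    return [tok async for tok in stream]
```

The point of the sketch is the dependency chain: remote_send cannot run without the kv_addr_info returned by prep_recv, and start_generate fails if it is called before the transfer has filled the reserved slots.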

Step 4: Launch the orchestrated deployment

Start the complete multi-engine deployment by calling the serve function with the custom router type, specifying host and port for the router frontend, endpoint hosts and ports for each engine instance, GPU allocation per engine, and model configuration. The serve function launches the engine processes, initializes the router, and begins accepting client requests through the router frontend.

Key considerations:

  • The router_host and router_port define the client-facing endpoint
  • endpoint_hosts and endpoint_ports are lists specifying each engine's network binding
  • endpoint_num_gpus specifies the number of GPUs per engine for tensor parallelism
  • Prefix caching should be disabled when using disaggregated serving to avoid KV cache conflicts
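
Before launching, it can help to validate the deployment parameters in one place. The helper below is a hypothetical sketch, not part of MLC-LLM: it checks that the per-engine lists line up and that no two services share a host and port, then returns a keyword dictionary. The keys router_host, router_port, endpoint_hosts, endpoint_ports and endpoint_num_gpus come from the text above; enable_prefix_cache is an assumed keyword name for the prefix caching switch and should be checked against the installed serve function.

```python
from typing import Any, Dict, List


def build_serve_kwargs(
    router_host: str,
    router_port: int,
    endpoint_hosts: List[str],
    endpoint_ports: List[int],
    endpoint_num_gpus: List[int],
) -> Dict[str, Any]:
    """Validate and assemble deployment settings (hypothetical helper)."""
    if not (len(endpoint_hosts) == len(endpoint_ports) == len(endpoint_num_gpus)):
        raise ValueError(
            "endpoint_hosts, endpoint_ports and endpoint_num_gpus must have equal length"
        )
    # The router frontend and every engine need a distinct host:port binding.
    bindings = [(router_host, router_port)] + list(zip(endpoint_hosts, endpoint_ports))
    if len(set(bindings)) != len(bindings):
        raise ValueError("router and engines must bind to distinct host:port pairs")
    return {
        "router_host": router_host,
        "router_port": router_port,
        "endpoint_hosts": endpoint_hosts,
        "endpoint_ports": endpoint_ports,
        "endpoint_num_gpus": endpoint_num_gpus,
        "enable_prefix_cache": False,  # assumed keyword name; disables prefix caching
    }


kwargs = build_serve_kwargs(
    "127.0.0.1", 9123, ["127.0.0.1", "127.0.0.1"], [9124, 9125], [2, 2]
)
```

The resulting dictionary could then be passed to the serve function together with the model settings and the custom router type, assuming the keyword names match the installed version.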

Step 5: Handle preemption and recovery

Implement robust preemption handling in the router to manage scenarios where the decode engine cannot accommodate a request due to KV cache pressure. When the decode engine returns a preempt signal (finish_reason of "preempt"), the router must retry the full translate_request pipeline, restarting from the prep_recv phase. This ensures requests are eventually served even under heavy load, with the router transparently managing the retry loop.

Key considerations:

  • Preemption is signaled via finish_reason "preempt" in the streaming response
  • The router yields None to trigger a retry of the entire translate_request async generator
  • Retry logic should include backoff to prevent thundering herd under sustained load
  • Monitor preemption frequency as a signal of decode engine capacity constraints
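
A retry loop with backoff might look like the following. This is a minimal sketch under stated assumptions: serve_with_retry is a hypothetical helper, attempt_fn stands in for one full run of translate_request, and a yielded None is treated as the preemption signal. For simplicity the sketch buffers chunks and discards those from a preempted attempt; a streaming router would need its own policy for chunks already sent to the client. The backoff is exponential with random jitter so that preempted requests do not all retry at the same moment.

```python
import asyncio
import random
from typing import AsyncGenerator, Callable, List, Optional


async def serve_with_retry(
    attempt_fn: Callable[[], AsyncGenerator[Optional[str], None]],
    max_retries: int = 5,
    base_delay: float = 0.05,
    max_delay: float = 1.0,
) -> List[str]:
    """Run attempt_fn until it completes without yielding None (preemption)."""
    for attempt in range(max_retries + 1):
        chunks: List[str] = []
        preempted = False
        stream = attempt_fn()
        try:
            async for chunk in stream:
                if chunk is None:  # preemption signal from the router
                    preempted = True
                    break
                chunks.append(chunk)
        finally:
            await stream.aclose()  # release the abandoned attempt cleanly
        if not preempted:
            return chunks
        if attempt < max_retries:
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2**attempt)
            await asyncio.sleep(random.uniform(0, delay))
    raise RuntimeError("request preempted too many times")
```

Counting how often the preempted branch is taken gives a direct measure of the preemption frequency mentioned above.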
