
Principle:Mlc ai Mlc llm Orchestrated Deployment Launch

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Distributed_Serving
Last Updated: 2026-02-09 00:00 GMT

Overview

Orchestrated deployment launch is the practice of starting a coordinated set of backend engine endpoints alongside a unified API gateway (router) in a single operation, providing a ready-to-use disaggregated serving system behind a standard HTTP interface.

Description

Deploying a disaggregated LLM serving system requires coordinating multiple independent components: engine processes (each bound to specific GPUs and network ports), a shared GPU communication layer (NVSHMEM), and a front-facing API server that exposes a standard interface (OpenAI-compatible completions). Orchestrated deployment launch encapsulates all of this coordination into a single function call that:

  1. Instantiates the router: Creates a Router object (or a custom subclass), which in turn spawns all backend engine servers, initializes NVSHMEM, and sets up load tracking.
  2. Registers API endpoints: Creates a FastAPI application with an OpenAI-compatible /v1/completions endpoint that delegates to the router's handle_completion method for both streaming and non-streaming responses.
  3. Handles request lifecycle: The registered endpoint manages request ID generation, streaming response formatting (server-sent events), non-streaming response aggregation, client disconnection detection, and error handling.
  4. Starts the HTTP server: Launches a uvicorn server on the specified host and port, making the entire system accessible via a single URL.

This approach follows the Facade pattern: a complex multi-component system is exposed through a simple, single-entry-point interface. Users interact with a standard OpenAI completions API without needing to know about the underlying engine topology, NVSHMEM configuration, or microserving protocol.
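The four steps can be collapsed into a single entry point, as in this minimal sketch. All names here are illustrative stand-ins, not the real MLC-LLM API: engine spawning, NVSHMEM setup, and tokenizer loading are recorded as step names rather than performed as real side effects, and the FastAPI/uvicorn stages are elided.

```python
class Router:
    """Stand-in for the real Router class. Constructing it represents
    spawning the backend engines, generating the NVSHMEM UID, and
    loading the tokenizer; each action is logged instead of executed."""

    def __init__(self, num_engines=3):
        self.steps = ["nvshmem_uid_generated"]
        self.steps += [f"engine_{i}_started" for i in range(num_engines)]
        self.steps.append("tokenizer_initialized")


def serve(num_engines=3, router_type=Router):
    """Facade: one call brings up the whole system. Steps 2-4 (building
    the FastAPI app, registering /v1/completions, and uvicorn.run) would
    follow router construction; they are omitted from this sketch."""
    router = router_type(num_engines)  # step 1: instantiate the router
    return router


router = serve(num_engines=2)
```

The caller sees only `serve()`; the engine topology behind it is an implementation detail, which is the essence of the Facade pattern described above.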

Usage

Use orchestrated deployment launch when:

  • You want to deploy a complete disaggregated or round-robin serving system with a single function call.
  • You need an OpenAI-compatible API gateway in front of multiple engine endpoints.
  • You are deploying in production and want the system to handle all coordination (engine startup, NVSHMEM initialization, API routing) automatically.
  • You want to use a custom Router subclass by passing a router_type parameter, enabling custom routing logic while keeping the deployment infrastructure standard.

Theoretical Basis

Facade Pattern for Distributed Systems

The orchestrated launch implements a facade over the following component hierarchy:

serve()
  |
  +-- Router.__init__()
  |     +-- NVSHMEM UID generation
  |     +-- PopenServer[0].start()  (prefill engine)
  |     +-- PopenServer[1].start()  (decode engine)
  |     +-- PopenServer[N].start()  (additional decode engines)
  |     +-- Tokenizer initialization
  |
  +-- FastAPI Application
  |     +-- POST /v1/completions
  |           +-- router.handle_completion()
  |                 +-- router.translate_request()
  |                       +-- prep_recv -> remote_send -> start_generate
  |
  +-- uvicorn.run()
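The `prep_recv -> remote_send -> start_generate` chain at the bottom of the hierarchy can be sketched with stand-in functions. The step names come from the diagram; the bodies below are hypothetical and only trace which engine role would perform each microserving step.

```python
def translate_request(request):
    """Hypothetical disaggregated flow: the decode engine prepares KV
    receive buffers, the prefill engine pushes KV data into them, and
    generation then begins on the decode engine."""
    prompt = request["prompt"]
    trace = [
        ("decode", "prep_recv", prompt),      # allocate KV receive space
        ("prefill", "remote_send", prompt),   # prefill writes KV remotely
        ("decode", "start_generate", prompt), # decode begins generation
    ]
    return trace


trace = translate_request({"prompt": "hello"})
```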

Request Lifecycle in the Gateway

When a completion request arrives at the gateway, it follows this lifecycle:

1. Generate unique request_id: "cmpl-{uuid}"
2. If streaming:
     a. Create async generator from router.handle_completion()
     b. Eagerly fetch the first chunk so startup errors surface before the HTTP response begins
     c. Return StreamingResponse with "data: {json}\n\n" formatting
     d. Append "data: [DONE]\n\n" sentinel when generator exhausts
3. If non-streaming:
     a. Iterate over all response chunks from router.handle_completion()
     b. Accumulate output text and finish reasons per choice
     c. Check for client disconnection on each iteration
     d. Return aggregated CompletionResponse as JSON
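The lifecycle above can be sketched with framework-free helpers. The real gateway returns a FastAPI `StreamingResponse`; the helper names below are illustrative, but the `"cmpl-{uuid}"` ID format, the `data: {json}\n\n` event framing, and the `[DONE]` sentinel follow the steps listed.

```python
import json
import uuid


def make_request_id():
    """Step 1: unique request ID in the 'cmpl-{uuid}' format."""
    return f"cmpl-{uuid.uuid4()}"


def sse_format(chunks):
    """Step 2: wrap each response chunk as a server-sent event and
    append the [DONE] sentinel once the generator is exhausted."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"


def aggregate(chunks):
    """Step 3 (non-streaming): accumulate output text and keep the
    last finish reason seen across chunks."""
    text = ""
    finish_reason = None
    for chunk in chunks:
        text += chunk.get("text", "")
        finish_reason = chunk.get("finish_reason") or finish_reason
    return {"text": text, "finish_reason": finish_reason}
```

Client-disconnection checks (step 3c) are omitted here because they depend on the web framework's request object.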

Extensibility via Router Type Injection

The serve() function accepts a router_type parameter (defaulting to the base Router class). This enables dependency injection of custom router implementations:

serve(model=..., router_type=MyCustomRouter)
    -> MyCustomRouter.__init__()       # custom initialization
    -> MyCustomRouter.translate_request()  # custom routing logic

This pattern allows the deployment infrastructure to remain stable while the routing strategy is swapped.
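A minimal sketch of this injection point, assuming simplified stand-ins for `Router` and `serve()` (the class bodies and the `RoundRobinRouter` subclass are hypothetical; only the `router_type` parameter mirrors the text):

```python
class Router:
    """Simplified stand-in for the base router."""

    def __init__(self, model):
        self.model = model

    def translate_request(self, request):
        # Default strategy: disaggregated prefill/decode routing.
        return {"strategy": "disaggregated"}


class RoundRobinRouter(Router):
    """Hypothetical subclass swapping in round-robin routing while
    reusing the unchanged deployment infrastructure."""

    def __init__(self, model, num_engines=2):
        super().__init__(model)
        self.num_engines = num_engines
        self.next_engine = 0

    def translate_request(self, request):
        choice = {"strategy": "round_robin", "engine": self.next_engine}
        self.next_engine = (self.next_engine + 1) % self.num_engines
        return choice


def serve(model, router_type=Router):
    # The deployment code never names a concrete router class;
    # it constructs whatever type the caller injects.
    return router_type(model)


router = serve("some-model", router_type=RoundRobinRouter)
```

Because `serve()` only depends on the `Router` interface, swapping the strategy requires no change to the gateway or engine-startup code.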
