Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Mlc ai Mlc llm Python Engine Inference

From Leeroopedia


Knowledge Sources
Domains LLMs, Python_API, Inference, OpenAI_Compatibility
Last Updated 2026-02-09 20:00 GMT

Overview

End-to-end process for running LLM inference programmatically using MLC-LLM's Python engine API with OpenAI-compatible chat completions and text completions interfaces.

Description

This workflow demonstrates how to use MLC-LLM's Python-native engine classes (MLCEngine for synchronous usage and AsyncMLCEngine for asynchronous batch processing) to perform LLM inference directly within Python applications. The API mirrors the OpenAI client library, providing chat completions and text completions with streaming support. The engine handles model loading, JIT compilation, KV cache management, and token sampling internally, exposing a high-level interface suitable for application development, scripting, and integration into larger Python systems.

Key outputs:

  • Generated text responses via OpenAI-compatible Python API
  • Streaming token-by-token output for real-time applications
  • Batch processing capability for high-throughput scenarios

Usage

Execute this workflow when you need to integrate LLM inference directly into a Python application without running a separate server process. This is the preferred approach for scripts, notebooks, batch processing pipelines, and any scenario where the overhead of HTTP communication is unnecessary. Use MLCEngine for simple synchronous usage and AsyncMLCEngine when handling multiple concurrent requests with continuous batching.

Execution Steps

Step 1: Initialize the engine

Create an MLCEngine (synchronous) or AsyncMLCEngine (asynchronous) instance by specifying the model path or HuggingFace identifier. The engine automatically downloads pre-quantized weights if using an HF:// path, performs JIT compilation of the model library if needed, and allocates GPU resources. Engine configuration controls execution mode (local, interactive, or server), GPU memory utilization, and tensor parallelism for multi-GPU setups.

Key considerations:

  • HF:// paths trigger automatic download and caching of model weights
  • Local paths require model weights and mlc-chat-config.json to be present
  • An optional model_lib parameter points to a pre-compiled library to skip JIT compilation
  • Engine modes control KV cache allocation: local (conservative), interactive (single request), server (maximum)

Step 2: Create chat completion requests

Construct chat completion requests using the OpenAI-compatible messages format. Each request contains a list of messages with roles (system, user, assistant) and content, along with generation parameters such as temperature, top_p, max_tokens, and stop sequences. Requests can be configured for either streaming or non-streaming response delivery.

Key considerations:

  • Message format follows the OpenAI Chat Completions API specification exactly
  • System messages set the behavior and personality of the model
  • Generation parameters control randomness and output length
  • The model parameter must match the model loaded in the engine

Step 3: Process streaming responses

For streaming mode, iterate over the response generator to receive token-by-token output. Each chunk contains a delta with the newly generated content, enabling real-time display of model output. For non-streaming mode, the complete response is returned as a single object after generation completes. Both modes return usage statistics including prompt and completion token counts.

Key considerations:

  • Streaming provides lower time-to-first-token for interactive applications
  • Non-streaming is simpler for batch processing where complete responses are needed
  • The finish_reason field indicates why generation stopped (stop token, length limit, or tool call)

Step 4: Handle concurrent requests with AsyncMLCEngine

For high-throughput scenarios, use AsyncMLCEngine to submit multiple requests concurrently. The engine applies continuous batching to process requests in parallel on the GPU, maximizing hardware utilization. Async generators yield streaming responses for each request independently, enabling efficient batch processing patterns.

Key considerations:

  • AsyncMLCEngine requires an asyncio event loop
  • Continuous batching dynamically schedules requests for optimal GPU utilization
  • Server mode is recommended for maximum concurrency
  • Each request is processed independently and can have different generation parameters

Step 5: Terminate the engine

Explicitly terminate the engine when inference is complete to release GPU resources, close background threads, and clean up allocated memory. This is especially important in long-running applications and notebook environments where GPU memory management is critical.

Key considerations:

  • Always call engine.terminate() when finished to avoid GPU memory leaks
  • The engine cannot be reused after termination; create a new instance if needed
  • In notebook environments, failing to terminate may prevent other GPU workloads from running

Execution Diagram

GitHub URL

Workflow Repository