Implementation:Mlc ai Mlc llm Debug Entrypoints

Overview

python/mlc_llm/serve/entrypoints/debug_entrypoints.py defines a set of HTTP debug endpoints for the MLC LLM server. These endpoints expose internal engine functionality for debugging and profiling: event trace dumping, CUDA profiler control, engine metrics retrieval, and engine reset. All endpoints are registered on a FastAPI APIRouter, and every route path begins with /debug.

Location

File: python/mlc_llm/serve/entrypoints/debug_entrypoints.py
Module: mlc_llm.serve.entrypoints.debug_entrypoints
Lines: 132

Router Setup

app = fastapi.APIRouter()

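All debug endpoints are registered on this APIRouter instance. The /debug prefix is written directly into each route path in its decorator (for example, @app.post("/debug/dump_event_trace")), so no additional prefix is needed when the main server application includes the router.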

Endpoints

POST /debug/dump_event_trace

@app.post("/debug/dump_event_trace")
async def debug_dump_event_trace(request: fastapi.Request):

Returns the recorded events in Chrome Trace Event Format as a JSON response.

Request payload:

{"model": "Llama-2-7b-chat-hf-q0f16"}

Processing flow:

Reads the raw request body and decodes it as UTF-8.
Parses the JSON payload and validates the "model" field is present.
Looks up the async engine for the specified model via ServerContext.current().
Returns an error if the model is not served or tracing is not enabled.
Calls async_engine.state.trace_recorder.dump_json() and returns the parsed JSON.

Error responses:

400 Bad Request if the JSON is invalid, the "model" field is missing, the model is not served, or tracing is not enabled.
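
The lookup, guard, and dump steps can be sketched without a running server. This is a minimal illustration, not the module's code: FakeTraceRecorder, make_engine, the plain dict used as an engine registry, and the error message wording are all hypothetical stand-ins. Only the attribute path state.trace_recorder.dump_json() comes from the description above.

```python
import json
from http import HTTPStatus
from types import SimpleNamespace
from typing import Any, Dict, Tuple


class FakeTraceRecorder:
    """Hypothetical stand-in for the engine's trace recorder."""

    def dump_json(self) -> str:
        # Return a small Chrome Trace Event Format document as a string.
        return json.dumps({"traceEvents": [{"name": "prefill", "ph": "B", "ts": 0}]})


def make_engine(tracing: bool) -> SimpleNamespace:
    """Build a stub engine that exposes only `state.trace_recorder`."""
    recorder = FakeTraceRecorder() if tracing else None
    return SimpleNamespace(state=SimpleNamespace(trace_recorder=recorder))


def dump_event_trace(engines: Dict[str, Any], model: str) -> Tuple[HTTPStatus, Any]:
    """Mirror the endpoint's steps: look up the engine, guard, then dump."""
    engine = engines.get(model)
    if engine is None:
        return HTTPStatus.BAD_REQUEST, f'The requested model "{model}" is not served.'
    if engine.state.trace_recorder is None:
        return HTTPStatus.BAD_REQUEST, f'Tracing is not enabled for model "{model}".'
    # dump_json() yields a string, so it is parsed before being returned.
    return HTTPStatus.OK, json.loads(engine.state.trace_recorder.dump_json())


# Hypothetical registry: one engine with tracing, one without.
engines = {"traced": make_engine(True), "untraced": make_engine(False)}
status, payload = dump_event_trace(engines, "traced")
```

Both failure cases return the same status code, so a caller has to read the message to tell an unserved model apart from an engine that has tracing disabled.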

POST /debug/cuda_profiler_start

@app.post("/debug/cuda_profiler_start")
async def debug_cuda_profiler_start(_request: fastapi.Request):

Starts the CUDA profiler for the engine. Because the CUDA profiler is process-wide, the function is invoked only on the first model's engine: the for loop breaks after its first iteration.

Implementation:

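server_context = ServerContext.current()
# The CUDA profiler is process-wide, so one engine is enough.
for model in server_context.get_model_list():
    async_engine = server_context.get_engine(model)
    async_engine._debug_call_func_on_all_worker("mlc.debug_cuda_profiler_start")
    break  # stop after the first model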
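
To see why a for ... break loop is used rather than indexing the first element, the pattern can be exercised against stub engines. In this sketch, StubEngine and call_on_first_engine are hypothetical names introduced for illustration. Only the method name _debug_call_func_on_all_worker and the function string are taken from the snippet above.

```python
from typing import Dict, List


class StubEngine:
    """Hypothetical engine that records which debug functions were invoked."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def _debug_call_func_on_all_worker(self, func_name: str) -> None:
        self.calls.append(func_name)


def call_on_first_engine(engines: Dict[str, StubEngine], func_name: str) -> None:
    """Invoke func_name on the first engine only; do nothing if there are none."""
    for engine in engines.values():
        engine._debug_call_func_on_all_worker(func_name)
        break  # later engines share the same process-wide profiler


engines = {"model-a": StubEngine(), "model-b": StubEngine()}
call_on_first_engine(engines, "mlc.debug_cuda_profiler_start")

# An empty registry is a no-op rather than an IndexError.
call_on_first_engine({}, "mlc.debug_cuda_profiler_start")
```

The loop body runs at most once, and an empty model list needs no separate check.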

POST /debug/cuda_profiler_stop

@app.post("/debug/cuda_profiler_stop")
async def debug_cuda_profiler_stop(_request: fastapi.Request):

Stops the CUDA profiler. Uses the same pattern as the start endpoint, calling "mlc.debug_cuda_profiler_stop" on all workers of the first model's engine.

POST /debug/dump_engine_metrics

@app.post("/debug/dump_engine_metrics")
async def debug_dump_engine_metrics(request: fastapi.Request):

Returns the engine metrics for debugging purposes.

Processing flow:

Reads and parses the raw request body as JSON.
Extracts the optional "model" field (defaults to None).
Retrieves the engine and awaits async_engine.metrics().
Returns the metrics result.

Note that unlike other endpoints, this one uses request_dict.get("model", None) and does not explicitly validate the model field's presence.

POST /debug/reset_engine

@app.post("/debug/reset_engine")
async def debug_reset_engine_stats(request: fastapi.Request):

Resets the engine, cleaning up all running data and metrics.

Processing flow:

Reads and parses the raw request body as JSON.
Validates the "model" field is present.
Retrieves the async engine for the specified model.
Calls async_engine.reset() to clear all state.

Error Handling Pattern

Endpoints that require a model name follow a consistent error handling pattern (dump_engine_metrics omits the final "model" presence check):

request_raw_data = await request.body()
request_json_str = request_raw_data.decode("utf-8")
try:
    request_dict = json.loads(request_json_str)
except json.JSONDecodeError:
    return error_protocol.create_error_response(
        HTTPStatus.BAD_REQUEST, message=f"Invalid request {request_json_str}"
    )
if "model" not in request_dict:
    return error_protocol.create_error_response(
        HTTPStatus.BAD_REQUEST, message=f"Invalid request {request_json_str}"
    )

Requests are processed as raw bytes rather than using FastAPI's built-in request parsing, giving full control over JSON decoding and error responses.
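
The pattern can be factored into a small helper, sketched here under stated assumptions. parse_model_request is a hypothetical name, not a function in the module, and it returns a plain (status, message) tuple in place of error_protocol.create_error_response. It also adds one check the snippet above does not have: it rejects JSON that is not an object, because a bare number such as 5 would make the `"model" not in request_dict` test raise TypeError.

```python
import json
from http import HTTPStatus
from typing import Any, Dict, Optional, Tuple

ErrorInfo = Tuple[HTTPStatus, str]


def parse_model_request(raw: bytes) -> Tuple[Optional[Dict[str, Any]], Optional[ErrorInfo]]:
    """Decode a raw body and require a "model" key.

    Returns (request_dict, None) on success or (None, error) on failure.
    """
    text = raw.decode("utf-8")
    try:
        request_dict = json.loads(text)
    except json.JSONDecodeError:
        return None, (HTTPStatus.BAD_REQUEST, f"Invalid request {text}")
    # Extra guard (not in the original pattern): only JSON objects are accepted.
    if not isinstance(request_dict, dict) or "model" not in request_dict:
        return None, (HTTPStatus.BAD_REQUEST, f"Invalid request {text}")
    return request_dict, None


ok_dict, ok_err = parse_model_request(b'{"model": "Llama-2-7b-chat-hf-q0f16"}')
bad_dict, bad_err = parse_model_request(b"not json")
```

A caller checks the second element first and returns it as the response when it is not None.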

Dependencies

fastapi: For the APIRouter and Request objects.
json: For JSON parsing and decoding.
http.HTTPStatus: For HTTP status code constants.
mlc_llm.protocol.error_protocol: For creating standardized error responses.
mlc_llm.serve.server.ServerContext: For accessing the server's engine registry and model list.

Summary of Endpoints

Endpoint	Method	Purpose	Model Required
`/debug/dump_event_trace`	POST	Dump event traces in Chrome Trace Format	Yes
`/debug/cuda_profiler_start`	POST	Start CUDA profiler	No
`/debug/cuda_profiler_stop`	POST	Stop CUDA profiler	No
`/debug/dump_engine_metrics`	POST	Retrieve engine metrics	Optional
`/debug/reset_engine`	POST	Reset engine state and metrics	Yes
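
As a usage illustration, requests for these endpoints can be built with the standard library alone. BASE_URL and build_debug_request are assumptions introduced for this sketch; adjust the host and port to match the actual server. The code only constructs the requests, so it runs without a server. The send step is left as a comment.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8000"  # assumption: replace with your server's address


def build_debug_request(endpoint: str, model: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for /debug/<endpoint> with an optional "model" field."""
    body = {} if model is None else {"model": model}
    return urllib.request.Request(
        url=f"{BASE_URL}/debug/{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


trace_req = build_debug_request("dump_event_trace", "Llama-2-7b-chat-hf-q0f16")
profiler_req = build_debug_request("cuda_profiler_start")

# To send a request (needs a running server that exposes these endpoints):
# with urllib.request.urlopen(trace_req) as resp:
#     trace = json.load(resp)
```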
