Implementation:Mlc ai Mlc llm Debug Entrypoints
Overview
python/mlc_llm/serve/entrypoints/debug_entrypoints.py defines a set of HTTP debug endpoints for the MLC LLM server. These endpoints expose internal engine functionality for debugging and profiling: event trace dumping, CUDA profiler control, engine metrics retrieval, and engine reset. All endpoints are registered on a FastAPI APIRouter, and every route path begins with /debug.
Location
- File: python/mlc_llm/serve/entrypoints/debug_entrypoints.py
- Module: mlc_llm.serve.entrypoints.debug_entrypoints
- Lines: 132
Router Setup
app = fastapi.APIRouter()
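All debug endpoints are registered on this APIRouter instance, which the main server application includes. Each route decorator spells out the full path (for example "/debug/dump_event_trace"), so the /debug prefix comes from the route paths themselves rather than from a separate mount prefix.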
Endpoints
POST /debug/dump_event_trace
@app.post("/debug/dump_event_trace")
async def debug_dump_event_trace(request: fastapi.Request):
Returns the recorded events in Chrome Trace Event Format as a JSON response.
Request payload:
{"model": "Llama-2-7b-chat-hf-q0f16"}
Processing flow:
- Reads the raw request body and decodes it as UTF-8.
- Parses the JSON payload and validates that the "model" field is present.
- Looks up the async engine for the specified model via ServerContext.current().
- Returns an error if the model is not served or tracing is not enabled.
- Calls async_engine.state.trace_recorder.dump_json() and returns the parsed JSON.
Error responses:
- 400 Bad Request if the JSON is invalid, the "model" field is missing, the model is not served, or tracing is not enabled.
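A client-side sketch of calling this endpoint, using only the Python standard library. The helper names and the base URL are assumptions for illustration and are not part of MLC LLM; only the route path and the payload shape come from the description above.

```python
import json
import urllib.request


def build_dump_trace_request(base_url: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /debug/dump_event_trace."""
    # The endpoint reads the raw body and parses it as JSON with a "model" key.
    payload = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/debug/dump_event_trace",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def dump_trace_to_file(base_url: str, model: str, out_path: str) -> None:
    """Send the request and save the returned trace JSON to out_path."""
    request = build_dump_trace_request(base_url, model)
    with urllib.request.urlopen(request) as response:
        trace = json.loads(response.read().decode("utf-8"))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(trace, f)
```

For example, dump_trace_to_file("http://127.0.0.1:8000", "Llama-2-7b-chat-hf-q0f16", "event_trace.json") would save the trace, assuming a server is listening at that address. If the server answers with 400 Bad Request, urlopen raises urllib.error.HTTPError, so a caller may want to catch that exception and inspect the error body.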
POST /debug/cuda_profiler_start
@app.post("/debug/cuda_profiler_start")
async def debug_cuda_profiler_start(_request: fastapi.Request):
Starts the CUDA profiler for the engine. Since the CUDA profiler is process-wide, it only calls the function on the first model's engine (using a for ... break pattern).
Implementation:
for model in server_context.get_model_list():
    async_engine = server_context.get_engine(model)
    async_engine._debug_call_func_on_all_worker("mlc.debug_cuda_profiler_start")
    break
POST /debug/cuda_profiler_stop
@app.post("/debug/cuda_profiler_stop")
async def debug_cuda_profiler_stop(_request: fastapi.Request):
Stops the CUDA profiler. Uses the same pattern as the start endpoint, calling "mlc.debug_cuda_profiler_stop" on all workers of the first model's engine.
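The for ... break pattern shared by the start and stop endpoints can be exercised in isolation. The sketch below uses hypothetical stand-in classes (FakeEngine and FakeServerContext are not MLC LLM types) that mimic only the three methods named above, to show that exactly one engine receives the call and that an empty model list is a no-op.

```python
class FakeEngine:
    """Hypothetical stand-in for an async engine; records debug calls."""

    def __init__(self):
        self.calls = []

    def _debug_call_func_on_all_worker(self, func_name):
        self.calls.append(func_name)


class FakeServerContext:
    """Hypothetical stand-in exposing get_model_list() and get_engine()."""

    def __init__(self, engines):
        self._engines = dict(engines)  # dicts preserve insertion order

    def get_model_list(self):
        return list(self._engines)

    def get_engine(self, model):
        return self._engines[model]


def call_on_first_engine(server_context, func_name):
    """Invoke func_name on the first served model's engine only."""
    for model in server_context.get_model_list():
        async_engine = server_context.get_engine(model)
        async_engine._debug_call_func_on_all_worker(func_name)
        break  # the profiler is process-wide, so one engine is enough


first, second = FakeEngine(), FakeEngine()
context = FakeServerContext({"model-a": first, "model-b": second})
call_on_first_engine(context, "mlc.debug_cuda_profiler_start")
call_on_first_engine(context, "mlc.debug_cuda_profiler_stop")
```

After this runs, only the first engine has recorded the start and stop calls. Looping with an immediate break, rather than indexing element 0, avoids an IndexError when no model is served.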
POST /debug/dump_engine_metrics
@app.post("/debug/dump_engine_metrics")
async def debug_dump_engine_metrics(request: fastapi.Request):
Returns the engine metrics for debugging purposes.
Processing flow:
- Reads and parses the raw request body as JSON.
- Extracts the optional "model" field (defaults to None).
- Retrieves the engine and awaits async_engine.metrics().
- Returns the metrics result.
Unlike dump_event_trace and reset_engine, this endpoint reads the field with request_dict.get("model", None) and does not explicitly validate that the "model" field is present.
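The difference between the lenient lookup used here and the strict presence check used by the other model-taking endpoints can be seen with plain dictionaries. This is a minimal sketch; the two helper names are hypothetical, and what the server does with a None model afterwards is not shown in this page.

```python
import json


def optional_model(request_dict: dict):
    """Lenient lookup: a missing "model" key simply yields None."""
    return request_dict.get("model", None)


def has_model_key(request_dict: dict) -> bool:
    """Strict check: True only when the "model" key is present."""
    return "model" in request_dict


empty = json.loads("{}")
explicit_null = json.loads('{"model": null}')
named = json.loads('{"model": "Llama-2-7b-chat-hf-q0f16"}')
```

Here optional_model(empty) and optional_model(explicit_null) both return None, while has_model_key distinguishes the two cases: the key is absent in the first and present (with a null value) in the second.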
POST /debug/reset_engine
@app.post("/debug/reset_engine")
async def debug_reset_engine_stats(request: fastapi.Request):
Resets the engine, cleaning up all running data and metrics.
Processing flow:
- Reads and parses the raw request body as JSON.
- Validates that the "model" field is present.
- Retrieves the async engine for the specified model.
- Calls async_engine.reset() to clear all state.
Error Handling Pattern
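The endpoints that require a model name (dump_event_trace and reset_engine) follow a consistent error handling pattern; dump_engine_metrics shares the JSON decoding step but skips the "model" presence check: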
request_raw_data = await request.body()
request_json_str = request_raw_data.decode("utf-8")
try:
    request_dict = json.loads(request_json_str)
except json.JSONDecodeError:
    return error_protocol.create_error_response(
        HTTPStatus.BAD_REQUEST, message=f"Invalid request {request_json_str}"
    )
if "model" not in request_dict:
    return error_protocol.create_error_response(
        HTTPStatus.BAD_REQUEST, message=f"Invalid request {request_json_str}"
    )
Requests are processed as raw bytes rather than using FastAPI's built-in request parsing, giving full control over JSON decoding and error responses.
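The same validation steps can be sketched as a pure function that is easy to test without a running server. This is an illustrative rewrite, not the module's code: parse_debug_request is a hypothetical name, errors are returned as a (status, message) tuple instead of going through error_protocol, and the isinstance check is an extra guard added here so that a non-object JSON body such as 5 is rejected cleanly.

```python
import json
from http import HTTPStatus
from typing import Optional, Tuple

ErrorInfo = Tuple[int, str]


def parse_debug_request(raw: bytes) -> Tuple[Optional[dict], Optional[ErrorInfo]]:
    """Return (request_dict, None) on success or (None, (status, message))."""
    request_json_str = raw.decode("utf-8")
    error = (int(HTTPStatus.BAD_REQUEST), f"Invalid request {request_json_str}")
    try:
        request_dict = json.loads(request_json_str)
    except json.JSONDecodeError:
        return None, error
    # Extra guard (an assumption of this sketch): require a JSON object.
    if not isinstance(request_dict, dict):
        return None, error
    if "model" not in request_dict:
        return None, error
    return request_dict, None


ok_result = parse_debug_request(b'{"model": "Llama-2-7b-chat-hf-q0f16"}')
bad_json_result = parse_debug_request(b"not json")
missing_model_result = parse_debug_request(b"{}")
```

Both failure cases produce the same status and message format, mirroring the pattern above, so a client cannot tell malformed JSON from a missing "model" field by the message alone.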
Dependencies
- fastapi: For the APIRouter and Request objects.
- json: For JSON parsing and decoding.
- http.HTTPStatus: For HTTP status code constants.
- mlc_llm.protocol.error_protocol: For creating standardized error responses.
- mlc_llm.serve.server.ServerContext: For accessing the server's engine registry and model list.
Summary of Endpoints
| Endpoint | Method | Purpose | Model Required |
|---|---|---|---|
| /debug/dump_event_trace | POST | Dump event traces in Chrome Trace Format | Yes |
| /debug/cuda_profiler_start | POST | Start CUDA profiler | No |
| /debug/cuda_profiler_stop | POST | Stop CUDA profiler | No |
| /debug/dump_engine_metrics | POST | Retrieve engine metrics | Optional |
| /debug/reset_engine | POST | Reset engine state and metrics | Yes |