Implementation:Mlc ai Mlc llm Metrics Impl
Overview
The file cpp/serve/metrics.cc implements the JSON serialization methods of the MLC-LLM serving engine metrics system. It defines AsJSON() and the related serialization methods declared in metrics.h for four metric structures (TimeCost, SpecDecodeMetrics, RequestMetrics, and EngineMetrics), along with EngineMetrics::Reset(). All serialization goes through picojson, a header-only JSON library.
File Location
cpp/serve/metrics.cc
Dependencies
| Include | Purpose |
|---|---|
| metrics.h | Header declaring the metric structures |
| tvm/runtime/logging.h | TVM assertion macros (ICHECK_EQ) |
| <sstream> | String stream for formatting Prometheus-style metric labels |
Namespace
All implementations reside in mlc::llm::serve.
TimeCost::AsJSON
picojson::object TimeCost::AsJSON() const {
  picojson::object config;
  config["count"] = picojson::value(count);
  if (count != 0) {
    config["mean"] = picojson::value(sum / count);
  }
  return config;
}
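The guard on count in the excerpt above can be tried in isolation. The following is a minimal sketch, not the project's code: TimeCostSketch, JsonNumbers, and TimeCostToJson are hypothetical names, and a std::map stands in for picojson::object so the example compiles without the library.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for TimeCost: only the two fields read by AsJSON().
struct TimeCostSketch {
  int64_t count = 0;
  double sum = 0.0;
};

// Stand-in for picojson::object: a plain map from key to number.
using JsonNumbers = std::map<std::string, double>;

JsonNumbers TimeCostToJson(const TimeCostSketch& cost) {
  assert(cost.count >= 0);  // a negative event count would be a bookkeeping bug
  JsonNumbers out;
  out["count"] = static_cast<double>(cost.count);
  if (cost.count != 0) {
    // Emitted only for a non-zero count, so the division is always defined.
    out["mean"] = cost.sum / static_cast<double>(cost.count);
  }
  return out;
}
```

With this sketch, a default-constructed TimeCostSketch yields a map with a "count" entry and no "mean" entry, which consumers would need to treat as "no data" rather than as a mean of zero.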
TimeCost::AsJSON() produces a JSON object with:
- "count" -- the number of tracked events
- "mean" -- the mean cost (included only when count is non-zero, to avoid division by zero)
SpecDecodeMetrics::AsJSON
picojson::object SpecDecodeMetrics::AsJSON() const {
  picojson::object metrics;
  auto f_vector_to_array = [](const std::vector<int64_t>& vec) {
    picojson::array arr;
    for (int64_t v : vec) {
      arr.push_back(picojson::value(v));
    }
    return picojson::value(arr);
  };
  metrics["draft_count"] = f_vector_to_array(draft_count);
  metrics["accept_count"] = f_vector_to_array(accept_count);
  // ... computes accept_prob, accept_rate, accept_len per step
  return metrics;
}
This method serializes speculative decoding statistics. It computes three derived metric groups, each using Prometheus-style labels (e.g., accept_prob{step=0}):
| Metric Group | Computation | Description |
|---|---|---|
| accept_prob | accept_count[i] / draft_count[i] | Acceptance probability at each speculation step |
| accept_rate | accept_count[i] / accept_count[i-1] | Conditional acceptance rate given acceptance at the previous step (starts from step 1) |
| accept_len | Cumulative sum of accept_prob | Expected number of accepted tokens up to each step |
The method validates that draft_count and accept_count vectors have the same size using ICHECK_EQ.
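The three derived groups in the table can be sketched as one loop. This is an illustrative sketch under stated assumptions rather than the elided code: StepLabel, DerivedSpecDecodeMetrics, and JsonNumbers are hypothetical names, a single std::map stands in for the picojson objects, assert stands in for ICHECK_EQ, and the zero-denominator handling is an assumption that the source excerpt does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for picojson::object: a plain map from label to number.
using JsonNumbers = std::map<std::string, double>;

// Builds a Prometheus-style label such as "accept_prob{step=0}".
std::string StepLabel(const std::string& name, std::size_t step) {
  std::ostringstream os;
  os << name << "{step=" << step << "}";
  return os.str();
}

JsonNumbers DerivedSpecDecodeMetrics(const std::vector<int64_t>& draft_count,
                                     const std::vector<int64_t>& accept_count) {
  // Stand-in for the ICHECK_EQ size validation described in the text.
  assert(draft_count.size() == accept_count.size());
  JsonNumbers out;
  double accept_len = 0.0;  // running sum of accept_prob
  for (std::size_t i = 0; i < draft_count.size(); ++i) {
    // Assumption: a step with no drafts is reported as probability 0.
    double prob = draft_count[i] == 0
                      ? 0.0
                      : static_cast<double>(accept_count[i]) /
                            static_cast<double>(draft_count[i]);
    out[StepLabel("accept_prob", i)] = prob;
    accept_len += prob;
    out[StepLabel("accept_len", i)] = accept_len;
    if (i != 0) {
      // Conditional rate relative to the previous step; defined from step 1 on.
      // Assumption: a zero count at the previous step is reported as rate 0.
      double rate = accept_count[i - 1] == 0
                        ? 0.0
                        : static_cast<double>(accept_count[i]) /
                              static_cast<double>(accept_count[i - 1]);
      out[StepLabel("accept_rate", i)] = rate;
    }
  }
  return out;
}
```

For draft_count = {8, 8} and accept_count = {4, 2}, the sketch yields accept_prob values 0.5 and 0.25, accept_len values 0.5 and 0.75, and a single accept_rate{step=1} of 0.5.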
RequestMetrics::AsJSON
picojson::object RequestMetrics::AsJSON() const {
  picojson::object metrics;
  metrics["prompt_tokens"] = picojson::value(prompt_tokens);
  metrics["completion_tokens"] = picojson::value(completion_tokens);
  metrics["prefill_tokens"] = picojson::value(prefill_tokens);
  metrics["decode_tokens"] = picojson::value(decode_tokens);
  metrics["jump_forward_tokens"] = picojson::value(jump_forward_tokens);
  // ... conditional throughput and latency fields
  return metrics;
}
Produces a comprehensive per-request metrics JSON including:
- Token counts: prompt_tokens, completion_tokens, prefill_tokens, decode_tokens, jump_forward_tokens
- Throughput (conditional on non-zero counts):
  - prefill_tokens_per_s -- prefill throughput
  - decode_tokens_per_s -- decode throughput
- Latency:
  - end_to_end_latency_s -- total request duration
  - ttft_s -- time to first token
  - inter_token_latency_s -- average latency between tokens
RequestMetrics::AsUsageJSONStr
std::string RequestMetrics::AsUsageJSONStr(bool include_extra) const {
  picojson::object usage;
  usage["prompt_tokens"] = picojson::value(prompt_tokens);
  usage["completion_tokens"] = picojson::value(completion_tokens);
  usage["total_tokens"] = picojson::value(prompt_tokens + completion_tokens);
  if (include_extra) {
    usage["extra"] = picojson::value(this->AsJSON());
  }
  return picojson::value(usage).serialize();
}
Returns an OpenAI-compatible usage JSON string. When include_extra is true, the detailed metrics from AsJSON() are nested under an "extra" key.
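To make the shape of the returned string concrete, here is a small sketch that builds an equivalent payload by hand. It is an illustration only: UsageJsonSketch is a hypothetical helper, it does not use picojson, and the key order it emits is an assumption that may differ from what the real serializer produces.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper. `extra_json` is an already-serialized JSON object;
// an empty string plays the role of include_extra == false.
std::string UsageJsonSketch(int64_t prompt_tokens, int64_t completion_tokens,
                            const std::string& extra_json = "") {
  assert(prompt_tokens >= 0 && completion_tokens >= 0);
  std::ostringstream os;
  os << "{\"prompt_tokens\":" << prompt_tokens
     << ",\"completion_tokens\":" << completion_tokens
     // total_tokens is derived, never stored separately.
     << ",\"total_tokens\":" << (prompt_tokens + completion_tokens);
  if (!extra_json.empty()) {
    os << ",\"extra\":" << extra_json;  // detailed metrics nested under "extra"
  }
  os << "}";
  return os.str();
}
```

Keeping the three token fields at the top level and pushing everything else under "extra" lets clients that only understand the standard usage fields ignore the additional metrics.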
EngineMetrics::AsJSON
This is the largest serialization method; it produces the full engine-level metrics JSON. Key sections:
- Aggregate counters: engine_prefill_time_sum, engine_decode_time_sum, engine_jump_forward_time_sum, and all token sum counters
- Throughput: prefill_tokens_per_s and decode_tokens_per_s (conditional on non-zero denominators)
- Last finished request: embedded via last_finished_request.AsJSON()
- Speculative decoding: embedded via spec_decode.AsJSON() when non-empty
- Batch-size-disaggregated timing: uses a local lambda f_create_time_list to format decode_time_by_batch_size, draft_time_by_batch_size, and verify_time_by_batch_size with Prometheus-style labels such as mean{batch_size=4} and count{batch_size=4}
auto f_create_time_list = [](const std::vector<TimeCost>& time_list) {
  picojson::object result;
  for (size_t i = 1; i < time_list.size(); ++i) {
    const TimeCost& item = time_list[i];
    if (item.count == 0) continue;
    std::ostringstream label_mean;
    label_mean << "mean{batch_size=" << i << "}";
    double mean = item.sum / item.count;
    result[label_mean.str()] = picojson::value(mean);
    // ... also emits count
  }
  return picojson::value(result);
};
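The labeling scheme in the lambda can be exercised without picojson. The sketch below is a stand-alone approximation: TimeCostSketch, JsonNumbers, and CreateTimeList are hypothetical names, a std::map replaces picojson::object, and the count entry is written using the count{batch_size=N} label form mentioned in the text, since the excerpt elides that part.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for TimeCost.
struct TimeCostSketch {
  int64_t count = 0;
  double sum = 0.0;
};

// Stand-in for picojson::object: a plain map from label to number.
using JsonNumbers = std::map<std::string, double>;

JsonNumbers CreateTimeList(const std::vector<TimeCostSketch>& time_list) {
  JsonNumbers result;
  // The vector index is the batch size; index 0 is skipped as in the excerpt.
  for (std::size_t i = 1; i < time_list.size(); ++i) {
    const TimeCostSketch& item = time_list[i];
    assert(item.count >= 0);
    if (item.count == 0) continue;  // unobserved batch sizes produce no entries
    std::ostringstream label_mean;
    label_mean << "mean{batch_size=" << i << "}";
    std::ostringstream label_count;
    label_count << "count{batch_size=" << i << "}";
    result[label_mean.str()] = item.sum / static_cast<double>(item.count);
    result[label_count.str()] = static_cast<double>(item.count);
  }
  return result;
}
```

Because empty slots are skipped, the output stays small even though the backing vectors are sized for every tracked batch size.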
EngineMetrics::AsUsageJSONStr
std::string EngineMetrics::AsUsageJSONStr() const {
  picojson::object usage;
  usage["prompt_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["completion_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["total_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["extra"] = picojson::value(this->AsJSON());
  return picojson::value(usage).serialize();
}
Returns an OpenAI API-compatible usage JSON string with token counts set to zero (since engine-level metrics do not correspond to a single request). The actual engine metrics are embedded in the "extra" field.
EngineMetrics::Reset
void EngineMetrics::Reset() {
  engine_prefill_time_sum = 0.0;
  engine_decode_time_sum = 0.0;
  engine_jump_forward_time_sum = 0;
  prompt_tokens_sum = 0;
  completion_tokens_sum = 0;
  prefill_tokens_sum = 0;
  decode_tokens_sum = 0;
  jump_forward_tokens_sum = 0;
  last_finished_request.Reset();
  spec_decode.Reset();
  decode_time_by_batch_size.clear();
  draft_time_by_batch_size.clear();
  verify_time_by_batch_size.clear();
  decode_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
  draft_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
  verify_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
}
Resets all engine metrics to initial values. The batch-size timing vectors are cleared and then resized to kEndFineGrainedTrackingBatchSize (65), re-initializing with default-constructed TimeCost entries.
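The clear-then-resize pattern matters because resize() alone leaves existing elements untouched when the vector already has the target size. A minimal sketch of the idea follows; TimeCostSketch, ResetTimeList, and kEndFineGrainedTrackingBatchSizeSketch are hypothetical names, and the constant simply reuses the value 65 quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TimeCost.
struct TimeCostSketch {
  int64_t count = 0;
  double sum = 0.0;
};

// Value quoted in the text; the real constant is declared elsewhere in the codebase.
constexpr std::size_t kEndFineGrainedTrackingBatchSizeSketch = 65;

void ResetTimeList(std::vector<TimeCostSketch>* time_list) {
  assert(time_list != nullptr);
  time_list->clear();  // drop every accumulated entry
  // Re-create the slots; each new element is value-initialized (count 0, sum 0.0).
  time_list->resize(kEndFineGrainedTrackingBatchSizeSketch);
}
```

After ResetTimeList runs, every batch-size slot exists again with zeroed statistics, so later updates can index into the vector without bounds growth.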
See Also
- Metrics Header -- Declarations of all metric structures and inline update methods
- Request State Implementation -- Request state management that feeds metrics data
- Request Header -- Request definition carrying generation configuration