Implementation:Mlc ai Mlc llm Metrics Impl

Overview

The file cpp/serve/metrics.cc implements the JSON serialization methods of the MLC-LLM serving engine metrics system. It defines AsJSON() and the related serialization methods declared in metrics.h for four metric structures (TimeCost, SpecDecodeMetrics, RequestMetrics, and EngineMetrics), and it also defines EngineMetrics::Reset(). All JSON output is built with picojson, a header-only JSON library.

File Location

cpp/serve/metrics.cc

Dependencies

Include	Purpose
`metrics.h`	Header declaring the metric structures
`tvm/runtime/logging.h`	TVM assertion macros (`ICHECK_EQ`)
`<sstream>`	String stream for formatting Prometheus-style metric labels

Namespace

All implementations reside in mlc::llm::serve.

TimeCost::AsJSON

picojson::object TimeCost::AsJSON() const {
  picojson::object config;
  config["count"] = picojson::value(count);
  if (count != 0) {
    config["mean"] = picojson::value(sum / count);
  }
  return config;
}

Produces a JSON object with:

"count" -- the number of tracked events
"mean" -- the mean cost (only included when count is non-zero to avoid division by zero)

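The conditional "mean" field can be illustrated with a minimal standalone sketch. It assumes a simplified stand-in struct (TimeCostSketch, a hypothetical name) and uses std::map<std::string, double> in place of picojson::object so that it compiles without picojson; it is not the code from metrics.cc.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Simplified stand-in for TimeCost; the real struct is declared in metrics.h.
struct TimeCostSketch {
  double sum = 0.0;
  int64_t count = 0;

  // std::map<std::string, double> stands in for picojson::object here.
  std::map<std::string, double> AsMap() const {
    std::map<std::string, double> out;
    out["count"] = static_cast<double>(count);
    // Emit "mean" only when there is at least one event, so the division is safe.
    if (count != 0) {
      out["mean"] = sum / static_cast<double>(count);
    }
    return out;
  }
};
```

A consumer of the resulting JSON should therefore treat "mean" as optional and check for its presence before reading it.
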
SpecDecodeMetrics::AsJSON

picojson::object SpecDecodeMetrics::AsJSON() const {
  picojson::object metrics;
  auto f_vector_to_array = [](const std::vector<int64_t>& vec) {
    picojson::array arr;
    for (int64_t v : vec) {
      arr.push_back(picojson::value(v));
    }
    return picojson::value(arr);
  };
  metrics["draft_count"] = f_vector_to_array(draft_count);
  metrics["accept_count"] = f_vector_to_array(accept_count);
  // ... computes accept_prob, accept_rate, accept_len per step
  return metrics;
}

This method serializes speculative decoding statistics. It computes three derived metric groups, each using Prometheus-style labels (e.g., accept_prob{step=0}):

Metric Group	Computation	Description
`accept_prob`	`accept_count[i] / draft_count[i]`	Acceptance probability at each speculation step
`accept_rate`	`accept_count[i] / accept_count[i-1]`	Conditional acceptance rate given acceptance at the previous step (starts from step 1)
`accept_len`	Cumulative sum of `accept_prob`	Expected number of accepted tokens up to each step

The method validates that draft_count and accept_count vectors have the same size using ICHECK_EQ.
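
The three computations in the table can be sketched as follows. This is an illustration under stated assumptions rather than the elided code from metrics.cc: StepLabel and DeriveSpecDecodeMetrics are hypothetical names, a flat std::map replaces the picojson objects, a size mismatch throws instead of triggering ICHECK_EQ, and a zero denominator is assumed to yield 0.0.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: builds a label such as "accept_prob{step=0}".
std::string StepLabel(const std::string& name, std::size_t step) {
  std::ostringstream os;
  os << name << "{step=" << step << "}";
  return os.str();
}

// Sketch of the derived metrics; a flat std::map stands in for picojson objects.
std::map<std::string, double> DeriveSpecDecodeMetrics(
    const std::vector<int64_t>& draft_count,
    const std::vector<int64_t>& accept_count) {
  // The real code uses ICHECK_EQ; an exception is used here to stay standalone.
  if (draft_count.size() != accept_count.size()) {
    throw std::invalid_argument("draft_count and accept_count differ in size");
  }
  std::map<std::string, double> out;
  double accept_len = 0.0;
  for (std::size_t i = 0; i < draft_count.size(); ++i) {
    // Assumption: a zero denominator yields 0.0 instead of dividing.
    const double prob =
        draft_count[i] == 0
            ? 0.0
            : static_cast<double>(accept_count[i]) / static_cast<double>(draft_count[i]);
    out[StepLabel("accept_prob", i)] = prob;

    // accept_len is the running sum of accept_prob up to this step.
    accept_len += prob;
    out[StepLabel("accept_len", i)] = accept_len;

    // accept_rate is defined from step 1 onward, relative to the previous step.
    if (i > 0) {
      const double rate =
          accept_count[i - 1] == 0
              ? 0.0
              : static_cast<double>(accept_count[i]) / static_cast<double>(accept_count[i - 1]);
      out[StepLabel("accept_rate", i)] = rate;
    }
  }
  return out;
}
```

For draft counts {8, 8} and accept counts {4, 2}, the sketch gives accept_prob values 0.5 and 0.25, an accept_rate of 0.5 at step 1, and accept_len values 0.5 and 0.75.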

RequestMetrics::AsJSON

picojson::object RequestMetrics::AsJSON() const {
  picojson::object metrics;
  metrics["prompt_tokens"] = picojson::value(prompt_tokens);
  metrics["completion_tokens"] = picojson::value(completion_tokens);
  metrics["prefill_tokens"] = picojson::value(prefill_tokens);
  metrics["decode_tokens"] = picojson::value(decode_tokens);
  metrics["jump_forward_tokens"] = picojson::value(jump_forward_tokens);
  // ... conditional throughput and latency fields
  return metrics;
}

Produces a comprehensive per-request metrics JSON including:

Token counts: prompt_tokens, completion_tokens, prefill_tokens, decode_tokens, jump_forward_tokens
Throughput (conditional on non-zero counts):
- prefill_tokens_per_s -- prefill throughput
- decode_tokens_per_s -- decode throughput
Latency:
- end_to_end_latency_s -- total request duration
- ttft_s -- time to first token
- inter_token_latency_s -- average latency between tokens

RequestMetrics::AsUsageJSONStr

std::string RequestMetrics::AsUsageJSONStr(bool include_extra) const {
  picojson::object usage;
  usage["prompt_tokens"] = picojson::value(prompt_tokens);
  usage["completion_tokens"] = picojson::value(completion_tokens);
  usage["total_tokens"] = picojson::value(prompt_tokens + completion_tokens);
  if (include_extra) {
    usage["extra"] = picojson::value(this->AsJSON());
  }
  return picojson::value(usage).serialize();
}

Returns an OpenAI-compatible usage JSON string. When include_extra is true, the detailed metrics from AsJSON() are nested under an "extra" key.
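
To show the shape of the returned string, here is a standalone sketch that mimics it without picojson. UsageJSONSketch is a hypothetical helper, and an empty extra_json argument plays the role of include_extra == false. Because picojson::object is a std::map by default, keys are expected to appear in sorted order rather than insertion order; the exact whitespace of the real serialize() output is an assumption here.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper mimicking the shape of the usage string.
// extra_json is an already-serialized JSON object, or empty to omit "extra".
std::string UsageJSONSketch(int64_t prompt_tokens, int64_t completion_tokens,
                            const std::string& extra_json) {
  // A std::map keeps keys sorted, like picojson::object does by default.
  std::map<std::string, std::string> fields;
  fields["prompt_tokens"] = std::to_string(prompt_tokens);
  fields["completion_tokens"] = std::to_string(completion_tokens);
  fields["total_tokens"] = std::to_string(prompt_tokens + completion_tokens);
  if (!extra_json.empty()) {
    fields["extra"] = extra_json;
  }

  // Serialize compactly: {"key":value,...}
  std::ostringstream os;
  os << "{";
  bool first = true;
  for (const auto& kv : fields) {
    if (!first) os << ",";
    first = false;
    os << "\"" << kv.first << "\":" << kv.second;
  }
  os << "}";
  return os.str();
}
```

With 12 prompt tokens, 30 completion tokens, and no extra payload, the sketch returns {"completion_tokens":30,"prompt_tokens":12,"total_tokens":42}.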

EngineMetrics::AsJSON

This is the largest serialization method; it produces the full engine-level metrics JSON. Key sections:

Aggregate counters: engine_prefill_time_sum, engine_decode_time_sum, engine_jump_forward_time_sum, all token sum counters
Throughput: prefill_tokens_per_s and decode_tokens_per_s (conditional on non-zero denominators)
Last finished request: Embedded via last_finished_request.AsJSON()
Speculative decoding: Embedded via spec_decode.AsJSON() when non-empty
Batch-size-disaggregated timing: Uses a local lambda f_create_time_list to format decode_time_by_batch_size, draft_time_by_batch_size, and verify_time_by_batch_size with Prometheus-style labels such as mean{batch_size=4} and count{batch_size=4}. The vector index is the batch size; the loop starts at index 1, and entries with a zero count are omitted.

auto f_create_time_list = [](const std::vector<TimeCost>& time_list) {
  picojson::object result;
  // The index i doubles as the batch size; index 0 is never emitted.
  for (size_t i = 1; i < time_list.size(); ++i) {
    const TimeCost& item = time_list[i];
    // Skip batch sizes that were never observed.
    if (item.count == 0) continue;
    std::ostringstream label_mean;
    label_mean << "mean{batch_size=" << i << "}";
    double mean = item.sum / item.count;
    result[label_mean.str()] = picojson::value(mean);
    // ... also emits count
  }
  return picojson::value(result);
};
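
A standalone variant of this lambda, including the count label described above, might look like the following sketch. TimeCostEntry and CreateTimeListSketch are hypothetical names, and std::map<std::string, double> again replaces picojson::object, so the count is stored as a double here rather than as an integer JSON value.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-in for TimeCost.
struct TimeCostEntry {
  double sum = 0.0;
  int64_t count = 0;
};

// Sketch: emits mean{batch_size=i} and count{batch_size=i} for observed sizes.
std::map<std::string, double> CreateTimeListSketch(
    const std::vector<TimeCostEntry>& time_list) {
  std::map<std::string, double> result;
  // Index i is the batch size; index 0 is skipped.
  for (std::size_t i = 1; i < time_list.size(); ++i) {
    const TimeCostEntry& item = time_list[i];
    if (item.count == 0) continue;  // nothing recorded for this batch size
    std::ostringstream label_mean;
    std::ostringstream label_count;
    label_mean << "mean{batch_size=" << i << "}";
    label_count << "count{batch_size=" << i << "}";
    result[label_mean.str()] = item.sum / static_cast<double>(item.count);
    result[label_count.str()] = static_cast<double>(item.count);
  }
  return result;
}
```

Skipping empty slots keeps the output compact: only batch sizes that were actually observed contribute keys.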

EngineMetrics::AsUsageJSONStr

std::string EngineMetrics::AsUsageJSONStr() const {
  picojson::object usage;
  usage["prompt_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["completion_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["total_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["extra"] = picojson::value(this->AsJSON());
  return picojson::value(usage).serialize();
}

Returns an OpenAI API-compatible usage JSON string with token counts set to zero (since engine-level metrics do not correspond to a single request). The actual engine metrics are embedded in the "extra" field.

EngineMetrics::Reset

void EngineMetrics::Reset() {
  engine_prefill_time_sum = 0.0;
  engine_decode_time_sum = 0.0;
  engine_jump_forward_time_sum = 0;
  prompt_tokens_sum = 0;
  completion_tokens_sum = 0;
  prefill_tokens_sum = 0;
  decode_tokens_sum = 0;
  jump_forward_tokens_sum = 0;
  last_finished_request.Reset();
  spec_decode.Reset();
  decode_time_by_batch_size.clear();
  draft_time_by_batch_size.clear();
  verify_time_by_batch_size.clear();
  decode_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
  draft_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
  verify_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
}

Resets all engine metrics to initial values. The batch-size timing vectors are cleared and then resized to kEndFineGrainedTrackingBatchSize (65), re-initializing with default-constructed TimeCost entries.
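
The clear-then-resize pattern matters because resize() alone would leave existing entries untouched when the vector already has the target size. A minimal sketch, assuming a stand-in struct (TimeCostSlot) and a local constant that mirrors the value stated above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for TimeCost.
struct TimeCostSlot {
  double sum = 0.0;
  int64_t count = 0;
};

// Mirrors the value quoted in the text; the real constant is declared in metrics.h.
constexpr std::size_t kSketchTrackingSize = 65;

// Hypothetical helper showing the reset pattern for one timing vector.
void ResetSlots(std::vector<TimeCostSlot>* slots) {
  slots->clear();                      // drop all accumulated entries
  slots->resize(kSketchTrackingSize);  // refill with value-initialized entries
}
```

In this sketch, calling slots->assign(kSketchTrackingSize, TimeCostSlot{}) would have the same effect in a single call.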
