Implementation:Mlc ai Mlc llm Metrics Impl

From Leeroopedia


Overview

The file cpp/serve/metrics.cc implements the JSON serialization methods of the MLC-LLM serving engine metrics system. It defines AsJSON() and the related methods declared in metrics.h for four metric structures: TimeCost, SpecDecodeMetrics, RequestMetrics, and EngineMetrics, along with EngineMetrics::Reset. All JSON output is built with the header-only picojson library.

File Location

cpp/serve/metrics.cc

Dependencies

Include                  Purpose
metrics.h                Header declaring the metric structures
tvm/runtime/logging.h    TVM assertion macros (ICHECK_EQ)
<sstream>                String stream for formatting Prometheus-style metric labels

Namespace

All implementations reside in mlc::llm::serve.

TimeCost::AsJSON

picojson::object TimeCost::AsJSON() const {
  picojson::object config;
  config["count"] = picojson::value(count);
  if (count != 0) {
    config["mean"] = picojson::value(sum / count);
  }
  return config;
}

Produces a JSON object with:

  • "count" -- the number of tracked events
  • "mean" -- the mean cost (only included when count is non-zero to avoid division by zero)
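
The guard on count can be illustrated without picojson. The following is a minimal sketch, assuming a hypothetical TimeCostSketch struct and a std::map in place of picojson::object; only the field names sum and count come from the excerpt above.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for TimeCost; not the real MLC-LLM type.
struct TimeCostSketch {
  double sum = 0.0;
  int64_t count = 0;

  // Returns the fields that would be serialized.
  // "mean" is omitted when count == 0, so no division by zero occurs.
  std::map<std::string, double> AsFields() const {
    std::map<std::string, double> fields;
    fields["count"] = static_cast<double>(count);
    if (count != 0) {
      fields["mean"] = sum / static_cast<double>(count);
    }
    return fields;
  }
};
```

A consumer of the resulting object should therefore treat "mean" as optional rather than assume it is always present.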

SpecDecodeMetrics::AsJSON

picojson::object SpecDecodeMetrics::AsJSON() const {
  picojson::object metrics;
  auto f_vector_to_array = [](const std::vector<int64_t>& vec) {
    picojson::array arr;
    for (int64_t v : vec) {
      arr.push_back(picojson::value(v));
    }
    return picojson::value(arr);
  };
  metrics["draft_count"] = f_vector_to_array(draft_count);
  metrics["accept_count"] = f_vector_to_array(accept_count);
  // ... computes accept_prob, accept_rate, accept_len per step
  return metrics;
}

This method serializes speculative decoding statistics. It computes three derived metric groups, each using Prometheus-style labels (e.g., accept_prob{step=0}):

Metric Group   Computation                           Description
accept_prob    accept_count[i] / draft_count[i]      Acceptance probability at each speculation step
accept_rate    accept_count[i] / accept_count[i-1]   Conditional acceptance rate given acceptance at the previous step (starts from step 1)
accept_len     Cumulative sum of accept_prob         Expected number of accepted tokens up to each step

The method validates that draft_count and accept_count vectors have the same size using ICHECK_EQ.
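
The three derived groups can be worked through on plain vectors. This is a sketch under stated assumptions, not the code from metrics.cc: the helper names StepLabel and DeriveSpecDecodeMetrics are hypothetical, a flat std::map replaces the nested picojson objects, an exception replaces ICHECK_EQ, and mapping a zero denominator to 0 is an assumption that the source does not confirm.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Builds a Prometheus-style label such as "accept_prob{step=0}".
std::string StepLabel(const std::string& name, std::size_t step) {
  std::ostringstream os;
  os << name << "{step=" << step << "}";
  return os.str();
}

// Hypothetical helper: derives accept_prob, accept_rate and accept_len per step.
std::map<std::string, double> DeriveSpecDecodeMetrics(
    const std::vector<int64_t>& draft_count,
    const std::vector<int64_t>& accept_count) {
  if (draft_count.size() != accept_count.size()) {
    throw std::invalid_argument("draft_count and accept_count differ in size");
  }
  std::map<std::string, double> out;
  double accept_len = 0.0;  // running sum of accept_prob
  for (std::size_t i = 0; i < draft_count.size(); ++i) {
    // Assumption: a zero denominator yields 0 instead of dividing by zero.
    double prob = draft_count[i] == 0
                      ? 0.0
                      : static_cast<double>(accept_count[i]) / static_cast<double>(draft_count[i]);
    out[StepLabel("accept_prob", i)] = prob;
    accept_len += prob;
    out[StepLabel("accept_len", i)] = accept_len;
    if (i > 0) {  // accept_rate is only defined from step 1 onward
      double rate = accept_count[i - 1] == 0
                        ? 0.0
                        : static_cast<double>(accept_count[i]) /
                              static_cast<double>(accept_count[i - 1]);
      out[StepLabel("accept_rate", i)] = rate;
    }
  }
  return out;
}
```

For draft_count = {4, 4} and accept_count = {2, 1}, this gives accept_prob values 0.5 and 0.25, accept_len values 0.5 and 0.75, and a single accept_rate entry of 0.5 at step 1.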

RequestMetrics::AsJSON

picojson::object RequestMetrics::AsJSON() const {
  picojson::object metrics;
  metrics["prompt_tokens"] = picojson::value(prompt_tokens);
  metrics["completion_tokens"] = picojson::value(completion_tokens);
  metrics["prefill_tokens"] = picojson::value(prefill_tokens);
  metrics["decode_tokens"] = picojson::value(decode_tokens);
  metrics["jump_forward_tokens"] = picojson::value(jump_forward_tokens);
  // ... conditional throughput and latency fields
  return metrics;
}

Produces a per-request metrics JSON object containing:

  • Token counts: prompt_tokens, completion_tokens, prefill_tokens, decode_tokens, jump_forward_tokens
  • Throughput (conditional on non-zero counts):
    • prefill_tokens_per_s -- prefill throughput
    • decode_tokens_per_s -- decode throughput
  • Latency:
    • end_to_end_latency_s -- total request duration
    • ttft_s -- time to first token
    • inter_token_latency_s -- average latency between tokens

RequestMetrics::AsUsageJSONStr

std::string RequestMetrics::AsUsageJSONStr(bool include_extra) const {
  picojson::object usage;
  usage["prompt_tokens"] = picojson::value(prompt_tokens);
  usage["completion_tokens"] = picojson::value(completion_tokens);
  usage["total_tokens"] = picojson::value(prompt_tokens + completion_tokens);
  if (include_extra) {
    usage["extra"] = picojson::value(this->AsJSON());
  }
  return picojson::value(usage).serialize();
}

Returns an OpenAI-compatible usage JSON string. When include_extra is true, the detailed metrics from AsJSON() are nested under an "extra" key.
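
To make the shape of the usage string concrete, here is a minimal sketch that formats it by hand. The function UsageJSONSketch is hypothetical and does not use picojson; the key order produced by the real serializer may differ, and extra_json is assumed to be an already valid JSON value.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: builds the usage object as a string.
// An empty extra_json plays the role of include_extra == false.
std::string UsageJSONSketch(int64_t prompt_tokens, int64_t completion_tokens,
                            const std::string& extra_json = "") {
  std::ostringstream os;
  os << "{\"prompt_tokens\":" << prompt_tokens
     << ",\"completion_tokens\":" << completion_tokens
     << ",\"total_tokens\":" << (prompt_tokens + completion_tokens);
  if (!extra_json.empty()) {
    // Detailed metrics are nested under "extra" rather than merged at top level.
    os << ",\"extra\":" << extra_json;
  }
  os << "}";
  return os.str();
}
```

Nesting the detailed metrics under a single key keeps the three standard token fields at the top level, so clients that only read those fields are unaffected by the additional data.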

EngineMetrics::AsJSON

This is the largest serialization method; it produces the full engine-level metrics JSON. Key sections:

  1. Aggregate counters: engine_prefill_time_sum, engine_decode_time_sum, engine_jump_forward_time_sum, all token sum counters
  2. Throughput: prefill_tokens_per_s and decode_tokens_per_s (conditional on non-zero denominators)
  3. Last finished request: Embedded via last_finished_request.AsJSON()
  4. Speculative decoding: Embedded via spec_decode.AsJSON() when non-empty
  5. Batch-size-disaggregated timing: Uses a local lambda f_create_time_list to format decode_time_by_batch_size, draft_time_by_batch_size, and verify_time_by_batch_size with Prometheus-style labels such as mean{batch_size=4} and count{batch_size=4}

auto f_create_time_list = [](const std::vector<TimeCost>& time_list) {
    picojson::object result;
    for (size_t i = 1; i < time_list.size(); ++i) {
      const TimeCost& item = time_list[i];
      if (item.count == 0) continue;
      std::ostringstream label_mean;
      label_mean << "mean{batch_size=" << i << "}";
      double mean = item.sum / item.count;
      result[label_mean.str()] = picojson::value(mean);
      // ... also emits count
    }
    return picojson::value(result);
};
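
The labelling scheme of the lambda can be tried in isolation. The sketch below assumes a hypothetical TimeCostEntry struct and a std::map in place of picojson::object; how the real code emits the count entry is elided in the excerpt, so the count label here simply follows the count{batch_size=4} pattern named in the list above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for TimeCost.
struct TimeCostEntry {
  double sum = 0.0;
  int64_t count = 0;
};

// Hypothetical helper: one mean and one count entry per non-empty batch size.
std::map<std::string, double> TimeListFields(const std::vector<TimeCostEntry>& time_list) {
  std::map<std::string, double> result;
  // Index 0 is skipped, matching the loop above, which starts at i = 1.
  for (std::size_t i = 1; i < time_list.size(); ++i) {
    const TimeCostEntry& item = time_list[i];
    if (item.count == 0) continue;  // no samples recorded for this batch size
    std::ostringstream label_mean;
    label_mean << "mean{batch_size=" << i << "}";
    std::ostringstream label_count;
    label_count << "count{batch_size=" << i << "}";
    result[label_mean.str()] = item.sum / static_cast<double>(item.count);
    result[label_count.str()] = static_cast<double>(item.count);
  }
  return result;
}
```

Because batch sizes without samples are skipped, the output stays small even when the tracking vector has many slots.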

EngineMetrics::AsUsageJSONStr

std::string EngineMetrics::AsUsageJSONStr() const {
  picojson::object usage;
  usage["prompt_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["completion_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["total_tokens"] = picojson::value(static_cast<int64_t>(0));
  usage["extra"] = picojson::value(this->AsJSON());
  return picojson::value(usage).serialize();
}

Returns an OpenAI API-compatible usage JSON string with token counts set to zero (since engine-level metrics do not correspond to a single request). The actual engine metrics are embedded in the "extra" field.

EngineMetrics::Reset

void EngineMetrics::Reset() {
  engine_prefill_time_sum = 0.0;
  engine_decode_time_sum = 0.0;
  engine_jump_forward_time_sum = 0;
  prompt_tokens_sum = 0;
  completion_tokens_sum = 0;
  prefill_tokens_sum = 0;
  decode_tokens_sum = 0;
  jump_forward_tokens_sum = 0;
  last_finished_request.Reset();
  spec_decode.Reset();
  decode_time_by_batch_size.clear();
  draft_time_by_batch_size.clear();
  verify_time_by_batch_size.clear();
  decode_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
  draft_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
  verify_time_by_batch_size.resize(kEndFineGrainedTrackingBatchSize);
}

Resets all engine metrics to initial values. The batch-size timing vectors are cleared and then resized to kEndFineGrainedTrackingBatchSize (65), re-initializing with default-constructed TimeCost entries.
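
The clear-then-resize pattern can be checked on a small standalone example. This is a sketch, assuming a hypothetical TimeCostSlot struct and a local constant kSketchTrackingBatchSize set to the value 65 stated above; neither name exists in MLC-LLM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TimeCost.
struct TimeCostSlot {
  double sum = 0.0;
  int64_t count = 0;
};

// Assumed size, taken from the description above.
constexpr std::size_t kSketchTrackingBatchSize = 65;

// clear() drops every accumulated entry; resize() then refills the vector
// with default-constructed slots, so each sum and count starts at zero.
void ResetTimeList(std::vector<TimeCostSlot>* list) {
  list->clear();
  list->resize(kSketchTrackingBatchSize);
}
```

Calling resize() alone on a vector that already holds 65 entries would keep the old values, which is why the clear() call comes first.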
