Implementation:Mlc ai Mlc llm Request State Impl

Overview

The file cpp/serve/request_state.cc implements runtime state management for requests in the MLC-LLM serving engine. It provides constructors and methods for three key classes: RequestModelState, RequestStateEntry, and RequestState. These classes hold the mutable runtime data associated with each request as it progresses through the prefill, decode, and output generation stages. The file also implements stream output assembly (RequestActionPostProcWorkspace::GetStreamOutput) and the multi-case finish detection that decides when a request is complete (RequestStateEntryNode::GetDeltaRequestReturn).

File Location

cpp/serve/request_state.cc

Dependencies

The implementation includes request_state.h and resides in the mlc::llm::serve namespace.

TVM FFI Registration

TVM_FFI_STATIC_INIT_BLOCK() {
  RequestModelStateNode::RegisterReflection();
  RequestStateEntryNode::RegisterReflection();
  RequestStateNode::RegisterReflection();
}

All three node types are registered with the TVM FFI reflection system at static initialization time.

Class: RequestModelState

Constructor

RequestModelState::RequestModelState(
    Request request, int model_id, int64_t internal_id, Array<Data> inputs,
    const std::optional<xgrammar::CompiledGrammar>& compiled_grammar);

Creates per-model state for a request. If a compiled_grammar is provided, an xgrammar::GrammarMatcher is instantiated with a rollback limit of 10. The grammar matcher enforces structured output constraints (e.g., a JSON schema) during generation.

GetInputLength

int RequestModelStateNode::GetInputLength() const {
  int total_length = 0;
  for (Data input : inputs) {
    total_length += input->GetLength();
  }
  return total_length;
}

Sums the lengths of all input data items (which may include multi-modal inputs).

RequireNextTokenBitmask / GetNextTokenBitmask

bool RequestModelStateNode::RequireNextTokenBitmask() { return grammar_matcher.has_value(); }
void RequestModelStateNode::GetNextTokenBitmask(DLTensor* bitmask) {
  ICHECK(grammar_matcher.has_value());
  grammar_matcher->GetNextTokenBitmask(bitmask);
}

When a grammar matcher is active, these methods allow the logit processor to obtain a bitmask of valid next tokens according to the grammar constraints.
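How a logit processor might consume such a bitmask can be illustrated with a small standalone sketch. It assumes a packed layout with one bit per token id (bit i % 32 of 32-bit word i / 32 is set when token i is allowed) and works on plain std::vector rather than the DLTensor used above. ApplyTokenBitmask is a hypothetical helper for illustration, not an MLC-LLM or xgrammar function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: sets the logit of every disallowed token to -inf so
// that it can never be sampled. Assumed layout: bit (i % 32) of word (i / 32)
// is 1 when token i is allowed by the grammar.
inline void ApplyTokenBitmask(std::vector<float>& logits,
                              const std::vector<std::uint32_t>& bitmask) {
  // The mask must cover the whole vocabulary.
  assert(bitmask.size() * 32 >= logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i) {
    const bool allowed = ((bitmask[i / 32] >> (i % 32)) & 1u) != 0;
    if (!allowed) {
      logits[i] = -std::numeric_limits<float>::infinity();
    }
  }
}
```

Masking before softmax, rather than filtering after sampling, keeps the remaining probabilities correctly normalized over the allowed tokens.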

CommitToken

void RequestModelStateNode::CommitToken(SampleResult sampled_token) {
  committed_tokens.push_back(std::move(sampled_token));
  appeared_token_ids[sampled_token.GetTokenId()] += 1;
  ++num_tokens_for_next_decode;
  if (grammar_matcher) {
    bool accepted = grammar_matcher->AcceptToken(sampled_token.GetTokenId());
    ICHECK(accepted) << "Token id " << sampled_token.GetTokenId()
                     << " is not accepted by the grammar state matcher.";
  }
}

Commits a sampled token to the sequence. This method:

Appends the token to the committed tokens list
Updates the appeared-token frequency map (used for repetition/frequency penalties)
Increments num_tokens_for_next_decode
Advances the grammar matcher state, if one is active (the ICHECK aborts if the grammar rejects the token)
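To illustrate why the appeared-token map stores counts rather than a plain set, the sketch below applies frequency and presence penalties to a logit vector. It assumes the common OpenAI-style formulation (subtract count times frequency_penalty, plus presence_penalty once per seen token). ApplyFrequencyPenalties and its parameters are hypothetical and are not taken from request_state.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper: penalizes tokens that already appeared in the output.
// `appeared` maps token id -> number of occurrences, mirroring the shape of
// appeared_token_ids. The penalty formula is an assumption, not MLC-LLM code.
inline void ApplyFrequencyPenalties(
    std::vector<float>& logits,
    const std::unordered_map<std::int32_t, std::int32_t>& appeared,
    float frequency_penalty, float presence_penalty) {
  for (const auto& [token_id, count] : appeared) {
    assert(token_id >= 0 &&
           static_cast<std::size_t>(token_id) < logits.size());
    // Scales with the occurrence count, plus a flat term for having appeared.
    logits[static_cast<std::size_t>(token_id)] -=
        static_cast<float>(count) * frequency_penalty + presence_penalty;
  }
}
```

Only tokens present in the map are visited, so the cost is proportional to the number of distinct generated tokens rather than to the vocabulary size.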

RollbackTokens

void RequestModelStateNode::RollbackTokens(int count) {
  ICHECK(count <= static_cast<int>(committed_tokens.size()));
  for (int i = 0; i < count; ++i) {
    auto it = appeared_token_ids.find(committed_tokens.back().GetTokenId());
    CHECK(it != appeared_token_ids.end());
    if (--it->second == 0) {
      appeared_token_ids.erase(it);
    }
    committed_tokens.pop_back();
    if (grammar_matcher) {
      grammar_matcher->Rollback(1);
    }
  }
}

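Removes the last count tokens and reverses the per-token bookkeeping done by CommitToken: the appeared-token map is decremented (and entries are erased when their count reaches zero), tokens are popped from the committed list, and the grammar matcher is rolled back one token at a time. The code shown does not adjust num_tokens_for_next_decode.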
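The symmetry between committing and rolling back can be checked with a reduced model of the same bookkeeping. The sketch below is a simplified stand-in, not the real class: TokenHistory is a hypothetical name, tokens are bare ids instead of SampleResult, and the grammar matcher is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the commit/rollback bookkeeping described above.
struct TokenHistory {
  std::vector<std::int32_t> committed_tokens;
  std::unordered_map<std::int32_t, std::int32_t> appeared_token_ids;

  // Appends a token and bumps its occurrence count.
  void Commit(std::int32_t token_id) {
    committed_tokens.push_back(token_id);
    appeared_token_ids[token_id] += 1;
  }

  // Undoes the last `count` commits, erasing map entries that drop to zero
  // so the map only ever contains tokens that are still in the sequence.
  void Rollback(int count) {
    assert(count >= 0 && count <= static_cast<int>(committed_tokens.size()));
    for (int i = 0; i < count; ++i) {
      auto it = appeared_token_ids.find(committed_tokens.back());
      assert(it != appeared_token_ids.end());
      if (--it->second == 0) {
        appeared_token_ids.erase(it);
      }
      committed_tokens.pop_back();
    }
  }
};
```

Erasing zero-count entries matters for any consumer that iterates the map: a stale entry with count zero would otherwise still look like a token that has appeared.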

Draft Token Management

void RequestModelStateNode::AddDraftToken(SampleResult sampled_token, int draft_token_slot,
                                          int64_t parent_idx);
void RequestModelStateNode::RemoveAllDraftTokens(std::vector<int>* removed_draft_token_slots);

AddDraftToken appends a draft token along with its slot index and parent index in the draft token tree. It also maintains a draft_token_first_child_idx array for tree traversal during speculative decoding verification.

RemoveAllDraftTokens clears all draft tokens. When the removed_draft_token_slots output pointer is supplied, it also receives the list of draft token slots that were freed, deduplicated with an unordered_set.
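A compact way to picture the draft token tree is as a set of parallel arrays indexed by insertion order. The sketch below is an illustration under stated assumptions: DraftTokenTree is a hypothetical type, -1 is assumed to mean "no parent" or "no child yet", and the first-child entry is assumed to be set only when a node receives its first child. The actual field layout in request_state.cc may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical parallel-array representation of a draft token tree.
struct DraftTokenTree {
  std::vector<std::int32_t> draft_tokens;
  std::vector<int> draft_token_slots;
  std::vector<std::int64_t> draft_token_parent_idx;
  std::vector<std::int64_t> draft_token_first_child_idx;

  void Add(std::int32_t token_id, int slot, std::int64_t parent_idx) {
    const std::int64_t self = static_cast<std::int64_t>(draft_tokens.size());
    // A parent must already exist; -1 marks a root (assumption).
    assert(parent_idx >= -1 && parent_idx < self);
    draft_tokens.push_back(token_id);
    draft_token_slots.push_back(slot);
    draft_token_parent_idx.push_back(parent_idx);
    draft_token_first_child_idx.push_back(-1);
    if (parent_idx != -1) {
      std::int64_t& first_child =
          draft_token_first_child_idx[static_cast<std::size_t>(parent_idx)];
      if (first_child == -1) first_child = self;  // record the first child only
    }
  }

  // Clears the tree; if `removed_slots` is non-null it receives each freed
  // slot once, in first-seen order.
  void RemoveAll(std::vector<int>* removed_slots) {
    if (removed_slots != nullptr) {
      removed_slots->clear();
      std::unordered_set<int> seen;
      for (int slot : draft_token_slots) {
        if (seen.insert(slot).second) removed_slots->push_back(slot);
      }
    }
    draft_tokens.clear();
    draft_token_slots.clear();
    draft_token_parent_idx.clear();
    draft_token_first_child_idx.clear();
  }
};
```

Deduplication is needed whenever several draft tokens can share a slot, since freeing the same slot twice would corrupt the slot allocator.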

Class: RequestActionPostProcWorkspace

GetStreamOutput

RequestStreamOutput RequestActionPostProcWorkspace::GetStreamOutput() {
  for (const RequestStreamOutput& stream_output : stream_outputs) {
    if (stream_output->unpacked) {
      return stream_output;
    }
  }
  // ... creates a new aggregated stream output
}

Returns the first unpacked stream output if one exists. Otherwise, it creates a new aggregated RequestStreamOutput by combining the delta token IDs, log-probability strings, finish reasons, and extra prefix strings from all existing stream outputs. The new output is appended to the list and returned.
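The return-or-aggregate pattern described above can be sketched as follows. This is a reduced model, not the real types: StreamOutput and PostProcWorkspace are hypothetical, only token ids and finish reasons are carried, aggregation is assumed to be concatenation of the per-output groups, and marking the merged result as unpacked is an assumption made so that repeated calls return the same object.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, reduced stream output: one entry per response group.
struct StreamOutput {
  bool unpacked = false;
  std::vector<std::vector<std::int64_t>> group_delta_token_ids;
  std::vector<std::optional<std::string>> group_finish_reason;
};

struct PostProcWorkspace {
  std::vector<std::shared_ptr<StreamOutput>> stream_outputs;

  std::shared_ptr<StreamOutput> GetStreamOutput() {
    // Fast path: reuse an output that is already directly accessible.
    for (const auto& out : stream_outputs) {
      if (out->unpacked) return out;
    }
    assert(!stream_outputs.empty());
    // Slow path: concatenate the groups of every existing output.
    auto merged = std::make_shared<StreamOutput>();
    for (const auto& out : stream_outputs) {
      merged->group_delta_token_ids.insert(
          merged->group_delta_token_ids.end(),
          out->group_delta_token_ids.begin(),
          out->group_delta_token_ids.end());
      merged->group_finish_reason.insert(merged->group_finish_reason.end(),
                                         out->group_finish_reason.begin(),
                                         out->group_finish_reason.end());
    }
    merged->unpacked = true;  // assumption: lets later calls hit the fast path
    stream_outputs.push_back(merged);
    return merged;
  }
};
```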

Class: RequestStateEntry

Constructor

RequestStateEntry::RequestStateEntry(
    Request request, int num_models, int64_t internal_id, int rng_seed,
    const std::vector<std::string>& token_table,
    const std::optional<xgrammar::CompiledGrammar>& compiled_grammar, int parent_idx);

Creates a request state entry with:

One RequestModelState per model (for multi-model serving or speculative decoding)
A random number generator seeded with rng_seed
A StopStrHandler initialized with stop strings from the generation config (unless ignore_eos is set in debug config)
Input data attached to the model states only when parent_idx == -1 (i.e., a root entry rather than a forked child)
Initial status set to kPending

GetDeltaRequestReturn

This is the most complex method in the file, implementing six cases of finish detection for streaming output:

void RequestStateEntryNode::GetDeltaRequestReturn(const Tokenizer& tokenizer,
                                                  int64_t max_single_sequence_length,
                                                  RequestStreamOutput* delta_stream_output,
                                                  int idx);

The method processes committed tokens starting from next_callback_token_pos and applies the following checks in order:

Case	Condition	Finish Reason
1	No new tokens and no extra prefix string	(none; early return with no output)
2	Stop string matched by StopStrHandler	"stop"
3	Stop token ID found in output (unless ignore_eos is set)	"stop"
4	Grammar matcher is terminated (and no stop token detected)	"stop"
5	max_tokens limit reached	"length"
6	Total prompt + completion exceeds max_single_sequence_length	"length"
For Case 3, when a stop token is detected, all tokens from the stop token onward are erased from the delta output. For Case 4, only the last token (which triggered grammar termination) is popped.
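The ordering of the six cases can be made concrete with a flattened sketch. Everything here is illustrative: FinishCheckInput and DetectFinish are hypothetical names, the stop-string result is passed in as a precomputed flag, -1 is assumed to mean "no max_tokens limit", and the comparison operators follow the wording of the table ("reached" as >=, "exceeds" as >) rather than the actual source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical flattened view of the state the finish checks read.
struct FinishCheckInput {
  std::vector<std::int32_t> delta_token_ids;  // tokens not yet streamed
  bool has_extra_prefix = false;
  bool stop_str_matched = false;              // precomputed stop-string result
  std::vector<std::int32_t> stop_token_ids;
  bool ignore_eos = false;
  bool grammar_terminated = false;
  std::int64_t num_completion_tokens = 0;
  std::int64_t max_tokens = -1;               // assumption: -1 means no limit
  std::int64_t prompt_length = 0;
  std::int64_t max_single_sequence_length = 0;
};

// Returns the finish reason, or nullopt if the request should keep running.
// May trim `delta_token_ids` as described for Cases 3 and 4.
inline std::optional<std::string> DetectFinish(FinishCheckInput& in) {
  // Case 1: nothing new to report.
  if (in.delta_token_ids.empty() && !in.has_extra_prefix) return std::nullopt;
  // Case 2: stop string matched.
  if (in.stop_str_matched) return std::string("stop");
  // Case 3: stop token found; drop it and everything after it.
  if (!in.ignore_eos) {
    auto it = std::find_first_of(
        in.delta_token_ids.begin(), in.delta_token_ids.end(),
        in.stop_token_ids.begin(), in.stop_token_ids.end());
    if (it != in.delta_token_ids.end()) {
      in.delta_token_ids.erase(it, in.delta_token_ids.end());
      return std::string("stop");
    }
  }
  // Case 4: grammar finished; drop only the terminating token.
  if (in.grammar_terminated) {
    if (!in.delta_token_ids.empty()) in.delta_token_ids.pop_back();
    return std::string("stop");
  }
  // Case 5: completion budget used up.
  if (in.max_tokens != -1 && in.num_completion_tokens >= in.max_tokens) {
    return std::string("length");
  }
  // Case 6: prompt plus completion no longer fits in one sequence.
  if (in.prompt_length + in.num_completion_tokens >
      in.max_single_sequence_length) {
    return std::string("length");
  }
  return std::nullopt;
}
```

Checking the "stop" conditions before the "length" conditions means a request that hits a stop token on its final allowed token is reported as "stop" rather than "length".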

Class: RequestState

Constructor

RequestState::RequestState(std::vector<RequestStateEntry> entries, int num_response,
                           std::chrono::high_resolution_clock::time_point add_time_point);

Constructs the top-level request state from a vector of state entries. It:

Initializes prompt token metrics from the first entry's request
Records the add time point for latency tracking
Constructs the initial RequestStreamOutput with the appropriate number of response slots
Optionally initializes log-probability tracking based on generation_cfg->logprobs
Marks the initial stream output as unpacked = true for direct access
Stores the stream output in the postproc_states workspace
