Implementation:Mlc ai Mlc llm Action Commons
Overview
File: cpp/serve/engine_actions/action_commons.h
Purpose: This header declares common utility functions shared across multiple EngineAction implementations in the MLC LLM serving engine. These functions handle lifecycle operations for requests -- including creation of engine actions, removal of requests from models, post-processing after action steps, preemption of running requests, and combined logit processing with token sampling.
Namespace: mlc::llm::serve
Header Dependencies
The file includes the following headers to access the core serving infrastructure. All are project-internal except the TVM FFI array header:
| Header | Purpose |
|---|---|
| tvm/ffi/container/array.h | TVM Array container type |
| ../../tokenizers/tokenizers.h | Tokenizer interface for logprob processing |
| ../draft_token_workspace_manager.h | Draft token workspace management for speculative decoding |
| ../engine.h | Core engine definitions |
| ../engine_state.h | Engine state management |
| ../event_trace_recorder.h | Event tracing infrastructure |
| ../model.h | Model interface |
| action.h | Base EngineAction definitions |
Function Declarations
CreateEngineActions
Array<EngineAction> CreateEngineActions(
Array<Model> models, EngineConfig engine_config, std::vector<picojson::object> model_configs,
std::vector<ModelWorkspace> model_workspaces, LogitProcessor logit_processor, Sampler sampler,
DraftTokenWorkspaceManager draft_token_workspace_manager, Tokenizer tokenizer,
Optional<EventTraceRecorder> trace_recorder, FRequestStreamCallback request_stream_callback,
Device device);
Factory function that constructs the complete set of engine actions based on the provided engine configuration. This is the main entry point for setting up the action pipeline. It takes the full set of engine components -- models, configuration, logit processing, sampling, tokenization, and tracing -- and returns an array of EngineAction objects that the engine will execute in sequence.
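The factory pattern described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `ToyConfig`, `ToyAction`, `MakeToyActions`, the `speculative` flag, and the action names are stand-ins chosen for illustration and are not the MLC LLM API. The real function takes the parameters listed in the declaration and returns `Array<EngineAction>`.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for EngineConfig; the flag is an assumption.
struct ToyConfig {
  bool speculative = false;
};

// Hypothetical stand-in for EngineAction: a named step the engine can run.
struct ToyAction {
  std::string name;
  std::function<void()> step;
};

// Builds the ordered list of actions implied by the configuration.
std::vector<ToyAction> MakeToyActions(const ToyConfig& config) {
  std::vector<ToyAction> actions;
  actions.push_back({"prefill", [] {}});
  if (config.speculative) {
    // Assumed split for illustration: propose draft tokens, then verify them.
    actions.push_back({"draft", [] {}});
    actions.push_back({"verify", [] {}});
  } else {
    actions.push_back({"decode", [] {}});
  }
  return actions;
}
```

The point of the sketch is only that the caller receives a ready-made, ordered action list and never branches on the configuration itself.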
RemoveRequestFromModel
void RemoveRequestFromModel(EngineState estate, int64_t req_internal_id,
const Array<Model>& models);
Removes a request identified by its internal ID from all models. This function updates the engine state accordingly after the removal. It is called when a request completes or is otherwise evicted from the serving pipeline.
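A rough sketch of the removal logic, under simplified assumptions, looks like the following. `ToyModel`, `ToyEngineState`, and `RemoveRequestFromToyModels` are hypothetical names; the real function operates on `EngineState` and `Array<Model>`, and what exactly it updates in the engine state is not specified here, so the ID set below is an assumption.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical model that tracks the sequences it currently holds by ID.
struct ToyModel {
  std::unordered_set<int64_t> live_ids;
  void RemoveSequence(int64_t id) { live_ids.erase(id); }
};

// Hypothetical engine state: only the set of internal IDs in use.
struct ToyEngineState {
  std::unordered_set<int64_t> used_ids;
};

// Drops the request from every model, then releases its ID in the state.
void RemoveRequestFromToyModels(ToyEngineState& estate, int64_t req_internal_id,
                                std::vector<ToyModel>& models) {
  for (ToyModel& model : models) {
    model.RemoveSequence(req_internal_id);
  }
  estate.used_ids.erase(req_internal_id);
}
```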
ActionStepPostProcess
void ActionStepPostProcess(Array<Request> requests, EngineState estate, const Array<Model>& models,
const Tokenizer& tokenizer,
FRequestStreamCallback request_stream_callback,
int64_t max_single_sequence_length,
Optional<DraftTokenWorkspaceManager> draft_token_workspace_manager,
Optional<EventTraceRecorder> trace_recorder);
Performs post-processing after each engine action step. This function handles two primary responsibilities:
- Invokes the request stream callback to return newly generated tokens to the caller.
- Updates the engine state to reflect finished requests.
Important: This function may remove requests from the running_queue. The max_single_sequence_length parameter is used to determine whether a request has reached its maximum length and should be terminated.
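The two responsibilities, and the side effect on the running queue, can be sketched as follows. This is a simplified illustration, not the actual implementation: `ToyRequest`, `ToyStreamCallback`, and `ToyPostProcess` are hypothetical, the callback shape is assumed, and the length check counts only generated tokens.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a running request; not the MLC LLM Request type.
struct ToyRequest {
  std::string id;
  std::vector<int32_t> tokens;   // every token generated so far
  std::size_t num_streamed = 0;  // tokens already returned to the caller
  bool hit_stop = false;         // assumed flag: a stop condition was met
  bool finished = false;
};

// Assumed callback shape: (request id, newly generated tokens, finished flag).
using ToyStreamCallback =
    std::function<void(const std::string&, const std::vector<int32_t>&, bool)>;

void ToyPostProcess(std::vector<ToyRequest>& running_queue,
                    const ToyStreamCallback& callback,
                    int64_t max_single_sequence_length) {
  for (ToyRequest& req : running_queue) {
    // Collect only the tokens that have not been streamed yet.
    std::vector<int32_t> delta(
        req.tokens.begin() + static_cast<std::ptrdiff_t>(req.num_streamed),
        req.tokens.end());
    req.num_streamed = req.tokens.size();
    // A request ends on a stop condition or when it reaches the length cap.
    req.finished = req.hit_stop ||
                   static_cast<int64_t>(req.tokens.size()) >= max_single_sequence_length;
    callback(req.id, delta, req.finished);
  }
  // Finished requests leave the running queue.
  running_queue.erase(
      std::remove_if(running_queue.begin(), running_queue.end(),
                     [](const ToyRequest& r) { return r.finished; }),
      running_queue.end());
}
```

Because the queue can shrink during this call, a caller should not hold indices or iterators into it across the call.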
PreemptLastRunningRequestStateEntry
RequestStateEntry PreemptLastRunningRequestStateEntry(
EngineState estate, const Array<Model>& models,
Optional<DraftTokenWorkspaceManager> draft_token_workspace_manager,
Optional<EventTraceRecorder> trace_recorder);
Preempts the last (lowest-priority) running request state entry from the running_queue. The behavior is as follows:
- If all entries of the selected request have been preempted, the request is removed from the running set.
- If the preempted request is not already in the waiting queue, it is added back to the waiting queue for later rescheduling.
- The draft_token_workspace_manager must be provided when speculative decoding is enabled, to properly reclaim draft token workspace resources.
Returns the preempted RequestStateEntry.
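The queue bookkeeping in the rules above can be sketched with toy types. The names `ToyEntry`, `ToyRequest`, `ToyEngineState`, and `PreemptLastToyEntry` are hypothetical, the sketch returns an entry index rather than a `RequestStateEntry`, and it leaves out model state and draft token workspace reclamation. Appending to the back of the waiting queue is an assumption of the sketch, not a statement about the real ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a request state entry.
struct ToyEntry {
  bool preempted = false;
};

// Hypothetical request that owns one or more entries.
struct ToyRequest {
  std::string id;
  std::vector<ToyEntry> entries;
};

struct ToyEngineState {
  std::vector<ToyRequest*> running_queue;  // last element is preempted first
  std::vector<ToyRequest*> waiting_queue;
};

// Preempts one entry of the last running request and returns its index.
std::size_t PreemptLastToyEntry(ToyEngineState& estate) {
  assert(!estate.running_queue.empty());
  ToyRequest* request = estate.running_queue.back();

  // Pick the last entry of that request that is still running.
  std::size_t idx = request->entries.size();
  while (idx > 0 && request->entries[idx - 1].preempted) --idx;
  assert(idx > 0);  // a running request has at least one live entry
  --idx;
  request->entries[idx].preempted = true;

  // Once every entry is preempted, the request leaves the running queue.
  bool all_preempted =
      std::all_of(request->entries.begin(), request->entries.end(),
                  [](const ToyEntry& e) { return e.preempted; });
  if (all_preempted) estate.running_queue.pop_back();

  // Queue the request for rescheduling unless it is already waiting.
  auto& waiting = estate.waiting_queue;
  if (std::find(waiting.begin(), waiting.end(), request) == waiting.end()) {
    waiting.push_back(request);
  }
  return idx;
}
```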
ApplyLogitProcessorAndSample
std::pair<Tensor, std::vector<SampleResult>> ApplyLogitProcessorAndSample(
const LogitProcessor& logit_processor, const Sampler& sampler, const Tensor& logits,
const Array<GenerationConfig>& generation_cfg, const Array<String>& request_ids,
const Array<RequestModelState>& mstates, const std::vector<RandomGenerator*>& rngs,
const std::vector<int>& sample_indices, const Array<GenerationConfig>& child_generation_cfg,
const Array<String>& child_request_ids, const std::vector<int>& child_sample_indices);
Combines logit processing and token sampling into a single utility function. It takes both parent and child request configurations:
- Parent configurations are used to process the logits and normalize the probability distributions.
- Child configurations are used to sample the actual tokens.
When a request does not have children (i.e., no speculative decoding tree branching), the parent and child configurations are identical. The function returns a pair containing the processed logits tensor and a vector of SampleResult values.
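The parent/child split can be illustrated with a small CPU-only sketch. It is an assumption-laden simplification: `ToyGenerationConfig`, `Row`, and `ToyProcessAndSample` are hypothetical, temperature scaling stands in for the full logit processor, the child configurations are reduced to one random generator per child, and the first element of the returned pair holds normalized probabilities. `child_sample_indices[c]` names the parent row that child `c` samples from.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for GenerationConfig; only temperature is modeled.
struct ToyGenerationConfig {
  double temperature = 1.0;
};

using Row = std::vector<double>;

std::pair<std::vector<Row>, std::vector<int>> ToyProcessAndSample(
    const std::vector<Row>& logits,
    const std::vector<ToyGenerationConfig>& parent_cfg,
    const std::vector<int>& child_sample_indices,
    std::vector<std::mt19937>& child_rngs) {
  assert(logits.size() == parent_cfg.size());
  assert(child_sample_indices.size() == child_rngs.size());

  // Parent step: scale by temperature and normalize each row with softmax.
  std::vector<Row> probs = logits;
  for (std::size_t i = 0; i < probs.size(); ++i) {
    Row& row = probs[i];
    assert(!row.empty() && parent_cfg[i].temperature > 0.0);
    const double max_logit = *std::max_element(row.begin(), row.end());
    double sum = 0.0;
    for (double& v : row) {
      v = std::exp((v - max_logit) / parent_cfg[i].temperature);
      sum += v;
    }
    for (double& v : row) v /= sum;
  }

  // Child step: each child draws one token from its parent's distribution.
  std::vector<int> tokens;
  for (std::size_t c = 0; c < child_sample_indices.size(); ++c) {
    const Row& row = probs[static_cast<std::size_t>(child_sample_indices[c])];
    std::discrete_distribution<int> dist(row.begin(), row.end());
    tokens.push_back(dist(child_rngs[c]));
  }
  return std::make_pair(std::move(probs), std::move(tokens));
}
```

Normalizing once per parent and sampling once per child avoids repeating the logit processing for every child that shares a parent row.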
Design Notes
- The file contains a using namespace tvm::runtime; directive, which brings TVM runtime types into the mlc::llm::serve namespace.
- Several functions accept Optional wrappers for components that may not be present in all configurations (e.g., EventTraceRecorder and DraftTokenWorkspaceManager), allowing the same code paths to be used for both standard and speculative decoding modes.
- The header is protected by an include guard: MLC_LLM_SERVE_ENGINE_ACTIONS_ACTION_COMMONS_H_.