Implementation:Mlc ai Mlc llm Batch Prefill Base
Overview
File: cpp/serve/engine_actions/batch_prefill_base.h
Purpose: Defines the abstract base class BatchPrefillBaseActionObj for prefill engine actions. Prefill is the initial phase of LLM inference where the model processes the full input prompt before generating tokens. This base class provides the common infrastructure for selecting requests from the waiting queue, chunking input data, managing prefix cache matching, and updating request states after sampling. Concrete subclasses implement the actual prefill execution and prefix cache matching strategy.
Namespace: mlc::llm::serve
Struct: PrefillInput
struct PrefillInput {
RequestStateEntry rsentry;
int max_prefill_length = 0;
int num_child_to_activate = 0;
bool is_decode = false;
};
A protected inner structure that bundles a request state entry with its prefill parameters:
| Field | Type | Description |
|---|---|---|
| rsentry | RequestStateEntry | The request state entry to prefill |
| max_prefill_length | int | Maximum number of tokens allowed for this prefill operation |
| num_child_to_activate | int | Number of child entries to activate (relevant for speculative decoding tree structures) |
| is_decode | bool | Whether this entry is a decode operation rather than a true prefill |
Class: BatchPrefillBaseActionObj
Inherits from EngineActionObj and provides the shared prefill logic. The class is abstract due to the pure virtual MatchPrefixCache method.
Constructor
BatchPrefillBaseActionObj(Array<Model> models, EngineConfig engine_config,
std::vector<picojson::object> model_configs,
Optional<EventTraceRecorder> trace_recorder);
Initializes the base prefill action with models, engine configuration, per-model configuration objects, and an optional trace recorder.
GetRequestStateEntriesToPrefill
std::vector<PrefillInput> GetRequestStateEntriesToPrefill(EngineState estate);
Selects one or more request state entries from the engine state's waiting queue that are eligible for prefill. Returns a vector of PrefillInput structures containing the selected entries along with their computed maximum prefill lengths.
CanPrefill
bool CanPrefill(EngineState estate, int num_prefill_rsentries, int total_input_length,
int num_required_pages, int num_available_pages, int current_total_seq_len,
int num_running_rsentries, KVStateKind kv_state_kind,
bool sliding_window_enabled);
Determines whether the selected requests can be prefilled given current resource constraints. Checks memory availability (KV cache pages), sequence length limits, and batch size constraints.
ChunkPrefillInputData
std::pair<Array<Data>, int> ChunkPrefillInputData(const RequestModelState& mstate,
int max_prefill_length);
Splits the input data of a RequestModelState into chunks that fit within the specified maximum prefill length. Returns a pair of the chunked input data array and the total prefill length. Side effect: Mutates the inputs field of mstate to exclude the returned input, enabling incremental chunked prefill across multiple action steps.
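The chunking behavior described above can be illustrated with a minimal sketch. It assumes inputs are plain token vectors held in a deque; `TokenChunk` and `ChunkTokens` are hypothetical names for illustration, not part of the MLC LLM API, and the real method operates on `Array<Data>` inside a `RequestModelState`.

```cpp
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical simplification: each pending input is a plain vector of token IDs.
using TokenChunk = std::vector<int32_t>;

// Takes inputs from the front of `pending` until `max_prefill_length` tokens are
// selected. An input that does not fit is split; its remainder stays in `pending`
// so that a later step can continue from there.
std::pair<std::vector<TokenChunk>, int> ChunkTokens(std::deque<TokenChunk>& pending,
                                                    int max_prefill_length) {
  std::vector<TokenChunk> selected;
  int total = 0;
  while (!pending.empty() && total < max_prefill_length) {
    TokenChunk& front = pending.front();
    const int budget = max_prefill_length - total;
    if (static_cast<int>(front.size()) <= budget) {
      // The whole input fits: move it out of the pending list.
      total += static_cast<int>(front.size());
      selected.push_back(std::move(front));
      pending.pop_front();
    } else {
      // Only a prefix fits: copy it out and keep the tail pending.
      selected.emplace_back(front.begin(), front.begin() + budget);
      front.erase(front.begin(), front.begin() + budget);
      total += budget;
    }
  }
  return {std::move(selected), total};
}
```

Calling the function again on the same `pending` deque continues from the leftover tokens, which mirrors the incremental behavior the side effect on `mstate` makes possible.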
UpdateRequestToAlive
void UpdateRequestToAlive(const std::vector<PrefillInput>& prefill_inputs,
const EngineState& estate, Array<String>* request_ids,
std::vector<RequestState>* rstates_of_entries,
std::vector<RequestStateStatus>* status_before_prefill);
Transitions request states from pending to alive status. Collects the request IDs, request states, and pre-prefill status into the provided output parameters. This is called at the beginning of prefill execution to mark requests as actively being processed.
RemoveProcessedRequests
std::vector<Request> RemoveProcessedRequests(const std::vector<PrefillInput>& prefill_inputs,
const EngineState& estate,
const std::vector<RequestState>& rstates_of_entries);
Removes requests from the waiting queue when all of their request states are alive and have no remaining chunked inputs to process. Returns the list of requests that were fully processed and moved to the running queue.
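The removal rule above (all states alive, no remaining chunked inputs) could be sketched as follows. `PendingRequest` and `RemoveFullyProcessed` are hypothetical stand-ins; the real method works on `EngineState` and `RequestState` objects and also handles the running queue.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, flattened view of a request in the waiting queue.
struct PendingRequest {
  std::string id;
  bool all_states_alive;       // every request state has been activated
  int remaining_input_chunks;  // chunked inputs still waiting for prefill
};

// Removes fully processed requests from `waiting_queue` and returns them,
// keeping the relative order of both groups.
std::vector<PendingRequest> RemoveFullyProcessed(std::vector<PendingRequest>& waiting_queue) {
  auto keep_end = std::stable_partition(
      waiting_queue.begin(), waiting_queue.end(), [](const PendingRequest& r) {
        // Requests that still need work stay at the front of the queue.
        return !(r.all_states_alive && r.remaining_input_chunks == 0);
      });
  std::vector<PendingRequest> processed(std::make_move_iterator(keep_end),
                                        std::make_move_iterator(waiting_queue.end()));
  waiting_queue.erase(keep_end, waiting_queue.end());
  return processed;
}
```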
UpdateRequestStateEntriesWithSampleResults
void UpdateRequestStateEntriesWithSampleResults(
const std::vector<RequestStateEntry>& rsentries_for_sample,
const std::vector<bool>& rsentry_activated,
const std::vector<SampleResult>& sample_results);
Updates committed tokens in request model states based on sampling results. For first-time prefilled requests, also records the prefill finish time in the request metrics.
GetConcatPrefillInputData
std::vector<int32_t> GetConcatPrefillInputData(const RequestModelState& mstate);
Concatenates all tokenized input data from a RequestModelState into a single flat vector of token IDs. Returns an empty vector if untokenized data is present (since untokenized data cannot be concatenated as integer tokens).
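A minimal sketch of this concatenation rule, assuming each input item carries a flag saying whether it is tokenized. `InputItem` and `ConcatTokenInputs` are hypothetical names; the real code inspects `Data` subclasses instead of a flag.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical input item: either tokenized (token_ids is valid) or not
// (e.g., raw text or an image embedding that has no token IDs yet).
struct InputItem {
  bool is_tokenized;
  std::vector<int32_t> token_ids;
};

// Flattens all inputs into one token vector. If any item is untokenized,
// the result is empty, because such data has no integer-token form.
std::vector<int32_t> ConcatTokenInputs(const std::vector<InputItem>& inputs) {
  std::vector<int32_t> flat;
  for (const InputItem& item : inputs) {
    if (!item.is_tokenized) {
      return {};
    }
    flat.insert(flat.end(), item.token_ids.begin(), item.token_ids.end());
  }
  return flat;
}
```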
PopPrefillInputData
void PopPrefillInputData(const RequestModelState& mstate, size_t num_tokens);
Removes the first num_tokens prefix tokens from the input data array of a RequestModelState. Used after prefix cache matching to skip tokens that are already cached.
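Dropping a matched prefix that may span several input chunks can be sketched like this. The deque-of-vectors layout and the name `PopPrefixTokens` are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Removes the first `num_tokens` tokens from `pending`, where the tokens may be
// spread over several chunks. Whole chunks are dropped; a partially covered
// chunk is trimmed from the front.
void PopPrefixTokens(std::deque<std::vector<int32_t>>& pending, size_t num_tokens) {
  while (num_tokens > 0 && !pending.empty()) {
    std::vector<int32_t>& front = pending.front();
    if (front.size() <= num_tokens) {
      num_tokens -= front.size();
      pending.pop_front();
    } else {
      front.erase(front.begin(), front.begin() + static_cast<std::ptrdiff_t>(num_tokens));
      num_tokens = 0;
    }
  }
}
```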
MatchPrefixCache (Pure Virtual)
virtual int MatchPrefixCache(EngineState estate, PrefillInput* input) = 0;
Abstract method that subclasses must implement to perform prefix cache matching. When a request's input tokens match a prefix already in the KV cache, this method skips redundant computation by forking from the cached state. Returns the matched prefix length.
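To illustrate how a pure virtual matching hook lets strategies be swapped, here is a small sketch with a naive linear-scan strategy. `PrefixMatcher` and `LinearScanMatcher` are hypothetical classes; they are not the subclasses shipped with MLC LLM, and a production strategy would typically use a dedicated prefix cache structure rather than scanning every cached sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the abstract hook: returns the matched prefix length.
class PrefixMatcher {
 public:
  virtual ~PrefixMatcher() = default;
  virtual size_t Match(const std::vector<int32_t>& tokens) const = 0;
};

// Naive strategy: compare against every cached sequence and keep the longest
// common prefix found.
class LinearScanMatcher : public PrefixMatcher {
 public:
  explicit LinearScanMatcher(std::vector<std::vector<int32_t>> cached)
      : cached_(std::move(cached)) {}

  size_t Match(const std::vector<int32_t>& tokens) const override {
    size_t best = 0;
    for (const std::vector<int32_t>& seq : cached_) {
      const size_t limit = std::min(seq.size(), tokens.size());
      size_t n = 0;
      while (n < limit && seq[n] == tokens[n]) {
        ++n;
      }
      best = std::max(best, n);
    }
    return best;
  }

 private:
  std::vector<std::vector<int32_t>> cached_;
};
```

The caller only depends on `PrefixMatcher`, so a different strategy can replace `LinearScanMatcher` without touching the code that consumes the matched length.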
Member Variables
| Member | Type | Description |
|---|---|---|
| models_ | Array<Model> | The models to run prefill on |
| engine_config_ | EngineConfig | The engine configuration |
| kv_state_kind_ | KVStateKind | The type of KV state (e.g., KV cache or RNN state) |
| sliding_window_sizes_ | std::vector<int> | Per-model sliding window sizes |
| trace_recorder_ | Optional<EventTraceRecorder> | Optional event trace recorder |
Free Function: HasPrefillSpace
bool HasPrefillSpace(int num_required_pages, bool sliding_window_enabled, int new_batch_size,
int num_available_pages, int current_total_seq_len, int total_input_length,
int max_total_sequence_length);
A standalone utility function that checks whether the KV cache has sufficient spare capacity for a prefill operation, considering the number of required pages, current total sequence length, and the maximum total sequence length limit.
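A capacity check of this kind might look like the sketch below. The parameter list follows the declaration above, but the policy inside is an assumption made for illustration (one extra page reserved per sequence when no sliding window is used, and the total-length limit skipped under a sliding window); the actual function may apply different rules or headroom.

```cpp
// Hypothetical sketch; the name and the policy details are assumptions.
bool HasPrefillSpaceSketch(int num_required_pages, bool sliding_window_enabled,
                           int new_batch_size, int num_available_pages,
                           int current_total_seq_len, int total_input_length,
                           int max_total_sequence_length) {
  // Assumption: without a sliding window, reserve one page per sequence so the
  // batch can still decode after the prefill.
  const int reserved_pages = sliding_window_enabled ? 0 : new_batch_size;
  const bool pages_ok = num_required_pages + reserved_pages <= num_available_pages;

  // Assumption: a sliding window bounds per-sequence growth, so the total
  // sequence length limit is only enforced when it is disabled.
  const bool length_ok =
      sliding_window_enabled ||
      current_total_seq_len + total_input_length <= max_total_sequence_length;

  return pages_ok && length_ok;
}
```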
Design Notes
- The abstract MatchPrefixCache method enables different prefix cache strategies (e.g., radix-tree based, hash-based) to be plugged into the same prefill pipeline.
- Chunked prefill support (via ChunkPrefillInputData and PopPrefillInputData) allows long prompts to be processed incrementally, preventing out-of-memory conditions and enabling better batching.
- The PrefillInput::is_decode flag handles the case where a "prefill" action actually needs to perform a decode step (e.g., when the input is already fully cached and only a single new token needs to be processed).