Implementation:Mlc ai Mlc llm Batch Prefill Base

Overview

File: cpp/serve/engine_actions/batch_prefill_base.h

Purpose: Defines the abstract base class BatchPrefillBaseActionObj for prefill engine actions. Prefill is the initial phase of LLM inference where the model processes the full input prompt before generating tokens. This base class provides the common infrastructure for selecting requests from the waiting queue, chunking input data, managing prefix cache matching, and updating request states after sampling. Concrete subclasses implement the actual prefill execution and prefix cache matching strategy.

Namespace: mlc::llm::serve

Struct: PrefillInput

struct PrefillInput {
  RequestStateEntry rsentry;
  int max_prefill_length = 0;
  int num_child_to_activate = 0;
  bool is_decode = false;
};

A protected structure nested inside BatchPrefillBaseActionObj that bundles a request state entry with its prefill parameters:

Field	Type	Description
`rsentry`	`RequestStateEntry`	The request state entry to prefill
`max_prefill_length`	`int`	Maximum number of tokens allowed for this prefill operation
`num_child_to_activate`	`int`	Number of child entries to activate (relevant for speculative decoding tree structures)
`is_decode`	`bool`	Whether this entry is a decode operation rather than a true prefill

Class: BatchPrefillBaseActionObj

Inherits from EngineActionObj and provides the shared prefill logic. The class is abstract due to the pure virtual MatchPrefixCache method.

Constructor

BatchPrefillBaseActionObj(Array<Model> models, EngineConfig engine_config,
                          std::vector<picojson::object> model_configs,
                          Optional<EventTraceRecorder> trace_recorder);

Initializes the base prefill action with models, engine configuration, per-model configuration objects, and an optional trace recorder.

GetRequestStateEntriesToPrefill

std::vector<PrefillInput> GetRequestStateEntriesToPrefill(EngineState estate);

Selects one or more request state entries from the engine state's waiting queue that are eligible for prefill. Returns a vector of PrefillInput structures containing the selected entries along with their computed maximum prefill lengths.

CanPrefill

bool CanPrefill(EngineState estate, int num_prefill_rsentries, int total_input_length,
                int num_required_pages, int num_available_pages, int current_total_seq_len,
                int num_running_rsentries, KVStateKind kv_state_kind,
                bool sliding_window_enabled);

Determines whether the selected requests can be prefilled given current resource constraints. Checks memory availability (KV cache pages), sequence length limits, and batch size constraints.

ChunkPrefillInputData

std::pair<Array<Data>, int> ChunkPrefillInputData(const RequestModelState& mstate,
                                                  int max_prefill_length);

Splits the input data of a RequestModelState into chunks that fit within the specified maximum prefill length. Returns a pair of the chunked input data array and the total prefill length. Side effect: Mutates the inputs field of mstate to exclude the returned input, enabling incremental chunked prefill across multiple action steps.
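The chunking behavior described above can be illustrated with a minimal sketch. It is not the header's implementation: `TokenSegment` and `ChunkInputs` are hypothetical names, and the sketch assumes every input is a plain token sequence that can be split at any position, which may not hold for other kinds of input data.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one tokenized input segment.
using TokenSegment = std::vector<int>;

// Takes up to max_prefill_length tokens from the front of `inputs`.
// Returns the taken chunk and its total token count; `inputs` is mutated
// so that it holds only the tokens that were not taken.
std::pair<std::vector<TokenSegment>, int> ChunkInputs(std::vector<TokenSegment>& inputs,
                                                      int max_prefill_length) {
  std::vector<TokenSegment> chunk;
  int taken = 0;
  std::size_t i = 0;
  for (; i < inputs.size() && taken < max_prefill_length; ++i) {
    const int room = max_prefill_length - taken;
    const int len = static_cast<int>(inputs[i].size());
    if (len <= room) {
      // The whole segment fits into the current chunk.
      chunk.push_back(inputs[i]);
      taken += len;
    } else {
      // Split the segment: the head joins the chunk, the tail stays queued.
      chunk.emplace_back(inputs[i].begin(), inputs[i].begin() + room);
      inputs[i].erase(inputs[i].begin(), inputs[i].begin() + room);
      taken += room;
      break;
    }
  }
  // Drop the segments that were consumed in full.
  inputs.erase(inputs.begin(), inputs.begin() + static_cast<std::ptrdiff_t>(i));
  return {chunk, taken};
}
```

Calling the function repeatedly on the same `inputs` vector yields successive chunks until the vector is empty, which mirrors how chunked prefill spreads one long prompt across several action steps.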

UpdateRequestToAlive

void UpdateRequestToAlive(const std::vector<PrefillInput>& prefill_inputs,
                          const EngineState& estate, Array<String>* request_ids,
                          std::vector<RequestState>* rstates_of_entries,
                          std::vector<RequestStateStatus>* status_before_prefill);

Transitions request states from pending to alive status. Collects the request IDs, request states, and pre-prefill status into the provided output parameters. This is called at the beginning of prefill execution to mark requests as actively being processed.

RemoveProcessedRequests

std::vector<Request> RemoveProcessedRequests(const std::vector<PrefillInput>& prefill_inputs,
                                             const EngineState& estate,
                                             const std::vector<RequestState>& rstates_of_entries);

Removes requests from the waiting queue when all of their request states are alive and have no remaining chunked inputs to process. Returns the list of requests that were fully processed and moved to the running queue.
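The removal condition can be sketched as follows. `EntrySketch`, `RequestSketch`, and `RemoveProcessed` are hypothetical simplifications introduced for illustration; the real method works on the engine state and request state objects, whose layout is not shown on this page.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified view of one request state entry.
struct EntrySketch {
  bool alive;            // entry has been activated by a prefill step
  int remaining_inputs;  // chunked inputs still waiting to be prefilled
};

// Hypothetical, simplified view of a request and its entries.
struct RequestSketch {
  std::string id;
  std::vector<EntrySketch> entries;
};

// Removes every request whose entries are all alive with no remaining inputs
// from the waiting queue and returns the removed requests in queue order.
std::vector<RequestSketch> RemoveProcessed(std::vector<RequestSketch>& waiting_queue) {
  auto fully_processed = [](const RequestSketch& request) {
    return std::all_of(request.entries.begin(), request.entries.end(),
                       [](const EntrySketch& e) { return e.alive && e.remaining_inputs == 0; });
  };
  // Keep unfinished requests at the front, preserving their relative order.
  auto first_done = std::stable_partition(
      waiting_queue.begin(), waiting_queue.end(),
      [&](const RequestSketch& request) { return !fully_processed(request); });
  std::vector<RequestSketch> processed(std::make_move_iterator(first_done),
                                       std::make_move_iterator(waiting_queue.end()));
  waiting_queue.erase(first_done, waiting_queue.end());
  return processed;
}
```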

UpdateRequestStateEntriesWithSampleResults

void UpdateRequestStateEntriesWithSampleResults(
    const std::vector<RequestStateEntry>& rsentries_for_sample,
    const std::vector<bool>& rsentry_activated,
    const std::vector<SampleResult>& sample_results);

Updates committed tokens in request model states based on sampling results. For first-time prefilled requests, also records the prefill finish time in the request metrics.

GetConcatPrefillInputData

std::vector<int32_t> GetConcatPrefillInputData(const RequestModelState& mstate);

Concatenates all tokenized input data from a RequestModelState into a single flat vector of token IDs. Returns an empty vector if untokenized data is present (since untokenized data cannot be concatenated as integer tokens).
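A minimal sketch of this contract, assuming C++17 and modeling each input as either a token vector or raw text. `InputSegment` and `ConcatTokenInputs` are hypothetical names, not types from the header.

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical input segment: either token IDs or raw, untokenized text.
using InputSegment = std::variant<std::vector<int32_t>, std::string>;

// Flattens all token segments into one vector. If any segment is untokenized,
// the result is empty, because raw text has no token IDs to append.
std::vector<int32_t> ConcatTokenInputs(const std::vector<InputSegment>& inputs) {
  std::vector<int32_t> tokens;
  for (const InputSegment& segment : inputs) {
    const auto* ids = std::get_if<std::vector<int32_t>>(&segment);
    if (ids == nullptr) {
      return {};  // untokenized data present
    }
    tokens.insert(tokens.end(), ids->begin(), ids->end());
  }
  return tokens;
}
```

Note that an empty result is ambiguous in this sketch: it can mean either "untokenized data present" or "no tokens at all", so a caller that needs to distinguish the two cases has to check the inputs separately.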

PopPrefillInputData

void PopPrefillInputData(const RequestModelState& mstate, size_t num_tokens);

Removes the first num_tokens prefix tokens from the input data array of a RequestModelState. Used after prefix cache matching to skip tokens that are already cached.
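The prefix removal can be sketched on a list of token segments. `TokenSegment` and `PopPrefixTokens` are hypothetical names, and the sketch assumes all inputs are tokenized; it silently ignores a request to pop more tokens than exist, which is a choice of the sketch and not documented behavior.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one tokenized input segment.
using TokenSegment = std::vector<int>;

// Drops the first num_tokens tokens from `inputs`, removing whole segments
// where possible and trimming the front of the first partially matched one.
void PopPrefixTokens(std::vector<TokenSegment>& inputs, std::size_t num_tokens) {
  std::size_t consumed_segments = 0;
  while (consumed_segments < inputs.size() &&
         num_tokens >= inputs[consumed_segments].size()) {
    num_tokens -= inputs[consumed_segments].size();
    ++consumed_segments;
  }
  inputs.erase(inputs.begin(),
               inputs.begin() + static_cast<std::ptrdiff_t>(consumed_segments));
  if (num_tokens > 0 && !inputs.empty()) {
    // The cached prefix ends inside this segment: keep only its tail.
    TokenSegment& front = inputs.front();
    front.erase(front.begin(), front.begin() + static_cast<std::ptrdiff_t>(num_tokens));
  }
}
```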

MatchPrefixCache (Pure Virtual)

virtual int MatchPrefixCache(EngineState estate, PrefillInput* input) = 0;

Abstract method that subclasses must implement to perform prefix cache matching. When a request's input tokens match a prefix already in the KV cache, this method skips redundant computation by forking from the cached state. Returns the matched prefix length.
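The pattern of a shared pipeline delegating to a subclass-provided matching step can be sketched as below. All names (`PrefillBase`, `LinearScanPrefill`, `TokensToPrefill`, `MatchPrefix`) are hypothetical, and the linear scan over cached sequences is a deliberately naive stand-in for whatever matching structure a real subclass uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical base class: owns the shared pipeline, defers matching.
class PrefillBase {
 public:
  virtual ~PrefillBase() = default;

  // Shared step: ask the subclass how many leading tokens are cached and
  // report how many tokens still need to be prefilled.
  std::size_t TokensToPrefill(const std::vector<int>& prompt) {
    const std::size_t matched = MatchPrefix(prompt);
    return prompt.size() - matched;
  }

 protected:
  // Strategy hook: returns the length of the longest cached prefix.
  virtual std::size_t MatchPrefix(const std::vector<int>& prompt) = 0;
};

// Hypothetical subclass: scans every cached sequence for the longest match.
class LinearScanPrefill : public PrefillBase {
 public:
  explicit LinearScanPrefill(std::vector<std::vector<int>> cached)
      : cached_(std::move(cached)) {}

 protected:
  std::size_t MatchPrefix(const std::vector<int>& prompt) override {
    std::size_t best = 0;
    for (const std::vector<int>& seq : cached_) {
      const std::size_t limit = std::min(seq.size(), prompt.size());
      std::size_t k = 0;
      while (k < limit && seq[k] == prompt[k]) ++k;
      best = std::max(best, k);
    }
    return best;
  }

 private:
  std::vector<std::vector<int>> cached_;
};
```

Because the base class only depends on the returned prefix length, a different matching strategy can be swapped in by writing another subclass, without touching the shared pipeline.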

Member Variables

Member	Type	Description
`models_`	`Array<Model>`	The models to run prefill on
`engine_config_`	`EngineConfig`	The engine configuration
`kv_state_kind_`	`KVStateKind`	The kind of KV state the models use (e.g., attention KV cache or RNN state)
`sliding_window_sizes_`	`std::vector<int>`	Per-model sliding window sizes
`trace_recorder_`	`Optional<EventTraceRecorder>`	Optional event trace recorder

Free Function: HasPrefillSpace

bool HasPrefillSpace(int num_required_pages, bool sliding_window_enabled, int new_batch_size,
                     int num_available_pages, int current_total_seq_len, int total_input_length,
                     int max_total_sequence_length);

A standalone utility function that checks whether the KV cache has sufficient spare capacity for a prefill operation, considering the number of required pages, current total sequence length, and the maximum total sequence length limit.
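The kind of check described above can be sketched as follows. This is not the header's logic: `HasSpaceSketch` is a hypothetical function, it omits the `new_batch_size` parameter, and the rule that a sliding window lifts the total sequence length bound is an assumption made for illustration.

```cpp
// Hypothetical capacity check. Assumptions of this sketch: pages are the only
// memory constraint, and the total sequence length bound is skipped when a
// sliding window is enabled.
bool HasSpaceSketch(int num_required_pages, bool sliding_window_enabled,
                    int num_available_pages, int current_total_seq_len,
                    int total_input_length, int max_total_sequence_length) {
  // The prefill must not need more pages than are currently free.
  const bool pages_fit = num_required_pages <= num_available_pages;
  // Without a sliding window, the new tokens must fit in the length budget.
  const bool length_fits =
      sliding_window_enabled ||
      current_total_seq_len + total_input_length <= max_total_sequence_length;
  return pages_fit && length_fits;
}
```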

Design Notes

The abstract MatchPrefixCache method enables different prefix cache strategies (e.g., radix-tree based, hash-based) to be plugged into the same prefill pipeline.
Chunked prefill support (via ChunkPrefillInputData and PopPrefillInputData) allows long prompts to be processed incrementally, preventing out-of-memory conditions and enabling better batching.
The PrefillInput::is_decode flag handles the case where a "prefill" action actually needs to perform a decode step (e.g., when the input is already fully cached and only a single new token needs to be processed).
