Implementation:Mlc ai Mlc llm Draft Token Workspace
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Speculative Decoding, Memory Management |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
The Draft Token Workspace Manager is a memory management component for speculative decoding in the MLC LLM serving engine. It manages a pool of workspace slots that store intermediate states associated with draft tokens during speculative inference, including probability distributions and hidden states.
Description
This header file (cpp/serve/draft_token_workspace_manager.h) defines the DraftTokenWorkspaceManagerObj class that manages workspace memory for draft token generation in speculative decoding scenarios. In speculative decoding, a smaller "draft" model proposes multiple candidate tokens, and each proposed token has associated state (probability distributions, hidden states) that must be stored temporarily.
The workspace manager provides:
- Slot-based allocation: Maintains a pool of free_slots_ that can be allocated and released. Each slot corresponds to storage for one draft token's associated state.
- Reference counting: Tracks slot usage through ref_count_, allowing slots to be shared across multiple consumers and freed only when all references are released.
- Workspace allocation: Allocates the underlying tensors (probability distributions and optionally hidden states) and stores them in the ModelWorkspace data structure.
- Two AllocSlots overloads: One allocates with a default reference count and another accepts explicit initial reference counts per slot.
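The slot and reference-count bookkeeping described above can be sketched in isolation. This is a minimal sketch, not the MLC LLM implementation: ToySlotPool and NumFree are hypothetical names, and taking slots from the back of the free list is an assumption made for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the slot bookkeeping; not the MLC LLM class.
class ToySlotPool {
 public:
  explicit ToySlotPool(int max_num_tokens) : free_slots_(max_num_tokens) {
    // Initially every slot index 0..max_num_tokens-1 is free.
    std::iota(free_slots_.begin(), free_slots_.end(), 0);
  }

  // Overload 1: every allocated slot starts with a reference count of 1.
  void AllocSlots(int num_slots, std::vector<int>* result) {
    AllocSlots(num_slots, std::vector<int>(num_slots, 1), result);
  }

  // Overload 2: the caller supplies one initial reference count per slot.
  void AllocSlots(int num_slots, const std::vector<int>& initial_ref_count,
                  std::vector<int>* result) {
    assert(num_slots >= 0 && num_slots <= static_cast<int>(free_slots_.size()));
    assert(static_cast<int>(initial_ref_count.size()) == num_slots);
    result->clear();
    for (int i = 0; i < num_slots; ++i) {
      int slot = free_slots_.back();  // assumption: take from the back
      free_slots_.pop_back();
      ref_count_[slot] = initial_ref_count[i];
      result->push_back(slot);
    }
  }

  // Drops one reference per listed slot; a slot rejoins the pool at zero.
  void FreeSlots(const std::vector<int>& slots) {
    for (int slot : slots) {
      auto it = ref_count_.find(slot);
      assert(it != ref_count_.end());
      if (--it->second == 0) {
        ref_count_.erase(it);
        free_slots_.push_back(slot);
      }
    }
  }

  int NumFree() const { return static_cast<int>(free_slots_.size()); }

 private:
  std::vector<int> free_slots_;
  std::unordered_map<int, int> ref_count_;
};
```

In this sketch, a slot allocated with an initial count of 2 needs two FreeSlots calls before it returns to the pool, which is the sharing behavior the reference-counting bullet describes.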
The class stores configuration parameters including:
- max_num_tokens_: maximum number of draft tokens in the pool
- vocab_size_: vocabulary size (for probability distribution tensors)
- hidden_size_: hidden state dimension
- hidden_states_dtype_: data type for hidden state tensors
- device_: the target device
- ft_: reference to the function table for tensor operations
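As a rough sizing aid, these parameters determine how much memory the workspace tensors need. The sketch below assumes one float32 probability tensor of shape [max_num_tokens, vocab_size] and, when requested, one hidden-state tensor of shape [max_num_tokens, hidden_size]. The helper names are hypothetical, and the real workspace may hold additional buffers, so treat the result as a lower-bound estimate.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for a dense [rows, cols] tensor.
// Element widths are rounded up to whole bytes.
inline std::int64_t TensorBytes(std::int64_t rows, std::int64_t cols,
                                int bits_per_element) {
  return rows * cols * ((bits_per_element + 7) / 8);
}

// Assumption: probabilities are stored as float32 (32 bits per element);
// hidden states use hidden_bits per element (e.g. 16 for float16).
inline std::int64_t WorkspaceBytes(int max_num_tokens, int vocab_size,
                                   int hidden_size, int hidden_bits,
                                   bool require_hidden_states) {
  std::int64_t probs = TensorBytes(max_num_tokens, vocab_size, 32);
  std::int64_t hidden =
      require_hidden_states
          ? TensorBytes(max_num_tokens, hidden_size, hidden_bits)
          : 0;
  return probs + hidden;
}
```

Under these assumptions, max_num_tokens=256, vocab_size=32000, hidden_size=4096 and float16 hidden states give 32,768,000 + 2,097,152 = 34,865,152 bytes.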
Usage
The draft token workspace manager is created during engine initialization when speculative decoding is enabled (i.e., when there are multiple models). It is used by the Eagle-style speculative decoding actions:
- During EagleNewRequestPrefill and EagleBatchDraft, slots are allocated for draft token states.
- During EagleBatchVerify, slots are consumed and freed after verification.
- The workspace is allocated once per step and reused across the batch.
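The allocate-in-draft, free-in-verify lifecycle above can be illustrated with a toy step loop. This is only a sketch under stated assumptions: ToyFreeList and RunStep are hypothetical names, reference counts are omitted for brevity, and the real engine actions do considerably more work per phase.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical free list standing in for the manager's slot pool.
struct ToyFreeList {
  std::vector<int> free_slots;

  explicit ToyFreeList(int n) : free_slots(n) {
    std::iota(free_slots.begin(), free_slots.end(), 0);
  }

  // Removes n slots from the pool and returns their indices.
  std::vector<int> Alloc(int n) {
    assert(n >= 0 && n <= static_cast<int>(free_slots.size()));
    std::vector<int> out(free_slots.end() - n, free_slots.end());
    free_slots.resize(free_slots.size() - n);
    return out;
  }

  // Returns the given slots to the pool.
  void Free(const std::vector<int>& slots) {
    free_slots.insert(free_slots.end(), slots.begin(), slots.end());
  }
};

// One speculative step: the draft phase takes slots for every request,
// then the verify phase returns them. The return value is the number of
// free slots left while all drafts were outstanding.
int RunStep(ToyFreeList* pool, int num_requests, int draft_length) {
  std::vector<std::vector<int>> per_request;
  for (int r = 0; r < num_requests; ++r) {
    per_request.push_back(pool->Alloc(draft_length));  // draft phase
  }
  int low_water = static_cast<int>(pool->free_slots.size());
  for (const auto& slots : per_request) {
    pool->Free(slots);  // verify phase
  }
  return low_water;
}
```

Because every slot taken during drafting is returned after verification, the pool is back at full capacity at the start of each step, so max_num_tokens only has to cover the draft tokens outstanding within a single step.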
Code Reference
Source Location
| Property | Value |
|---|---|
| File | cpp/serve/draft_token_workspace_manager.h |
| Namespace | mlc::llm::serve |
| Lines | 115 |
| Include Guard | MLC_LLM_SERVE_DRAFT_TOKEN_WORKSPACE_MANAGER_H_ |
Signature
namespace mlc {
namespace llm {
namespace serve {

class DraftTokenWorkspaceManagerObj : public Object {
 public:
  DraftTokenWorkspaceManagerObj(int max_num_tokens, int vocab_size, int hidden_size,
                                DLDataType hidden_states_dtype, DLDevice device,
                                const FunctionTable& ft);

  void AllocWorkspace(ModelWorkspace* workspace, bool require_hidden_states);
  void AllocSlots(int num_slots, std::vector<int>* result);
  void AllocSlots(int num_slots, const std::vector<int>& initial_ref_count,
                  std::vector<int>* result);
  void FreeSlots(const std::vector<int>& slots);

 private:
  std::vector<int> free_slots_;
  int max_num_tokens_;
  int vocab_size_;
  int hidden_size_;
  DataType hidden_states_dtype_;
  DLDevice device_;
  const FunctionTable& ft_;
  std::unordered_map<int, int> ref_count_;
};

class DraftTokenWorkspaceManager : public ObjectRef {
 public:
  DraftTokenWorkspaceManager(int max_num_tokens, int vocab_size, int hidden_size,
                             DLDataType hidden_states_dtype, DLDevice device,
                             const FunctionTable& ft);
};

}  // namespace serve
}  // namespace llm
}  // namespace mlc
Import
#include "serve/draft_token_workspace_manager.h"
Dependencies:
- tvm/ffi/reflection/registry.h for TVM reflection macros
- tvm/runtime/device_api.h for device types
- numeric, optional, vector (standard library)
- data.h for serving data types
- function_table.h for the FunctionTable type
I/O Contract
Constructor
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | max_num_tokens | int | Maximum number of draft token slots in the workspace pool |
| Input | vocab_size | int | Size of the vocabulary for probability distribution tensors |
| Input | hidden_size | int | Dimension of the hidden state vectors |
| Input | hidden_states_dtype | DLDataType | Data type for hidden state tensors (e.g., float16) |
| Input | device | DLDevice | Target device (CPU/GPU) for tensor allocation |
| Input | ft | const FunctionTable& | Function table for tensor operations |
AllocWorkspace
| Direction | Name | Type | Description |
|---|---|---|---|
| Input/Output | workspace | ModelWorkspace* | Workspace object to be populated with allocated tensors |
| Input | require_hidden_states | bool | Whether to also allocate hidden state workspace |
AllocSlots
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | num_slots | int | Number of slots to allocate from the free pool |
| Input | initial_ref_count | const std::vector<int>& | (optional overload) Initial reference count per slot |
| Output | result | std::vector<int>* | Output vector populated with allocated slot indices |
FreeSlots
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | slots | const std::vector<int>& | Slot indices to release back to the free pool |
Usage Examples
Creating a workspace manager:
#include "serve/draft_token_workspace_manager.h"
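// device (DLDevice) and function_table (FunctionTable) are assumed to be
// initialized elsewhere by the engine.
DraftTokenWorkspaceManager manager(
    /*max_num_tokens=*/256,
    /*vocab_size=*/32000,
    /*hidden_size=*/4096,
    /*hidden_states_dtype=*/DLDataType{kDLFloat, 16, 1},
    /*device=*/device,
    /*ft=*/function_table
);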
Allocating and freeing slots during speculative decoding:
// Allocate workspace tensors
manager->AllocWorkspace(&model_workspace, /*require_hidden_states=*/true);
// Allocate slots for draft tokens
std::vector<int> slots;
manager->AllocSlots(/*num_slots=*/5, &slots);
// slots now contains 5 slot indices
// After verification, free the slots
manager->FreeSlots(slots);
Allocating with reference counts:
std::vector<int> ref_counts = {1, 2, 1, 1, 3};
std::vector<int> slots;
manager->AllocSlots(5, ref_counts, &slots);
// ref_counts[i] is the initial reference count of slots[i]:
// slots[1] starts with ref_count=2, slots[4] with ref_count=3, and so on.
Related Pages
- Mlc_ai_Mlc_llm_Engine_Action - Engine actions that use the workspace manager for Eagle speculative decoding
- Mlc_ai_Mlc_llm_Engine_Interface - The engine that orchestrates speculative decoding
- Mlc_ai_Mlc_llm_Serve_Data_Header - Data types used alongside workspace management