Implementation:Mlc ai Mlc llm Draft Token Workspace

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	LLM Serving, Speculative Decoding, Memory Management
Last Updated	2026-02-09 19:00 GMT

Overview

The Draft Token Workspace Manager is a memory management component for speculative decoding in the MLC LLM serving engine. It manages a pool of workspace slots that store intermediate states associated with draft tokens during speculative inference, including probability distributions and hidden states.

Description

This header file (cpp/serve/draft_token_workspace_manager.h) defines the DraftTokenWorkspaceManagerObj class that manages workspace memory for draft token generation in speculative decoding scenarios. In speculative decoding, a smaller "draft" model proposes multiple candidate tokens, and each proposed token has associated state (probability distributions, hidden states) that must be stored temporarily.

The workspace manager provides:

Slot-based allocation: Maintains a pool of free_slots_ that can be allocated and released. Each slot corresponds to storage for one draft token's associated state.
Reference counting: Tracks slot usage through ref_count_, allowing slots to be shared across multiple consumers and freed only when all references are released.
Workspace allocation: Allocates the underlying tensors (probability distributions and optionally hidden states) and stores them in the ModelWorkspace data structure.
Two AllocSlots overloads: One allocates with a default reference count and another accepts explicit initial reference counts per slot.
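
The slot and reference-count bookkeeping described above can be sketched as a small standalone class. This is a minimal illustration, not the MLC LLM implementation: SlotPool and NumFree are hypothetical names, the default initial reference count of 1 is an assumption, and no tensors are involved.

```cpp
#include <cassert>
#include <numeric>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the slot bookkeeping; not the MLC LLM class.
class SlotPool {
 public:
  explicit SlotPool(int max_num_tokens) : free_slots_(max_num_tokens) {
    // Initially every slot index 0..max_num_tokens-1 is free.
    std::iota(free_slots_.begin(), free_slots_.end(), 0);
  }

  // Allocate num_slots slots; each starts with a reference count of 1
  // (assumed default, not confirmed by the header).
  void AllocSlots(int num_slots, std::vector<int>* result) {
    AllocSlots(num_slots, std::vector<int>(num_slots, 1), result);
  }

  // Allocate num_slots slots with an explicit initial reference count per slot.
  void AllocSlots(int num_slots, const std::vector<int>& initial_ref_count,
                  std::vector<int>* result) {
    assert(num_slots >= 0 && num_slots <= static_cast<int>(free_slots_.size()));
    assert(static_cast<int>(initial_ref_count.size()) == num_slots);
    result->clear();
    for (int i = 0; i < num_slots; ++i) {
      int slot = free_slots_.back();
      free_slots_.pop_back();
      ref_count_[slot] = initial_ref_count[i];
      result->push_back(slot);
    }
  }

  // Drop one reference per listed slot; a slot returns to the pool at zero.
  void FreeSlots(const std::vector<int>& slots) {
    for (int slot : slots) {
      auto it = ref_count_.find(slot);
      assert(it != ref_count_.end());
      if (--it->second == 0) {
        ref_count_.erase(it);
        free_slots_.push_back(slot);
      }
    }
  }

  int NumFree() const { return static_cast<int>(free_slots_.size()); }

 private:
  std::vector<int> free_slots_;
  std::unordered_map<int, int> ref_count_;
};
```

In this sketch, a slot allocated with an initial count of 2 stays out of the free pool until FreeSlots has been called on it twice, which is the behavior that lets several consumers share one slot.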

The class stores configuration parameters including:

max_num_tokens_ -- maximum number of draft tokens in the pool
vocab_size_ -- vocabulary size (for probability distribution tensors)
hidden_size_ -- hidden state dimension
hidden_states_dtype_ -- data type for hidden state tensors
device_ -- the target device
ft_ -- reference to the function table for tensor operations
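
To see how these parameters drive memory use, the following sketch estimates the size of the workspace tensors. It rests on assumptions not stated on this page: one float32 probability tensor of shape [max_num_tokens, vocab_size] and, when requested, one hidden-state tensor of shape [max_num_tokens, hidden_size]. TensorBytes, WorkspaceBytes, and EstimateWorkspaceBytes are hypothetical helpers, and the actual implementation may allocate additional buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes for one [rows, cols] tensor whose elements are
// `bits` wide with `lanes` lanes (mirroring the DLDataType fields).
inline std::size_t TensorBytes(std::int64_t rows, std::int64_t cols, int bits, int lanes = 1) {
  const std::size_t bytes_per_elem = static_cast<std::size_t>((bits * lanes + 7) / 8);
  return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols) * bytes_per_elem;
}

struct WorkspaceBytes {
  std::size_t probs = 0;
  std::size_t hidden_states = 0;
};

// Assumed layout: float32 probabilities of shape [max_num_tokens, vocab_size]
// and optional hidden states of shape [max_num_tokens, hidden_size].
inline WorkspaceBytes EstimateWorkspaceBytes(int max_num_tokens, int vocab_size, int hidden_size,
                                             int hidden_bits, bool require_hidden_states) {
  assert(max_num_tokens > 0 && vocab_size > 0 && hidden_size > 0 && hidden_bits > 0);
  WorkspaceBytes out;
  out.probs = TensorBytes(max_num_tokens, vocab_size, /*bits=*/32);
  if (require_hidden_states) {
    out.hidden_states = TensorBytes(max_num_tokens, hidden_size, hidden_bits);
  }
  return out;
}
```

Under these assumptions, 256 slots with a 32000-entry vocabulary and 4096-wide float16 hidden states come to 32,768,000 bytes of probabilities and 2,097,152 bytes of hidden states, so the probability tensor dominates.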

Usage

The draft token workspace manager is created during engine initialization when speculative decoding is enabled (i.e., when there are multiple models). It is used by the Eagle-style speculative decoding actions:

During EagleNewRequestPrefill and EagleBatchDraft, slots are allocated for draft token states.
During EagleBatchVerify, slots are consumed and freed after verification.
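The workspace tensors are allocated once through AllocWorkspace and reused across steps and across the batch; only slot indices are allocated and freed as draft tokens come and go.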
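
The draft-then-verify lifecycle of slots can be sketched as follows. This is a simplified model under stated assumptions: FreeList, DraftSlots, DraftStep, and VerifyAndRelease are hypothetical names, each request is assumed to take one slot per draft step, and reference counting is left out. The real Eagle actions do considerably more work around these calls.

```cpp
#include <cassert>
#include <numeric>
#include <unordered_map>
#include <vector>

// Hypothetical free list; the real manager also tracks reference counts.
struct FreeList {
  std::vector<int> free_slots;
  explicit FreeList(int n) : free_slots(n) {
    std::iota(free_slots.begin(), free_slots.end(), 0);
  }
  int Alloc() {
    assert(!free_slots.empty());
    int slot = free_slots.back();
    free_slots.pop_back();
    return slot;
  }
  void Free(int slot) { free_slots.push_back(slot); }
};

// Hypothetical record: request id -> slots holding that request's draft state.
using DraftSlots = std::unordered_map<int, std::vector<int>>;

// Draft phase: every request in the batch takes one new slot per draft step.
inline void DraftStep(FreeList* pool, const std::vector<int>& request_ids,
                      DraftSlots* draft_slots) {
  for (int id : request_ids) {
    (*draft_slots)[id].push_back(pool->Alloc());
  }
}

// Verify phase: after verification, all of a request's slots are released.
inline void VerifyAndRelease(FreeList* pool, const std::vector<int>& request_ids,
                             DraftSlots* draft_slots) {
  for (int id : request_ids) {
    for (int slot : (*draft_slots)[id]) pool->Free(slot);
    (*draft_slots)[id].clear();
  }
}
```

The point of the sketch is that the pool size bounds the number of draft tokens in flight: slots accumulate over draft steps and all return to the pool once the batch has been verified.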

Code Reference

Source Location

Property	Value
File	`cpp/serve/draft_token_workspace_manager.h`
Namespace	`mlc::llm::serve`
Lines	115
Include Guard	`MLC_LLM_SERVE_DRAFT_TOKEN_WORKSPACE_MANAGER_H_`

Signature

namespace mlc {
namespace llm {
namespace serve {

class DraftTokenWorkspaceManagerObj : public Object {
 public:
  DraftTokenWorkspaceManagerObj(int max_num_tokens, int vocab_size, int hidden_size,
                                DLDataType hidden_states_dtype, DLDevice device,
                                const FunctionTable& ft);

  void AllocWorkspace(ModelWorkspace* workspace, bool require_hidden_states);

  void AllocSlots(int num_slots, std::vector<int>* result);

  void AllocSlots(int num_slots, const std::vector<int>& initial_ref_count,
                  std::vector<int>* result);

  void FreeSlots(const std::vector<int>& slots);

 private:
  std::vector<int> free_slots_;
  int max_num_tokens_;
  int vocab_size_;
  int hidden_size_;
  DataType hidden_states_dtype_;
  DLDevice device_;
  const FunctionTable& ft_;
  std::unordered_map<int, int> ref_count_;
};

class DraftTokenWorkspaceManager : public ObjectRef {
 public:
  DraftTokenWorkspaceManager(int max_num_tokens, int vocab_size, int hidden_size,
                             DLDataType hidden_states_dtype, DLDevice device,
                             const FunctionTable& ft);
};

}  // namespace serve
}  // namespace llm
}  // namespace mlc

Import

#include "serve/draft_token_workspace_manager.h"

Dependencies:

tvm/ffi/reflection/registry.h for TVM reflection macros
tvm/runtime/device_api.h for device types
numeric, optional, vector (standard library)
data.h for serving data types
function_table.h for the FunctionTable type

I/O Contract

Constructor

Direction	Name	Type	Description
Input	max_num_tokens	`int`	Maximum number of draft token slots in the workspace pool
Input	vocab_size	`int`	Size of the vocabulary for probability distribution tensors
Input	hidden_size	`int`	Dimension of the hidden state vectors
Input	hidden_states_dtype	`DLDataType`	Data type for hidden state tensors (e.g., float16)
Input	device	`DLDevice`	Target device (CPU/GPU) for tensor allocation
Input	ft	`const FunctionTable&`	Function table for tensor operations

AllocWorkspace

Direction	Name	Type	Description
Input/Output	workspace	`ModelWorkspace*`	Workspace object to be populated with allocated tensors
Input	require_hidden_states	`bool`	Whether to also allocate hidden state workspace

AllocSlots

Direction	Name	Type	Description
Input	num_slots	`int`	Number of slots to allocate from the free pool
Input	initial_ref_count	`const std::vector<int>&`	Second overload only: initial reference count for each allocated slot
Output	result	`std::vector<int>*`	Output vector populated with allocated slot indices

FreeSlots

Direction	Name	Type	Description
Input	slots	`const std::vector<int>&`	Slot indices to release back to the free pool

Usage Examples

Creating a workspace manager:

#include "serve/draft_token_workspace_manager.h"

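// `device` (DLDevice) and `function_table` (FunctionTable) are assumed to be
// provided by the surrounding engine setup code.
DraftTokenWorkspaceManager manager(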
    /*max_num_tokens=*/256,
    /*vocab_size=*/32000,
    /*hidden_size=*/4096,
    /*hidden_states_dtype=*/DLDataType{kDLFloat, 16, 1},
    /*device=*/device,
    /*ft=*/function_table
);

Allocating and freeing slots during speculative decoding:

// Allocate workspace tensors
manager->AllocWorkspace(&model_workspace, /*require_hidden_states=*/true);

// Allocate slots for draft tokens
std::vector<int> slots;
manager->AllocSlots(/*num_slots=*/5, &slots);
// slots now contains 5 slot indices

// After verification, free the slots
manager->FreeSlots(slots);

Allocating with reference counts:

std::vector<int> ref_counts = {1, 2, 1, 1, 3};
std::vector<int> slots;
manager->AllocSlots(5, ref_counts, &slots);
// ref_counts[i] applies to slots[i]: slots[1] starts with ref_count=2,
// slots[4] starts with ref_count=3, and the rest start with ref_count=1.

Related Pages

Mlc_ai_Mlc_llm_Engine_Action - Engine actions that use the workspace manager for Eagle speculative decoding
Mlc_ai_Mlc_llm_Engine_Interface - The engine that orchestrates speculative decoding
Mlc_ai_Mlc_llm_Serve_Data_Header - Data types used alongside workspace management
