Implementation:Mlc ai Mlc llm Draft Token Workspace

From Leeroopedia


Knowledge Sources
Domains LLM Serving, Speculative Decoding, Memory Management
Last Updated 2026-02-09 19:00 GMT

Overview

The Draft Token Workspace Manager is a memory management component for speculative decoding in the MLC LLM serving engine. It manages a pool of workspace slots that store intermediate states associated with draft tokens during speculative inference, including probability distributions and hidden states.

Description

This header file (cpp/serve/draft_token_workspace_manager.h) defines the DraftTokenWorkspaceManagerObj class that manages workspace memory for draft token generation in speculative decoding scenarios. In speculative decoding, a smaller "draft" model proposes multiple candidate tokens, and each proposed token has associated state (probability distributions, hidden states) that must be stored temporarily.

The workspace manager provides:

  • Slot-based allocation: Maintains a pool of free_slots_ that can be allocated and released. Each slot corresponds to storage for one draft token's associated state.
  • Reference counting: Tracks slot usage through ref_count_, allowing slots to be shared across multiple consumers and freed only when all references are released.
  • Workspace allocation: Allocates the underlying tensors (probability distributions and optionally hidden states) and stores them in the ModelWorkspace data structure.
  • Two AllocSlots overloads: One allocates with a default reference count and another accepts explicit initial reference counts per slot.
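The slot bookkeeping described in this list can be sketched as a small standalone class. This is a minimal model, not the MLC LLM implementation: SlotPool and NumFree are hypothetical names, the default reference count is assumed to be 1, and slots are assumed to be handed out from the back of the free list.

```cpp
#include <cassert>
#include <numeric>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the slot bookkeeping; not the MLC LLM class.
class SlotPool {
 public:
  explicit SlotPool(int max_num_tokens) : free_slots_(max_num_tokens) {
    // Fill the free list with 0, 1, ..., max_num_tokens - 1.
    std::iota(free_slots_.begin(), free_slots_.end(), 0);
  }

  // Take num_slots slots, each starting with a reference count of 1
  // (the default of 1 is an assumption made for this sketch).
  void AllocSlots(int num_slots, std::vector<int>* result) {
    assert(num_slots >= 0);
    AllocSlots(num_slots, std::vector<int>(num_slots, 1), result);
  }

  // Take num_slots slots with an explicit starting reference count per slot.
  void AllocSlots(int num_slots, const std::vector<int>& initial_ref_count,
                  std::vector<int>* result) {
    assert(num_slots >= 0 && num_slots <= static_cast<int>(free_slots_.size()));
    assert(static_cast<int>(initial_ref_count.size()) == num_slots);
    // Hand out the last num_slots entries of the free list.
    result->assign(free_slots_.end() - num_slots, free_slots_.end());
    free_slots_.resize(free_slots_.size() - num_slots);
    for (int i = 0; i < num_slots; ++i) {
      ref_count_[(*result)[i]] = initial_ref_count[i];
    }
  }

  // Drop one reference per listed slot; recycle slots whose count reaches zero.
  void FreeSlots(const std::vector<int>& slots) {
    for (int slot : slots) {
      auto it = ref_count_.find(slot);
      assert(it != ref_count_.end());
      if (--it->second == 0) {
        free_slots_.push_back(slot);
        ref_count_.erase(it);
      }
    }
  }

  // Number of slots currently available for allocation.
  int NumFree() const { return static_cast<int>(free_slots_.size()); }

 private:
  std::vector<int> free_slots_;
  std::unordered_map<int, int> ref_count_;
};
```

In this model, a slot allocated with a reference count of 2 stays out of the free list after the first FreeSlots call and returns to it only after the second.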

The class stores configuration parameters including:

  • max_num_tokens_ -- maximum number of draft tokens in the pool
  • vocab_size_ -- vocabulary size (for probability distribution tensors)
  • hidden_size_ -- hidden state dimension
  • hidden_states_dtype_ -- data type for hidden state tensors
  • device_ -- the target device
  • ft_ -- reference to the function table for tensor operations
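To get a feel for how these parameters drive memory use, the following rough estimate computes the size of the two workspace tensors. It is a sketch under stated assumptions: EstimateWorkspaceBytes and WorkspaceBytes are hypothetical names, probabilities are assumed to be stored as 32-bit floats, and each tensor is assumed to hold one row per slot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: byte counts for the two workspace tensors.
struct WorkspaceBytes {
  std::int64_t probs;          // [max_num_tokens, vocab_size], assumed float32
  std::int64_t hidden_states;  // [max_num_tokens, hidden_size], 0 if not required
};

// Hypothetical helper; the float32 probability dtype is an assumption.
WorkspaceBytes EstimateWorkspaceBytes(std::int64_t max_num_tokens,
                                      std::int64_t vocab_size,
                                      std::int64_t hidden_size,
                                      int hidden_state_bits,
                                      bool require_hidden_states) {
  assert(max_num_tokens >= 0 && vocab_size >= 0 && hidden_size >= 0);
  assert(hidden_state_bits > 0 && hidden_state_bits % 8 == 0);
  WorkspaceBytes bytes{};
  // One probability row per slot, 4 bytes per float32 entry.
  bytes.probs = max_num_tokens * vocab_size * 4;
  // One hidden-state row per slot, only when hidden states are required.
  bytes.hidden_states =
      require_hidden_states ? max_num_tokens * hidden_size * (hidden_state_bits / 8) : 0;
  return bytes;
}
```

With max_num_tokens = 256, vocab_size = 32000, hidden_size = 4096, and float16 hidden states, this gives 32,768,000 bytes (about 31 MiB) for probabilities and 2,097,152 bytes (2 MiB) for hidden states.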

Usage

The draft token workspace manager is created during engine initialization when speculative decoding is enabled (i.e., when there are multiple models). It is used by the Eagle-style speculative decoding actions:

  1. During EagleNewRequestPrefill and EagleBatchDraft, slots are allocated for draft token states.
  2. During EagleBatchVerify, slots are consumed and freed after verification.
  3. The workspace tensors are allocated once via AllocWorkspace and then reused across steps and batches; only slots are allocated and freed as decoding proceeds.
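The role of a slot index in this flow can be illustrated with a toy workspace in which each slot owns one row of a probability buffer. This is a sketch only: ProbWorkspace, Scatter, and Gather are hypothetical names, and the row-major [max_num_tokens, vocab_size] layout on the host is an assumption made for illustration, not a statement about how the engine stores its tensors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major buffer of shape [max_num_tokens, vocab_size].
class ProbWorkspace {
 public:
  ProbWorkspace(int max_num_tokens, int vocab_size)
      : max_num_tokens_(max_num_tokens),
        vocab_size_(vocab_size),
        data_(static_cast<std::size_t>(max_num_tokens) * vocab_size, 0.0f) {}

  // Draft phase: copy one token's distribution into the row owned by `slot`.
  void Scatter(int slot, const std::vector<float>& probs) {
    assert(slot >= 0 && slot < max_num_tokens_);
    assert(static_cast<int>(probs.size()) == vocab_size_);
    std::copy(probs.begin(), probs.end(),
              data_.begin() + static_cast<std::size_t>(slot) * vocab_size_);
  }

  // Verify phase: read back the rows for the given slots, in the given order.
  std::vector<std::vector<float>> Gather(const std::vector<int>& slots) const {
    std::vector<std::vector<float>> rows;
    rows.reserve(slots.size());
    for (int slot : slots) {
      assert(slot >= 0 && slot < max_num_tokens_);
      auto begin = data_.begin() + static_cast<std::size_t>(slot) * vocab_size_;
      rows.emplace_back(begin, begin + vocab_size_);
    }
    return rows;
  }

 private:
  int max_num_tokens_;
  int vocab_size_;
  std::vector<float> data_;
};
```

The point of the design is that the buffer itself never moves or resizes; the draft and verify phases exchange only small integer slot indices, and freeing a slot simply makes its row available for a later draft token.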

Code Reference

Source Location

Property | Value
File | cpp/serve/draft_token_workspace_manager.h
Namespace | mlc::llm::serve
Lines | 115
Include Guard | MLC_LLM_SERVE_DRAFT_TOKEN_WORKSPACE_MANAGER_H_

Signature

namespace mlc {
namespace llm {
namespace serve {

class DraftTokenWorkspaceManagerObj : public Object {
 public:
  DraftTokenWorkspaceManagerObj(int max_num_tokens, int vocab_size, int hidden_size,
                                DLDataType hidden_states_dtype, DLDevice device,
                                const FunctionTable& ft);

  void AllocWorkspace(ModelWorkspace* workspace, bool require_hidden_states);

  void AllocSlots(int num_slots, std::vector<int>* result);

  void AllocSlots(int num_slots, const std::vector<int>& initial_ref_count,
                  std::vector<int>* result);

  void FreeSlots(const std::vector<int>& slots);

 private:
  std::vector<int> free_slots_;
  int max_num_tokens_;
  int vocab_size_;
  int hidden_size_;
  DataType hidden_states_dtype_;
  DLDevice device_;
  const FunctionTable& ft_;
  std::unordered_map<int, int> ref_count_;
};

class DraftTokenWorkspaceManager : public ObjectRef {
 public:
  DraftTokenWorkspaceManager(int max_num_tokens, int vocab_size, int hidden_size,
                             DLDataType hidden_states_dtype, DLDevice device,
                             const FunctionTable& ft);
};

}  // namespace serve
}  // namespace llm
}  // namespace mlc

Import

#include "serve/draft_token_workspace_manager.h"

Dependencies:

  • tvm/ffi/reflection/registry.h for TVM reflection macros
  • tvm/runtime/device_api.h for device types
  • numeric, optional, vector (standard library)
  • data.h for serving data types
  • function_table.h for the FunctionTable type

I/O Contract

Constructor

Direction | Name | Type | Description
Input | max_num_tokens | int | Maximum number of draft token slots in the workspace pool
Input | vocab_size | int | Size of the vocabulary for probability distribution tensors
Input | hidden_size | int | Dimension of the hidden state vectors
Input | hidden_states_dtype | DLDataType | Data type for hidden state tensors (e.g., float16)
Input | device | DLDevice | Target device (CPU/GPU) for tensor allocation
Input | ft | const FunctionTable& | Function table for tensor operations

AllocWorkspace

Direction | Name | Type | Description
Input/Output | workspace | ModelWorkspace* | Workspace object to be populated with allocated tensors
Input | require_hidden_states | bool | Whether to also allocate hidden state workspace

AllocSlots

Direction | Name | Type | Description
Input | num_slots | int | Number of slots to allocate from the free pool
Input | initial_ref_count | const std::vector<int>& | Initial reference count per slot (three-argument overload only)
Output | result | std::vector<int>* | Output vector populated with allocated slot indices

FreeSlots

Direction | Name | Type | Description
Input | slots | const std::vector<int>& | Slot indices to release back to the free pool

Usage Examples

Creating a workspace manager:

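#include "serve/draft_token_workspace_manager.h"

// `device` (DLDevice) and `function_table` (FunctionTable) are assumed to
// have been set up earlier, e.g. during engine initialization.
DraftTokenWorkspaceManager manager(
    /*max_num_tokens=*/256,
    /*vocab_size=*/32000,
    /*hidden_size=*/4096,
    /*hidden_states_dtype=*/DLDataType{kDLFloat, 16, 1},
    /*device=*/device,
    /*ft=*/function_table
);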

Allocating and freeing slots during speculative decoding:

// Allocate workspace tensors
manager->AllocWorkspace(&model_workspace, /*require_hidden_states=*/true);

// Allocate slots for draft tokens
std::vector<int> slots;
manager->AllocSlots(/*num_slots=*/5, &slots);
// slots now contains 5 slot indices

// After verification, free the slots
manager->FreeSlots(slots);

Allocating with reference counts:

std::vector<int> ref_counts = {1, 2, 1, 1, 3};
std::vector<int> slots;
manager->AllocSlots(5, ref_counts, &slots);
// slots[i] starts with reference count ref_counts[i]:
// slots[1] has ref_count=2, slots[4] has ref_count=3, and the rest have ref_count=1.
