

Implementation:Mlc ai Mlc llm Sampler Header

From Leeroopedia
Revision as of 15:52, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mlc_ai_Mlc_llm_Sampler_Header.md)


Overview

The file cpp/serve/sampler/sampler.h declares the Sampler abstraction for the MLC-LLM serving engine. The Sampler is responsible for drawing tokens from probability distributions produced by the logit processor. It supports both CPU and GPU sampling backends, batch operations for efficient multi-request processing, and speculative decoding verification. The design follows the TVM object system pattern with an abstract SamplerObj base class and a Sampler managed reference type.

File Location

cpp/serve/sampler/sampler.h

Dependencies

Header                       Purpose
tvm/ffi/string.h             TVM String type for request IDs
tvm/runtime/module.h         TVM runtime module and tensor types
../../base.h                 MLC-LLM base definitions
../../support/random.h       RandomGenerator for reproducible sampling
../data.h                    Data type definitions
../event_trace_recorder.h    Optional event tracing support
../model.h                   Model interface and FunctionTable
../request_state.h           SampleResult and RequestModelState

Namespace

All types are defined in mlc::llm::serve. The header imports tvm::Device and the full tvm::runtime namespace.

Class: SamplerObj

Abstract base class inheriting from tvm::runtime::Object, defining four pure virtual methods for batch sampling operations.

Method: BatchRenormalizeProbsByTopP

virtual Tensor BatchRenormalizeProbsByTopP(Tensor probs_on_device,
                                           const std::vector<int>& sample_indices,
                                           const Array<String>& request_ids,
                                           const Array<GenerationConfig>& generation_cfg) = 0;

Renormalizes each probability distribution by applying top-p (nucleus) filtering. The sample_indices parameter maps output positions to rows in the probability tensor, so output i need not correspond to row i. The mapping is stated in terms of the sampling step that follows renormalization:

result[i] = sample_from(probs_on_device[sample_indices[i], :], generation_cfg[i])

Returns the renormalized distributions, residing on GPU for GPU samplers or on host for CPU samplers.
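
To make the renormalization step concrete, here is a minimal CPU sketch of top-p filtering for a single distribution. It is not the MLC-LLM implementation: the function name RenormalizeByTopP and the use of std::vector<double> in place of Tensor are assumptions made for illustration, and the sketch uses the common "smallest prefix whose cumulative mass reaches top_p" definition of the nucleus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of sampler.h: keep the smallest set of
// highest-probability tokens whose cumulative mass reaches top_p, zero out
// the rest, and rescale the survivors so they sum to 1.
std::vector<double> RenormalizeByTopP(const std::vector<double>& probs, double top_p) {
  assert(top_p > 0.0 && top_p <= 1.0);

  // Token ids ordered by descending probability (ties keep id order).
  std::vector<std::size_t> order(probs.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&probs](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });

  std::vector<double> out(probs.size(), 0.0);
  double kept_mass = 0.0;
  for (std::size_t id : order) {
    out[id] = probs[id];
    kept_mass += probs[id];
    if (kept_mass >= top_p) break;  // the nucleus is complete
  }

  // Rescale the surviving entries so they form a distribution again.
  if (kept_mass > 0.0) {
    for (double& p : out) p /= kept_mass;
  }
  return out;
}
```

For example, with probabilities {0.5, 0.3, 0.2} and top_p = 0.7, the first two tokens are kept (their mass is about 0.8) and the third is zeroed, giving roughly {0.625, 0.375, 0}.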

Method: BatchSampleTokensWithProbBeforeTopP

virtual std::vector<SampleResult> BatchSampleTokensWithProbBeforeTopP(
    Tensor probs_on_device,
    const std::vector<int>& sample_indices,
    const Array<String>& request_ids,
    const Array<GenerationConfig>& generation_cfg,
    const std::vector<RandomGenerator*>& rngs) = 0;

Samples tokens from probability distributions that have not yet been filtered by top-p. This method internally applies top-p before sampling. Each sequence uses its own RandomGenerator for reproducibility.

Returns a vector of SampleResult objects containing the sampled token IDs and associated probability information.

Method: BatchSampleTokensWithProbAfterTopP

virtual std::vector<SampleResult> BatchSampleTokensWithProbAfterTopP(
    Tensor probs,
    const std::vector<int>& sample_indices,
    const Array<String>& request_ids,
    const Array<GenerationConfig>& generation_cfg,
    const std::vector<RandomGenerator*>& rngs) = 0;

Samples tokens from probability distributions that have already been filtered by top-p. The input tensor may reside on GPU or host depending on the sampler type.
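
The relation result[i] = sample_from(probs[sample_indices[i], :], ...) can be sketched on the host with inverse-CDF sampling. This is an illustration under stated assumptions, not the engine's code: SampleFromProb and BatchSample are hypothetical names, rows are plain std::vector<double>, and the uniform draws in [0, 1) are passed in explicitly where the engine would obtain them from each request's RandomGenerator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the first token id whose cumulative
// probability exceeds the uniform draw u in [0, 1).
int SampleFromProb(const std::vector<double>& probs, double u) {
  assert(!probs.empty() && u >= 0.0 && u < 1.0);
  double cum = 0.0;
  for (std::size_t id = 0; id < probs.size(); ++id) {
    cum += probs[id];
    if (u < cum) return static_cast<int>(id);
  }
  // Rounding can leave the total slightly below 1; fall back to the last
  // token that has non-zero probability.
  for (std::size_t id = probs.size(); id-- > 0;) {
    if (probs[id] > 0.0) return static_cast<int>(id);
  }
  return static_cast<int>(probs.size()) - 1;
}

// Output i is drawn from row sample_indices[i] using the i-th uniform draw,
// so the same row may be sampled more than once.
std::vector<int> BatchSample(const std::vector<std::vector<double>>& probs,
                             const std::vector<int>& sample_indices,
                             const std::vector<double>& uniform_draws) {
  assert(sample_indices.size() == uniform_draws.size());
  std::vector<int> result;
  result.reserve(sample_indices.size());
  for (std::size_t i = 0; i < sample_indices.size(); ++i) {
    const std::vector<double>& row = probs.at(static_cast<std::size_t>(sample_indices[i]));
    result.push_back(SampleFromProb(row, uniform_draws[i]));
  }
  return result;
}
```

Passing the draws explicitly keeps the sketch deterministic: the same row and the same draw always yield the same token, which is the property that per-request random generators give the engine.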

Method: BatchVerifyDraftTokensWithProbAfterTopP

virtual std::pair<std::vector<std::vector<SampleResult>>, std::vector<int>>
BatchVerifyDraftTokensWithProbAfterTopP(
    Tensor probs, const Array<String>& request_ids,
    const std::vector<int>& cum_verify_lengths,
    const Array<GenerationConfig>& generation_cfg,
    const std::vector<RandomGenerator*>& rngs,
    const std::vector<std::vector<SampleResult>>& draft_output_tokens,
    const std::vector<int64_t>& token_tree_parent_ptr,
    Tensor draft_probs_on_device) = 0;

Verifies draft tokens from speculative decoding against the large model's probability distributions. This is the core of the speculative decoding acceptance/rejection step. Parameters include:

Parameter                Description
probs                    Large model's probability distributions (post-top-p)
cum_verify_lengths       Cumulative lengths of draft sequences to verify
draft_output_tokens      Draft tokens generated by the small model
token_tree_parent_ptr    Parent pointers defining the draft token tree structure
draft_probs_on_device    Small model's probability distributions for rejection sampling

Returns:

  1. A vector of accepted token lists per request
  2. A vector of indices identifying the last accepted tree node per request
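
The acceptance/rejection step can be illustrated with the standard speculative sampling rule for a single linear chain of draft tokens: accept a draft token x with probability min(1, p(x)/q(x)), where p is the large model's distribution and q the small model's, and on the first rejection resample from the normalized residual max(p - q, 0). The sketch below is an assumption-laden illustration, not the MLC-LLM implementation. It ignores the token tree described by token_tree_parent_ptr, handles one request, uses hypothetical names (VerifyDraftChain, VerifyResult), and takes its uniform draws as explicit arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Dist = std::vector<double>;

// Inverse-CDF sampling with an explicit uniform draw u in [0, 1).
int SampleFromProb(const Dist& probs, double u) {
  double cum = 0.0;
  for (std::size_t id = 0; id < probs.size(); ++id) {
    cum += probs[id];
    if (u < cum) return static_cast<int>(id);
  }
  return static_cast<int>(probs.size()) - 1;
}

struct VerifyResult {
  std::vector<int> tokens;  // accepted draft tokens followed by one final token
  int num_accepted = 0;     // number of draft tokens that were accepted
};

// target_probs has one row per draft position plus one extra row that is
// used only when every draft token is accepted.
VerifyResult VerifyDraftChain(const std::vector<Dist>& target_probs,
                              const std::vector<Dist>& draft_probs,
                              const std::vector<int>& draft_tokens,
                              const std::vector<double>& accept_draws,
                              double final_draw) {
  const std::size_t n = draft_tokens.size();
  assert(target_probs.size() == n + 1 && draft_probs.size() == n && accept_draws.size() == n);

  VerifyResult result;
  for (std::size_t t = 0; t < n; ++t) {
    const std::size_t x = static_cast<std::size_t>(draft_tokens[t]);
    const double p = target_probs[t][x];
    const double q = draft_probs[t][x];
    // Accept with probability min(1, p / q), written without the division.
    if (q > 0.0 && accept_draws[t] * q < p) {
      result.tokens.push_back(draft_tokens[t]);
      ++result.num_accepted;
      continue;
    }
    // Rejected: resample from the normalized residual max(p - q, 0) and stop.
    Dist residual(target_probs[t].size());
    double mass = 0.0;
    for (std::size_t i = 0; i < residual.size(); ++i) {
      residual[i] = std::max(target_probs[t][i] - draft_probs[t][i], 0.0);
      mass += residual[i];
    }
    if (mass > 0.0) {
      for (double& r : residual) r /= mass;
    } else {
      residual = target_probs[t];  // degenerate case: fall back to the target row
    }
    result.tokens.push_back(SampleFromProb(residual, final_draw));
    return result;
  }
  // Every draft token was accepted: draw one more token from the extra row.
  result.tokens.push_back(SampleFromProb(target_probs[n], final_draw));
  return result;
}
```

In this chain-only sketch, num_accepted plays a role loosely analogous to the "last accepted tree node" index returned by the real method; generalizing to a token tree requires following parent pointers, which is outside the scope of the sketch.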

TVM Object Registration

static constexpr const bool _type_has_method_sequal_reduce = false;
static constexpr const bool _type_has_method_shash_reduce = false;
static constexpr const bool _type_mutable = true;
TVM_FFI_DECLARE_OBJECT_INFO("mlc.serve.Sampler", SamplerObj, Object);

Registered under type key "mlc.serve.Sampler". Marked as mutable; structural equality and hash are disabled.

Class: Sampler

The managed reference type for SamplerObj, providing factory methods and a device capability check.
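
The split between an abstract object class and a lightweight reference class with static factories can be sketched in plain C++. This is only an analogy: TVM uses its own reference-counted Object and ObjectRef machinery rather than std::shared_ptr, and every name below (ToySamplerObj, ToyCPUSamplerObj, ToySampler, Backend, CreateCPU) is hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Abstract "object" class: holds the behaviour behind a virtual interface.
class ToySamplerObj {
 public:
  virtual ~ToySamplerObj() = default;
  virtual std::string Backend() const = 0;
};

// One concrete backend, created only through the factory below.
class ToyCPUSamplerObj final : public ToySamplerObj {
 public:
  std::string Backend() const override { return "cpu"; }
};

// "Reference" class: a cheap-to-copy handle that shares ownership of the
// underlying object and exposes it through operator->.
class ToySampler {
 public:
  static ToySampler CreateCPU() {
    return ToySampler(std::make_shared<ToyCPUSamplerObj>());
  }

  ToySamplerObj* operator->() const {
    assert(obj_ != nullptr);
    return obj_.get();
  }

 private:
  explicit ToySampler(std::shared_ptr<ToySamplerObj> obj) : obj_(std::move(obj)) {}
  std::shared_ptr<ToySamplerObj> obj_;
};
```

Copying a ToySampler copies the handle, not the object, so callers can pass samplers by value while all copies refer to the same backend instance.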

Factory: CreateCPUSampler

static Sampler CreateCPUSampler(Optional<EventTraceRecorder> trace_recorder);

Creates a CPU-based sampler. This is the fallback for devices that do not support GPU sampling.

Factory: CreateGPUSampler

static Sampler CreateGPUSampler(int max_num_sample, int vocab_size, FunctionTable* ft,
                                DLDevice device, Optional<EventTraceRecorder> trace_recorder);

Creates a GPU-based sampler with the given capacity and vocabulary size. The FunctionTable provides access to compiled GPU sampling kernels.

Parameter         Description
max_num_sample    Maximum number of samples in a single batch operation
vocab_size        Model vocabulary size
ft                Function table for GPU kernel dispatch
device            Target GPU device
trace_recorder    Optional event trace recorder

Static Method: SupportGPUSampler

static bool SupportGPUSampler(Device device) {
    return device.device_type == DLDeviceType::kDLCUDA ||
           device.device_type == DLDeviceType::kDLVulkan ||
           device.device_type == DLDeviceType::kDLMetal;
}

Returns true if the device supports GPU sampling. Supported backends are:

  • CUDA -- NVIDIA GPUs
  • Vulkan -- Cross-platform GPU compute
  • Metal -- Apple GPUs
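
A caller would typically use this check to choose between the two factories. The sketch below shows that decision in isolation, under loud assumptions: DeviceKind is a hypothetical stand-in for DLDeviceType, and SupportsGpuSampling, SamplerPlan, and PlanSampler are illustrative names that do not exist in MLC-LLM. Only the list of supported backends is taken from the text above.

```cpp
#include <cassert>

// Hypothetical stand-in for DLDeviceType; kOpenCL is included only as an
// example of a device type that is not in the supported list.
enum class DeviceKind { kCPU, kCUDA, kVulkan, kMetal, kOpenCL };

// Mirrors the documented check: CUDA, Vulkan, and Metal are supported.
bool SupportsGpuSampling(DeviceKind kind) {
  return kind == DeviceKind::kCUDA || kind == DeviceKind::kVulkan ||
         kind == DeviceKind::kMetal;
}

// Records which factory a caller would invoke and with what sizes.
struct SamplerPlan {
  bool use_gpu;
  int max_num_sample;
  int vocab_size;
};

// A GPU sampler needs its capacity and vocabulary size up front, so both
// are validated before the backend is chosen.
SamplerPlan PlanSampler(DeviceKind kind, int max_num_sample, int vocab_size) {
  assert(max_num_sample > 0 && vocab_size > 0);
  return SamplerPlan{SupportsGpuSampling(kind), max_num_sample, vocab_size};
}
```

When use_gpu is false, the caller would fall back to the CPU sampler factory, which takes no capacity or vocabulary arguments.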

Role in the Serving Pipeline

The Sampler operates after the LogitProcessor:

  1. The LogitProcessor produces probability distributions from raw logits.
  2. BatchRenormalizeProbsByTopP optionally renormalizes with top-p filtering.
  3. BatchSampleTokensWithProbBeforeTopP or BatchSampleTokensWithProbAfterTopP draws tokens.
  4. For speculative decoding, BatchVerifyDraftTokensWithProbAfterTopP accepts or rejects draft tokens.
  5. The sampled SampleResult tokens are committed to RequestModelState via CommitToken.
