Implementation:Mlc ai Mlc llm Logit Processor

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	LLM Serving, Sampling, GPU Computing
Last Updated	2026-02-09 19:00 GMT

Overview

LogitProcessor implements in-place logit transformations for the MLC LLM serving engine, applying logit bias, repetition/frequency/presence penalties, vocabulary bitmasks, and temperature-scaled softmax to produce sampling-ready probability distributions.

Description

logit_processor.cc implements the LogitProcessorImpl class in the mlc::llm::serve namespace. The class transforms raw model logits into probability distributions suitable for token sampling.

The constructor allocates paired CPU-GPU auxiliary tensors for the intermediate data used during logit processing: sequence IDs, position-to-sequence ID mappings, token IDs, token counts, logit biases, penalties, bitmasks, and temperatures. Each tensor exists on both the host (CPU) and the device (GPU) to support a copy-then-compute pattern. On CUDA/ROCm devices, a dedicated copy stream is created so that host-to-device transfers can overlap with computation.

InplaceUpdateLogits is the main entry point. It applies three transformations to the logit tensor, in this order:

Logit Bias (UpdateWithLogitBias): For each generation config that specifies a logit bias map, the method constructs sparse arrays of position-to-sequence mappings, token IDs, and bias values. These are copied to the GPU and applied via the apply_logit_bias_inplace kernel.
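
The sparse layout described above can be illustrated on the CPU. This is a minimal sketch, not the code from logit_processor.cc: the SparseLogitBias struct and both function names are hypothetical, the logits are assumed to be a row-major (num_rows, vocab_size) float buffer, and each row index stands in for the position-to-sequence mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical flattened form of the per-row logit bias maps.
struct SparseLogitBias {
  std::vector<std::int32_t> pos2seq_id;  // logits row that each entry targets
  std::vector<std::int32_t> token_ids;   // vocabulary index to adjust
  std::vector<float> bias;               // additive bias for that (row, token)
};

// Flatten one bias map per logits row into three parallel arrays.
// Rows with an empty map contribute no entries.
SparseLogitBias BuildSparseLogitBias(
    const std::vector<std::unordered_map<std::int32_t, float>>& bias_per_row) {
  SparseLogitBias out;
  for (std::size_t row = 0; row < bias_per_row.size(); ++row) {
    for (const auto& entry : bias_per_row[row]) {
      out.pos2seq_id.push_back(static_cast<std::int32_t>(row));
      out.token_ids.push_back(entry.first);
      out.bias.push_back(entry.second);
    }
  }
  return out;
}

// CPU stand-in for the GPU kernel: logits[row, token] += bias.
void ApplyLogitBiasInplace(std::vector<float>& logits, int vocab_size,
                           const SparseLogitBias& sparse) {
  assert(sparse.pos2seq_id.size() == sparse.token_ids.size());
  assert(sparse.token_ids.size() == sparse.bias.size());
  for (std::size_t i = 0; i < sparse.bias.size(); ++i) {
    const std::size_t offset =
        static_cast<std::size_t>(sparse.pos2seq_id[i]) *
            static_cast<std::size_t>(vocab_size) +
        static_cast<std::size_t>(sparse.token_ids[i]);
    assert(offset < logits.size());
    logits[offset] += sparse.bias[i];
  }
}
```

Because only the biased (row, token) pairs are stored, the amount of data copied to the device scales with the number of bias entries rather than with the vocabulary size.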

Penalties (UpdateWithPenalty): For sequences with non-default frequency penalty, presence penalty, or repetition penalty, the method constructs arrays of appeared token IDs, their counts, and the three penalty values per sequence. It handles draft tokens from speculative decoding by temporarily adding them to the appeared token tracking, processing, and then rolling back. The apply_penalty_inplace kernel applies these penalties on the GPU.
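
To make the three penalty values concrete, here is a CPU sketch of one common formulation. It is an assumption rather than a transcription of the apply_penalty_inplace kernel: presence and frequency penalties are subtracted first (frequency scaled by the token's count), then the repetition penalty divides positive logits and multiplies negative ones. The Penalty struct and function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-sequence penalty triple.
struct Penalty {
  float presence;    // flat amount subtracted once per appeared token
  float frequency;   // amount subtracted per occurrence of the token
  float repetition;  // multiplicative penalty; 1.0 means "no effect"
};

// Apply penalties to one logits row. Only tokens that already appeared
// (listed in token_ids, with their occurrence counts in token_cnt) change.
void ApplyPenaltyInplace(std::vector<float>& row,
                         const std::vector<std::int32_t>& token_ids,
                         const std::vector<std::int32_t>& token_cnt,
                         const Penalty& p) {
  assert(token_ids.size() == token_cnt.size());
  assert(p.repetition > 0.0f);
  for (std::size_t i = 0; i < token_ids.size(); ++i) {
    assert(token_ids[i] >= 0 &&
           static_cast<std::size_t>(token_ids[i]) < row.size());
    float& logit = row[static_cast<std::size_t>(token_ids[i])];
    // Assumed order: additive penalties first, then the repetition penalty.
    logit -= p.presence + p.frequency * static_cast<float>(token_cnt[i]);
    logit = logit > 0.0f ? logit / p.repetition : logit * p.repetition;
  }
}
```

With presence 0.5, frequency 0.25, and repetition 2.0, a token seen twice with logit 2.0 becomes (2.0 - 1.0) / 2.0 = 0.5, while tokens that never appeared keep their original logits.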

Vocabulary Mask (UpdateWithMask): For sequences that require a next-token bitmask (typically from grammar-guided generation), the method retrieves bitmasks from the grammar matcher, handling draft token sequences by temporarily accepting and then rolling back draft tokens. The bitmask is a packed 32-bit integer representation with ceil(vocab_size / 32) elements per sequence. The apply_bitmask_inplace kernel sets masked positions to the minimum logit value. This step is deliberately placed last because the resulting minimum values could cause numerical issues (underflow) if further subtractions were applied.
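
The packed layout can be sketched on the CPU as follows. Two details here are assumptions, not facts taken from the source: a set bit means the token is allowed, and bit t % 32 of word t / 32 (least significant bit first) corresponds to token t. The sketch also uses the lowest finite float as the "minimum logit value"; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Number of 32-bit words needed per sequence: ceil(vocab_size / 32).
inline int BitmaskWords(int vocab_size) { return (vocab_size + 31) / 32; }

// CPU stand-in for the bitmask kernel on a single logits row.
// Assumption: bit == 1 means "allowed"; disallowed tokens are set to the
// lowest finite float so they receive (effectively) zero probability.
void ApplyBitmaskInplace(std::vector<float>& row,
                         const std::vector<std::uint32_t>& bitmask) {
  const int vocab_size = static_cast<int>(row.size());
  assert(bitmask.size() ==
         static_cast<std::size_t>(BitmaskWords(vocab_size)));
  for (int t = 0; t < vocab_size; ++t) {
    const bool allowed = ((bitmask[t / 32] >> (t % 32)) & 1u) != 0u;
    if (!allowed) {
      row[t] = std::numeric_limits<float>::lowest();
    }
  }
}
```

For a 40-token vocabulary the mask occupies two words; the upper 24 bits of the second word are padding and are never read.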

ComputeProbsFromLogits converts the bias/penalty/mask-adjusted logits into probability distributions via temperature-scaled softmax. It constructs a temperature array from each sequence's generation config, copies it to the GPU, and calls the softmax kernel which takes logits of shape (n, 1, v) and temperatures of shape (n,) to produce probabilities of shape (n, v).
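
A CPU reference for temperature-scaled softmax might look like the sketch below. It is illustrative only: the function name and the kGreedyEps threshold are hypothetical, the logits are treated as a flat row-major (n, v) buffer, and a near-zero temperature is assumed to mean greedy decoding (all probability mass on the first maximum), which may differ from how the actual kernel treats that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical threshold below which a temperature is treated as greedy.
constexpr float kGreedyEps = 1e-5f;

// probs[r, t] = softmax(logits[r, :] / temperature[r])[t]
std::vector<float> SoftmaxWithTemperature(const std::vector<float>& logits,
                                          int vocab_size,
                                          const std::vector<float>& temperature) {
  const std::size_t n = temperature.size();
  assert(vocab_size > 0);
  assert(logits.size() == n * static_cast<std::size_t>(vocab_size));
  std::vector<float> probs(logits.size(), 0.0f);
  for (std::size_t r = 0; r < n; ++r) {
    const float* in = logits.data() + r * static_cast<std::size_t>(vocab_size);
    float* out = probs.data() + r * static_cast<std::size_t>(vocab_size);
    const float* max_it = std::max_element(in, in + vocab_size);
    if (temperature[r] < kGreedyEps) {
      out[max_it - in] = 1.0f;  // one-hot on the first maximum
      continue;
    }
    // Subtract the row maximum before exponentiating for numerical stability.
    float sum = 0.0f;
    for (int t = 0; t < vocab_size; ++t) {
      out[t] = std::exp((in[t] - *max_it) / temperature[r]);
      sum += out[t];
    }
    for (int t = 0; t < vocab_size; ++t) {
      out[t] /= sum;
    }
  }
  return probs;
}
```

Higher temperatures flatten the distribution: for logits (1, 3), a temperature of 2 halves the gap between them before the exponential is taken.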

All methods use the dual-stream pattern: data is constructed on CPU, copied to GPU via the copy stream, the copy stream is synchronized with the compute stream, and then GPU kernels are launched on the compute stream.

Usage

Use LogitProcessor within the serving engine after model inference produces raw logits and before sampling. It is created via Model::CreateLogitProcessor and is called for both prefill and decode steps.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: cpp/serve/logit_processor.cc
Lines: 1-506

Signature

class LogitProcessorImpl : public LogitProcessorObj {
public:
  explicit LogitProcessorImpl(int max_num_token, int vocab_size,
                              FunctionTable* ft, DLDevice device,
                              Optional<EventTraceRecorder> trace_recorder);
  ~LogitProcessorImpl();

  void InplaceUpdateLogits(
      Tensor logits,
      const Array<GenerationConfig>& generation_cfg,
      const Array<RequestModelState>& mstates,
      const Array<String>& request_ids,
      const std::vector<int>* cum_num_token,
      const Array<RequestModelState>* draft_mstates,
      const std::vector<std::vector<int>>* draft_token_indices) final;

  Tensor ComputeProbsFromLogits(
      Tensor logits,
      const Array<GenerationConfig>& generation_cfg,
      const Array<String>& request_ids,
      const std::vector<int>* cum_num_token) final;
};

// Constructor via the LogitProcessor wrapper
LogitProcessor::LogitProcessor(int max_num_token, int vocab_size,
                               FunctionTable* ft, DLDevice device,
                               Optional<EventTraceRecorder> trace_recorder);

Import

#include "logit_processor.h"

I/O Contract

Inputs

Name	Type	Required	Description
logits	Tensor	Yes	Raw model output logits of shape (num_tokens, vocab_size), float32
generation_cfg	Array<GenerationConfig>	Yes	Per-sequence generation configs with temperature, penalties, logit bias
mstates	Array<RequestModelState>	Yes	Per-sequence model states tracking appeared tokens and grammar matchers
request_ids	Array<String>	Yes	Request IDs for event tracing
cum_num_token	const std::vector<int>*	No	Cumulative token counts for multi-token sequences (nullptr for single-token)
draft_mstates	const Array<RequestModelState>*	No	Draft model states for speculative decoding (nullptr if not used)
draft_token_indices	const std::vector<std::vector<int>>*	No	Draft token tree indices for speculative decoding (nullptr if not used)
max_num_token	int	Yes (constructor)	Maximum number of tokens supported in a single batch
vocab_size	int	Yes (constructor)	Vocabulary size of the model
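
One way to read cum_num_token is as a prefix sum over per-sequence token counts. The sketch below builds a row-to-sequence mapping from it under two assumptions that the table does not confirm: the vector has num_sequences + 1 entries starting at 0, and nullptr means each sequence owns exactly one logits row. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map every logits row to the index of the sequence that owns it.
std::vector<int> BuildRowToSeq(int num_sequences,
                               const std::vector<int>* cum_num_token) {
  std::vector<int> row_to_seq;
  if (cum_num_token == nullptr) {
    // Single-token case: row i belongs to sequence i.
    for (int i = 0; i < num_sequences; ++i) {
      row_to_seq.push_back(i);
    }
    return row_to_seq;
  }
  // Assumed layout: cum_num_token = {0, c_1, c_2, ..., c_n}, non-decreasing.
  assert(cum_num_token->size() ==
         static_cast<std::size_t>(num_sequences) + 1);
  for (int i = 0; i < num_sequences; ++i) {
    const int begin = (*cum_num_token)[static_cast<std::size_t>(i)];
    const int end = (*cum_num_token)[static_cast<std::size_t>(i) + 1];
    assert(begin <= end);
    for (int row = begin; row < end; ++row) {
      row_to_seq.push_back(i);
    }
  }
  return row_to_seq;
}
```

With three sequences and cum_num_token = {0, 2, 3, 6}, rows 0-1 map to sequence 0, row 2 to sequence 1, and rows 3-5 to sequence 2.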

Outputs

Name	Type	Description
logits (modified in-place)	Tensor	The input logits tensor, modified in-place by InplaceUpdateLogits with the applied biases, penalties, and masks
probs	Tensor	Probability distribution of shape (num_tokens, vocab_size), returned by ComputeProbsFromLogits after temperature-scaled softmax

Usage Examples

// Create a logit processor
LogitProcessor logit_processor = model->CreateLogitProcessor(
    max_num_token, trace_recorder);

// After model forward pass produces logits:
// 1. Apply logit bias, penalties, and masks in-place
logit_processor->InplaceUpdateLogits(
    logits, generation_configs, mstates, request_ids,
    &cum_num_token, nullptr, nullptr);

// 2. Convert to probability distribution
Tensor probs = logit_processor->ComputeProbsFromLogits(
    logits, generation_configs, request_ids, &cum_num_token);

// 3. Feed probs to the sampler (remaining arguments elided)
SampleResult result = sampler->BatchSampleTokens(probs, ...);
