Implementation:Mlc ai Mlc llm Logit Processor
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Sampling, GPU Computing |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
LogitProcessor implements in-place logit transformations for the MLC LLM serving engine, applying logit bias, repetition/frequency/presence penalties, vocabulary bitmasks, and temperature-scaled softmax to produce sampling-ready probability distributions.
Description
The logit_processor.cc file implements the LogitProcessorImpl class, which is the core component responsible for transforming raw model logits into probability distributions suitable for token sampling. It operates in the mlc::llm::serve namespace.
The constructor allocates paired CPU-GPU auxiliary tensors for all intermediate data structures needed during logit processing: sequence IDs, position-to-sequence ID mappings, token IDs, token counts, logit biases, penalties, bitmasks, and temperatures. Each tensor is allocated on both the host (CPU) and device (GPU) for the copy-then-compute pattern. On CUDA/ROCm devices, a dedicated copy stream is created to overlap data transfer with computation, reducing latency.
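The paired-allocation idea can be sketched in plain C++ as below. This is only an illustration under stated assumptions: `PairedBuffer` and `AuxBuffers` are hypothetical names, the "device" side is an ordinary `std::vector` standing in for GPU memory, and the copy is synchronous rather than issued on a separate copy stream.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: one host staging buffer paired with one "device" buffer.
// In this sketch the device side is just another std::vector.
template <typename T>
struct PairedBuffer {
  std::vector<T> host;    // filled by CPU code
  std::vector<T> device;  // stand-in for GPU memory

  explicit PairedBuffer(std::size_t capacity)
      : host(capacity), device(capacity) {}

  // Copy the first n staged elements from host to "device".
  void CopyToDevice(std::size_t n) {
    std::copy_n(host.begin(), n, device.begin());
  }
};

// Hypothetical bundle of auxiliary buffers, each sized once up front
// for the largest batch so no allocation happens per step.
struct AuxBuffers {
  PairedBuffer<int> seq_ids;
  PairedBuffer<float> temperatures;

  explicit AuxBuffers(std::size_t max_num_token)
      : seq_ids(max_num_token), temperatures(max_num_token) {}
};
```

Sizing every buffer once for `max_num_token` means each step only fills a prefix of the host buffer and copies that prefix across.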
InplaceUpdateLogits is the main entry point that applies three sequential transformations to the logit tensor:
- Logit Bias (UpdateWithLogitBias): For each generation config that specifies a logit bias map, the method constructs sparse arrays of position-to-sequence mappings, token IDs, and bias values. These are copied to the GPU and applied via the apply_logit_bias_inplace kernel.
- Penalties (UpdateWithPenalty): For sequences with non-default frequency penalty, presence penalty, or repetition penalty, the method constructs arrays of appeared token IDs, their counts, and the three penalty values per sequence. It handles draft tokens from speculative decoding by temporarily adding them to the appeared token tracking, processing, and then rolling back. The apply_penalty_inplace kernel applies these penalties on the GPU.
- Vocabulary Mask (UpdateWithMask): For sequences that require a next-token bitmask (typically from grammar-guided generation), the method retrieves bitmasks from the grammar matcher, handling draft token sequences by temporarily accepting and then rolling back draft tokens. The bitmask is a packed 32-bit integer representation with ceil(vocab_size / 32) elements per sequence. The apply_bitmask_inplace kernel sets masked positions to the minimum logit value. This step is deliberately placed last because the resulting minimum values could cause numerical issues (underflow) if further subtractions were applied.
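The three steps above can be sketched as a CPU reference in C++. This is not the GPU kernel code: the function names, the `Penalty` struct, the exact penalty formula (subtract presence plus frequency times count, then scale by the repetition penalty), and the bit order of the mask (least significant bit first, set bit means allowed) are all assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Logits are row-major: row r occupies [r * vocab_size, (r + 1) * vocab_size).

// Step 1 (assumed form): add a bias to selected (row, token) entries.
// The three arrays are parallel sparse lists, one entry per biased token.
void ApplyLogitBias(std::vector<float>& logits, int vocab_size,
                    const std::vector<int>& rows,
                    const std::vector<int>& token_ids,
                    const std::vector<float>& biases) {
  for (std::size_t i = 0; i < token_ids.size(); ++i) {
    logits[static_cast<std::size_t>(rows[i]) * vocab_size + token_ids[i]] +=
        biases[i];
  }
}

// Hypothetical container for the three per-sequence penalty values.
struct Penalty {
  float presence = 0.0f;
  float frequency = 0.0f;
  float repetition = 1.0f;
};

// Step 2 (assumed formula): penalize tokens that already appeared in one row.
void ApplyPenalty(std::vector<float>& logits, int vocab_size, int row,
                  const std::vector<int>& token_ids,
                  const std::vector<int>& token_counts, const Penalty& p) {
  for (std::size_t i = 0; i < token_ids.size(); ++i) {
    float& x =
        logits[static_cast<std::size_t>(row) * vocab_size + token_ids[i]];
    x -= p.presence + p.frequency * token_counts[i];
    x = (x < 0.0f) ? x * p.repetition : x / p.repetition;
  }
}

// Number of 32-bit words needed per sequence: ceil(vocab_size / 32).
int BitmaskWords(int vocab_size) { return (vocab_size + 31) / 32; }

// Step 3 (assumed bit layout): tokens whose bit is 0 get the lowest float.
void ApplyBitmask(std::vector<float>& logits, int vocab_size, int row,
                  const std::vector<std::uint32_t>& bitmask) {
  const float kMin = std::numeric_limits<float>::lowest();
  for (int v = 0; v < vocab_size; ++v) {
    const bool allowed = ((bitmask[v / 32] >> (v % 32)) & 1u) != 0;
    if (!allowed) {
      logits[static_cast<std::size_t>(row) * vocab_size + v] = kMin;
    }
  }
}
```

Running the mask last in this sketch mirrors the ordering rationale in the text: once an entry holds the lowest representable float, no later step subtracts from it.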
ComputeProbsFromLogits converts the bias/penalty/mask-adjusted logits into probability distributions via temperature-scaled softmax. It constructs a temperature array from each sequence's generation config, copies it to the GPU, and calls the softmax kernel which takes logits of shape (n, 1, v) and temperatures of shape (n,) to produce probabilities of shape (n, v).
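A CPU reference for temperature-scaled softmax might look like the sketch below. The function name, the flat row-major layout, and the handling of near-zero temperature (treated here as greedy, producing a one-hot row at the largest logit, with a hypothetical threshold `kEps`) are assumptions for illustration and are not taken from the source.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// logits: n rows of vocab_size floats, row-major. temperatures: one per row.
// Returns probabilities with the same layout as logits.
std::vector<float> SoftmaxWithTemperature(
    const std::vector<float>& logits, int vocab_size,
    const std::vector<float>& temperatures) {
  const float kEps = 1e-5f;  // hypothetical threshold for "temperature is zero"
  std::vector<float> probs(logits.size(), 0.0f);
  for (std::size_t row = 0; row < temperatures.size(); ++row) {
    const float* in = logits.data() + row * vocab_size;
    float* out = probs.data() + row * vocab_size;
    const float* max_it = std::max_element(in, in + vocab_size);
    if (temperatures[row] < kEps) {
      // Assumed greedy case: all probability mass on the largest logit.
      out[max_it - in] = 1.0f;
      continue;
    }
    // Subtract the row maximum before exponentiating for numerical stability.
    float sum = 0.0f;
    for (int v = 0; v < vocab_size; ++v) {
      out[v] = std::exp((in[v] - *max_it) / temperatures[row]);
      sum += out[v];
    }
    for (int v = 0; v < vocab_size; ++v) {
      out[v] /= sum;
    }
  }
  return probs;
}
```

In this sketch, entries that the mask step set to the lowest float value exponentiate to zero, so masked tokens receive no probability mass.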
Both methods follow the same dual-stream pattern: data is constructed on the CPU and copied to the GPU on the copy stream, the copy stream is synchronized with the compute stream, and the GPU kernels are then launched on the compute stream.
Usage
Use LogitProcessor within the serving engine after model inference produces raw logits and before sampling. It is created via Model::CreateLogitProcessor and is called for both prefill and decode steps.
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: cpp/serve/logit_processor.cc
- Lines: 1-506
Signature
class LogitProcessorImpl : public LogitProcessorObj {
 public:
  explicit LogitProcessorImpl(int max_num_token, int vocab_size,
                              FunctionTable* ft, DLDevice device,
                              Optional<EventTraceRecorder> trace_recorder);
  ~LogitProcessorImpl();

  void InplaceUpdateLogits(
      Tensor logits,
      const Array<GenerationConfig>& generation_cfg,
      const Array<RequestModelState>& mstates,
      const Array<String>& request_ids,
      const std::vector<int>* cum_num_token,
      const Array<RequestModelState>* draft_mstates,
      const std::vector<std::vector<int>>* draft_token_indices) final;

  Tensor ComputeProbsFromLogits(
      Tensor logits,
      const Array<GenerationConfig>& generation_cfg,
      const Array<String>& request_ids,
      const std::vector<int>* cum_num_token) final;
};

// Constructor via the LogitProcessor wrapper
LogitProcessor::LogitProcessor(int max_num_token, int vocab_size,
                               FunctionTable* ft, DLDevice device,
                               Optional<EventTraceRecorder> trace_recorder);
Import
#include "logit_processor.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| logits | Tensor | Yes | Raw model output logits of shape (num_tokens, vocab_size), float32 |
| generation_cfg | Array<GenerationConfig> | Yes | Per-sequence generation configs with temperature, penalties, logit bias |
| mstates | Array<RequestModelState> | Yes | Per-sequence model states tracking appeared tokens and grammar matchers |
| request_ids | Array<String> | Yes | Request IDs for event tracing |
| cum_num_token | const std::vector<int>* | No | Cumulative token counts for multi-token sequences (nullptr for single-token) |
| draft_mstates | const Array<RequestModelState>* | No | Draft model states for speculative decoding (nullptr if not used) |
| draft_token_indices | const std::vector<std::vector<int>>* | No | Draft token tree indices for speculative decoding (nullptr if not used) |
| max_num_token | int | Yes (constructor) | Maximum number of tokens supported in a single batch |
| vocab_size | int | Yes (constructor) | Vocabulary size of the model |
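
As an illustration of how cumulative token counts can relate sequences to logit rows, the sketch below assumes a prefix-sum layout with a leading zero, so that sequence s owns rows [cum[s], cum[s + 1]). The source does not spell out this layout, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Build prefix-sum offsets from per-sequence token counts.
// Assumed layout: cum[0] == 0 and cum.size() == number of sequences + 1.
std::vector<int> BuildCumNumToken(const std::vector<int>& tokens_per_seq) {
  std::vector<int> cum(1, 0);
  for (int n : tokens_per_seq) {
    cum.push_back(cum.back() + n);
  }
  return cum;
}

// Expand the offsets into a row -> sequence index mapping, so that
// per-sequence settings can be looked up for every logit row.
std::vector<int> BuildRowToSeq(const std::vector<int>& cum) {
  std::vector<int> row_to_seq;
  for (std::size_t s = 0; s + 1 < cum.size(); ++s) {
    for (int row = cum[s]; row < cum[s + 1]; ++row) {
      row_to_seq.push_back(static_cast<int>(s));
    }
  }
  return row_to_seq;
}
```

Under this assumption, the single-token case needs no offsets at all because row i belongs to sequence i, which is consistent with passing nullptr.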
Outputs
| Name | Type | Description |
|---|---|---|
| logits (modified in-place) | Tensor | The input logits tensor is modified in-place with applied biases, penalties, and masks |
| probs | Tensor | Probability distribution of shape (num_tokens, vocab_size) after temperature-scaled softmax |
Usage Examples
// Create a logit processor
LogitProcessor logit_processor = model->CreateLogitProcessor(
    max_num_token, trace_recorder);

// After model forward pass produces logits:
// 1. Apply logit bias, penalties, and masks in-place
logit_processor->InplaceUpdateLogits(
    logits, generation_configs, mstates, request_ids,
    &cum_num_token, nullptr, nullptr);

// 2. Convert to probability distribution
Tensor probs = logit_processor->ComputeProbsFromLogits(
    logits, generation_configs, request_ids, &cum_num_token);

// 3. Feed probs to the sampler
SampleResult result = sampler->BatchSampleTokens(probs, ...);