
Implementation:Sgl project Sglang CPU RoPE

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels
Last Updated 2026-02-10 00:00 GMT

Overview

Implements CPU-optimized Rotary Position Embedding (RoPE) kernels supporting multiple tensor layouts (3D, 4D) and embedding styles (standard interleaved and NeoX-style split-half) for position encoding in transformer models.

Description

This file provides three internal kernel implementations:

  • rotary_embedding_3D_kernel_impl -- Handles the standard 3D tensor layout [num_tokens, num_heads, head_size], applying cosine-sine rotation pairs element-wise to query and key tensors. It parallelizes across num_tokens * num_heads and looks up position-dependent cos/sin values from a precomputed cos_sin_cache. Uses scalar element-by-element rotation: out1 = in1 * cos - in2 * sin, out2 = in2 * cos + in1 * sin.
  • rotary_embedding_neox_4D_kernel_impl -- Handles the 4D layout [batch, seq_len, num_heads, head_size] with NeoX-style embedding where the rotary dimension is split into two halves (first half and second half) rather than interleaved pairs. Uses SIMD vectorization with at::vec::Vectorized for the cos/sin multiply-add operations, processing bVecSize elements at a time with float32 intermediate computation.
  • rotary_embedding_4D_kernel_impl -- Standard interleaved rotation for 4D input tensors.

These kernels support separate rotary dimensions for query and key, enabling architectures like DeepSeek MLA where key_rotary_dim differs from query_rotary_dim. The single public API function rotary_embedding_cpu dispatches to the appropriate kernel based on input dimensionality and the is_neox flag.
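The dispatch decision can be summarized as follows. This is a hedged sketch of the selection logic only; `pick_kernel` and the `Kernel` enum are hypothetical names, and the real rotary_embedding_cpu also validates shapes and reshapes 2D inputs before dispatching.

```cpp
#include <cstdint>

// Illustrative sketch: which kernel the public API would select,
// given the query tensor's dimensionality and the is_neox flag.
enum class Kernel { Rotary3D, Rotary4D, RotaryNeox4D };

Kernel pick_kernel(int64_t query_dim, bool is_neox) {
  if (query_dim == 4) {
    // 4D [batch, seq_len, num_heads, head_size]: NeoX vs. interleaved.
    return is_neox ? Kernel::RotaryNeox4D : Kernel::Rotary4D;
  }
  // 2D inputs are viewed as 3D [num_tokens, num_heads, head_size] first.
  return Kernel::Rotary3D;
}
```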

Usage

Use these kernels for all RoPE operations during CPU LLM inference. RoPE appears in virtually all modern LLM architectures (LLaMA, Mistral, Qwen, DeepSeek). The 3D layout is used during serving (token-by-token decode), while the 4D layout is used for batched prefill. The NeoX variant is needed for GPT-NeoX-derived architectures.

Code Reference

Source Location

Signature

// Public API
std::tuple<at::Tensor, at::Tensor> rotary_embedding_cpu(
    at::Tensor& positions,
    at::Tensor& query,
    at::Tensor& key,
    int64_t head_size,
    at::Tensor& cos_sin_cache,
    bool is_neox);

// Internal kernels
template <typename scalar_t>
void rotary_embedding_3D_kernel_impl(
    scalar_t* __restrict__ query_out,
    scalar_t* __restrict__ key_out,
    int64_t* __restrict__ positions,
    scalar_t* __restrict__ query,
    scalar_t* __restrict__ key,
    scalar_t* __restrict__ cos_sin_cache,
    int64_t num_tokens, int64_t num_heads,
    int64_t num_kv_heads, int64_t head_size,
    int64_t rotary_dim,
    int64_t query_stride_s, int64_t query_out_stride_s,
    int64_t key_out_stride_s, int64_t key_stride_s,
    int64_t query_stride_h, int64_t query_out_stride_h);

template <typename scalar_t>
void rotary_embedding_neox_4D_kernel_impl(
    int64_t* __restrict__ positions,
    scalar_t* __restrict__ query,
    scalar_t* __restrict__ key,
    scalar_t* __restrict__ cos_sin_cache,
    int64_t rotary_dim,
    int64_t query_stride_b, int64_t query_stride_s,
    int64_t query_stride_h,
    int64_t key_stride_b, int64_t key_stride_s,
    int64_t key_stride_h,
    int64_t num_heads, int64_t num_kv_heads,
    int64_t head_size,
    int64_t batch_size, int64_t seq_len);

Import

#include "common.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
positions at::Tensor [num_tokens] Yes Token position indices for RoPE lookup (int64)
query at::Tensor Yes Query tensor in 2D [num_tokens, num_heads*head_size], 3D [num_tokens, num_heads, head_size], or 4D [batch, seq_len, num_heads, head_size]
key at::Tensor Yes Key tensor matching query layout
head_size int64_t Yes Size of each attention head
cos_sin_cache at::Tensor [max_pos, rotary_dim] Yes Precomputed cosine and sine values indexed by position
is_neox bool Yes Whether to use NeoX-style (split-half) rotation instead of interleaved

Outputs

Name Type Description
query_out at::Tensor Query with rotary position embedding applied (same shape as input query)
key_out at::Tensor Key with rotary position embedding applied (same shape as input key)
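For reference, one row of a cos_sin_cache of shape [max_pos, rotary_dim] could be built as sketched below. This is an assumption-laden illustration, not taken from the source: it assumes the common half-cos/half-sin row layout and the standard RoPE frequency base of 10000, and `make_cache_row` is a hypothetical name; in practice the cache is precomputed by the model's rotary embedding module.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: one cache row [rotary_dim] for position `pos`, laid out as
// [cos(pos*f_0)..cos(pos*f_{d/2-1}), sin(pos*f_0)..sin(pos*f_{d/2-1})]
// with f_i = base^(-2i/rotary_dim). Layout and base are assumptions.
std::vector<float> make_cache_row(int64_t pos, int64_t rotary_dim,
                                  double base = 10000.0) {
  std::vector<float> row(rotary_dim);
  for (int64_t i = 0; i < rotary_dim / 2; ++i) {
    double freq = std::pow(base, -2.0 * static_cast<double>(i) / rotary_dim);
    row[i] = static_cast<float>(std::cos(pos * freq));
    row[rotary_dim / 2 + i] = static_cast<float>(std::sin(pos * freq));
  }
  return row;
}
```

Position 0 yields cos = 1 and sin = 0 everywhere, so applying RoPE at position 0 is the identity, which is a convenient sanity check for the kernels above.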

Usage Examples

// Standard RoPE for 3D tensors (serving mode)
auto [query_out, key_out] = rotary_embedding_cpu(
    positions,         // [num_tokens] int64
    query,             // [num_tokens, num_heads, head_size]
    key,               // [num_tokens, num_kv_heads, head_size]
    /*head_size=*/128,
    cos_sin_cache,     // [max_pos, rotary_dim]
    /*is_neox=*/false);

// NeoX-style RoPE for 4D tensors (batch prefill)
auto [q_out, k_out] = rotary_embedding_cpu(
    positions,         // [batch * seq_len] int64
    query,             // [batch, seq_len, num_heads, head_size]
    key,               // [batch, seq_len, num_kv_heads, head_size]
    /*head_size=*/128,
    cos_sin_cache,     // [max_pos, rotary_dim]
    /*is_neox=*/true);
