Implementation:InternLM Lmdeploy Core Sync

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Synchronization
Last Updated	2026-02-07 15:00 GMT

Overview

GPU semaphore primitives using PTX acquire/release memory ordering for inter-CTA synchronization.

Description

This header implements a lightweight semaphore mechanism for synchronizing CUDA thread blocks (CTAs) via global memory. sem_fetch() performs an acquire-ordered global load of a lock variable, reading the semaphore state with GPU-scope visibility (SM70+) or cache-global ordering (older architectures). sem_wait() spins until all threads in the block observe the expected status value, using __syncthreads_and to ensure collective agreement. sem_wait_many() is a variant that waits until a specified count of threads observe a truthy state using __syncthreads_count. sem_post() issues a __syncthreads memory fence followed by a release-ordered global store to signal other CTAs. This pattern enables pipelined execution across CTAs in split-K GEMM or multi-stage kernels.

Usage

Use these semaphore primitives when implementing multi-CTA cooperation patterns such as split-K GEMM reduction, where one CTA must wait for another to finish writing partial results before accumulating them.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/core/sync.h

Signature

__device__ int sem_fetch(int* lock, bool pred);
__device__ void sem_wait(int* lock, int status, bool pred);
__device__ void sem_wait_many(int* lock, int count, bool pred);
__device__ void sem_post(int* lock, int status, bool pred);

Import

#include "src/turbomind/kernels/core/sync.h"

I/O Contract

Inputs

Name	Type	Required	Description
lock	int*	Yes	Pointer to global memory semaphore variable
status	int	Yes	Expected value to wait for or value to post
count	int	Yes	Number of threads that must observe truthy state (sem_wait_many)
pred	bool	Yes	Whether this thread participates in the memory operation

Outputs

Name	Type	Description
sem_fetch return	int	Current value of the semaphore lock

Usage Examples

using namespace turbomind;

// Wait for predecessor CTA to signal completion
bool is_leader = threadIdx.x == 0;
sem_wait(semaphore_ptr, expected_phase, is_leader);

// Perform work...

// Signal successor CTA
sem_post(semaphore_ptr, next_phase, is_leader);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment