Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:InternLM Lmdeploy Core Sync

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Synchronization
Last Updated 2026-02-07 15:00 GMT

Overview

GPU semaphore primitives using PTX acquire/release memory ordering for inter-CTA synchronization.

Description

This header implements a lightweight semaphore mechanism for synchronizing CUDA thread blocks (CTAs) via global memory. sem_fetch() performs an acquire-ordered global load of a lock variable, reading the semaphore state with GPU-scope visibility (SM70+) or cache-global ordering (older architectures). sem_wait() spins until all threads in the block observe the expected status value, using __syncthreads_and to ensure collective agreement. sem_wait_many() is a variant that waits until a specified count of threads observe a truthy state using __syncthreads_count. sem_post() issues a __syncthreads memory fence followed by a release-ordered global store to signal other CTAs. This pattern enables pipelined execution across CTAs in split-K GEMM or multi-stage kernels.

Usage

Use these semaphore primitives when implementing multi-CTA cooperation patterns such as split-K GEMM reduction, where one CTA must wait for another to finish writing partial results before accumulating them.

Code Reference

Source Location

Signature

__device__ int sem_fetch(int* lock, bool pred);
__device__ void sem_wait(int* lock, int status, bool pred);
__device__ void sem_wait_many(int* lock, int count, bool pred);
__device__ void sem_post(int* lock, int status, bool pred);

Import

#include "src/turbomind/kernels/core/sync.h"

I/O Contract

Inputs

Name Type Required Description
lock int* Yes Pointer to global memory semaphore variable
status int Yes Expected value to wait for or value to post
count int Yes Number of threads that must observe truthy state (sem_wait_many)
pred bool Yes Whether this thread participates in the memory operation

Outputs

Name Type Description
sem_fetch return int Current value of the semaphore lock

Usage Examples

using namespace turbomind;

// Wait for predecessor CTA to signal completion
bool is_leader = threadIdx.x == 0;
sem_wait(semaphore_ptr, expected_phase, is_leader);

// Perform work...

// Signal successor CTA
sem_post(semaphore_ptr, next_phase, is_leader);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment