Implementation:InternLM Lmdeploy Core Sync
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Synchronization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
GPU semaphore primitives using PTX acquire/release memory ordering for inter-CTA synchronization.
Description
This header implements a lightweight semaphore mechanism for synchronizing CUDA thread blocks (CTAs) via global memory. sem_fetch() performs an acquire-ordered global load of a lock variable, reading the semaphore state with GPU-scope visibility (SM70+) or cache-global ordering (older architectures). sem_wait() spins until all threads in the block observe the expected status value, using __syncthreads_and to ensure collective agreement. sem_wait_many() is a variant that waits until a specified count of threads observe a truthy state using __syncthreads_count. sem_post() issues a __syncthreads memory fence followed by a release-ordered global store to signal other CTAs. This pattern enables pipelined execution across CTAs in split-K GEMM or multi-stage kernels.
Usage
Use these semaphore primitives when implementing multi-CTA cooperation patterns such as split-K GEMM reduction, where one CTA must wait for another to finish writing partial results before accumulating them.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/core/sync.h
Signature
__device__ int sem_fetch(int* lock, bool pred);
__device__ void sem_wait(int* lock, int status, bool pred);
__device__ void sem_wait_many(int* lock, int count, bool pred);
__device__ void sem_post(int* lock, int status, bool pred);
Import
#include "src/turbomind/kernels/core/sync.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lock | int* | Yes | Pointer to global memory semaphore variable |
| status | int | Yes | Expected value to wait for or value to post |
| count | int | Yes | Number of threads that must observe truthy state (sem_wait_many) |
| pred | bool | Yes | Whether this thread participates in the memory operation |
Outputs
| Name | Type | Description |
|---|---|---|
| sem_fetch return | int | Current value of the semaphore lock |
Usage Examples
using namespace turbomind;
// Wait for predecessor CTA to signal completion
bool is_leader = threadIdx.x == 0;
sem_wait(semaphore_ptr, expected_phase, is_leader);
// Perform work...
// Signal successor CTA
sem_post(semaphore_ptr, next_phase, is_leader);