Implementation:NVIDIA DALI ScratchCopyImpl
| Knowledge Sources | |
|---|---|
| Domains | Kernels, GPU_Computing |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Implements the ToContiguousHostMem and ToContiguousGPUMem functions that pack multiple collections into a single contiguous scratchpad-allocated buffer with proper alignment.
Description
The scratch_copy_impl.h header provides the implementation of two key functions declared in context.h: ToContiguousHostMem and ToContiguousGPUMem. These functions take a variadic set of collections, compute aligned offsets for packing them into a single contiguous buffer, allocate that buffer from a Scratchpad, and copy the data. The result is a tuple of typed pointers to each collection's data within the contiguous buffer.
The internal detail namespace contains the building blocks: GetCollectionOffsets computes the byte offset for each collection within the buffer respecting each type's alignment requirements; copy_to_buffer copies collection contents into the buffer at the computed offsets; and GetCollectionPtrs reconstructs typed pointers from the base address and offsets. A variadic_max utility is used to determine the maximum alignment across all collection element types.
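The offset computation described above can be sketched in standalone C++. This is a minimal illustration and not the DALI source: `align_up` and `collection_offsets` are hypothetical names, the collections are assumed to expose `value_type` and `size()`, and storing the total byte count in the final slot is an assumption made here for convenience.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round `offset` up to a multiple of `alignment`,
// which must be a power of two.
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Recursion terminator: no collections left, so record the total size
// (an assumption of this sketch).
inline void collection_offsets(std::size_t base, std::size_t *offsets) {
  *offsets = base;
}

// Align the running offset for the first collection's element type,
// record it, then continue with the remaining collections.
template <typename Collection, typename... Collections>
void collection_offsets(std::size_t base, std::size_t *offsets,
                        const Collection &c, const Collections &...tail) {
  using T = typename Collection::value_type;
  base = align_up(base, alignof(T));
  *offsets = base;
  collection_offsets(base + c.size() * sizeof(T), offsets + 1, tail...);
}

// Maximum of a non-empty argument list, usable for picking the strictest
// alignment among all element types.
template <typename T>
constexpr T variadic_max(T t) { return t; }

template <typename T0, typename... T>
constexpr auto variadic_max(T0 t0, T... tail) {
  auto rest = variadic_max(tail...);
  return t0 > rest ? t0 : rest;
}
```

For a 3-element `std::vector<char>`, a 2-element `std::vector<double>` and a 5-element `std::vector<int>`, on a platform where `alignof(double)` is 8 and `sizeof(int)` is 4, this yields offsets 0, 8 and 24 with a total of 44 bytes; the 5 bytes following the `char` data are padding.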
ToContiguousHostMem allocates a single host buffer from the scratchpad and copies all collections into it using std::copy. ToContiguousGPUMem stages the data through host memory: it allocates a temporary pinned host buffer (using mm::alloc_raw_async_unique), copies the collections into it, allocates a device buffer from the scratchpad, and issues a single cudaMemcpyAsync on the given stream to transfer all data to the GPU at once. Packing everything into one contiguous copy reduces the number of host-to-device transfers to one, regardless of how many collections are passed. Both functions enforce at compile time, via static_assert, that all collection element types are trivially copyable.
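The host-side packing flow (compute offsets, allocate once, copy, return typed pointers) can be illustrated with a self-contained C++17 sketch. Everything named here is hypothetical: `ToyScratchpad` stands in for the real Scratchpad, `pack_contiguous` and `place` are not DALI functions, and the collections are assumed to provide `value_type`, `size()` and `data()`. It uses `std::memcpy` rather than `std::copy`, and it covers only the host case; the GPU variant would additionally stage the packed block in pinned memory and copy it to the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <tuple>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a scratchpad. It serves a single allocation;
// calling alloc() again would invalidate earlier pointers.
struct ToyScratchpad {
  std::vector<std::max_align_t> storage;
  char *alloc(std::size_t bytes) {
    constexpr std::size_t unit = sizeof(std::max_align_t);
    storage.resize((bytes + unit - 1) / unit);
    return reinterpret_cast<char *>(storage.data());
  }
};

template <typename Collection>
using elem_t = typename Collection::value_type;

// Copies one collection to base + offset and returns a typed pointer to the copy.
template <typename Collection>
elem_t<Collection> *place(char *base, std::size_t offset, const Collection &c) {
  using T = elem_t<Collection>;
  assert(offset % alignof(T) == 0);
  if (c.size() != 0)
    std::memcpy(base + offset, c.data(), c.size() * sizeof(T));
  return reinterpret_cast<T *>(base + offset);
}

template <typename... Collections>
std::tuple<elem_t<Collections> *...>
pack_contiguous(ToyScratchpad &scratch, const Collections &...c) {
  static_assert((std::is_trivially_copyable_v<elem_t<Collections>> && ...),
                "collection elements must be trivially copyable");
  std::size_t offsets[sizeof...(Collections) + 1] = {};
  std::size_t pos = 0, i = 0;
  // Pass 1: assign each collection an offset aligned for its element type.
  auto reserve = [&](std::size_t align, std::size_t bytes) {
    pos = (pos + align - 1) & ~(align - 1);
    offsets[i++] = pos;
    pos += bytes;
  };
  (reserve(alignof(elem_t<Collections>), c.size() * sizeof(elem_t<Collections>)), ...);
  // Pass 2: one allocation for everything, then copy each collection into place.
  char *base = scratch.alloc(pos);
  i = 0;
  return std::tuple<elem_t<Collections> *...>{place(base, offsets[i++], c)...};
}
```

The braced initializer in the return statement guarantees left-to-right evaluation, so `offsets[i++]` pairs each collection with its own offset. Backing the toy scratchpad with `std::max_align_t` elements keeps the base address aligned for any fundamental type, which is why aligning the offsets alone is sufficient in this sketch.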
Usage
Use ToContiguousHostMem when a kernel needs multiple host-side arrays packed into a single contiguous allocation, avoiding fragmented scratchpad usage. Use ToContiguousGPUMem when multiple small host collections (such as shape arrays, offsets, or parameter vectors) need to be transferred to the GPU: packing them replaces one cudaMemcpyAsync call per collection with a single call, so the fixed per-transfer overhead is paid only once. These functions are accessible through the Scratchpad::ToContiguousHost and Scratchpad::ToContiguousGPU convenience methods.
Code Reference
Source Location
- Repository: NVIDIA_DALI
- File: dali/kernels/scratch_copy_impl.h
- Lines: 1-152
Signature
namespace detail {
inline void copy_to_buffer(char *buffer, const size_t *offsets);
template <typename Collection, typename... Collections>
void copy_to_buffer(char *buffer,
                    const size_t *offsets,
                    const Collection &c,
                    const Collections &... tail);
inline void GetCollectionOffsets(size_t base, size_t *offsets);
template <typename Collection, typename... Collections>
void GetCollectionOffsets(size_t base, size_t *offsets,
                          const Collection &c,
                          const Collections &...tail);
constexpr std::tuple<> GetCollectionPtrs(void *base, const size_t *offsets);
template <typename Collection, typename... Collections>
auto GetCollectionPtrs(void *base, const size_t *offsets,
                       const Collection &c,
                       const Collections &...tail);
template <typename T>
T variadic_max(T t);
template <typename T0, typename... T>
auto variadic_max(T0 t0, T... tail);
}  // namespace detail
template <typename... Collections>
std::tuple<std::remove_cv_t<element_t<Collections>>*...>
ToContiguousHostMem(Scratchpad &scratchpad, const Collections &... c);
template <typename... Collections>
std::tuple<std::remove_cv_t<element_t<Collections>>*...>
ToContiguousGPUMem(Scratchpad &scratchpad, cudaStream_t stream, const Collections &... c);
Import
#include "dali/kernels/scratch_copy_impl.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| scratchpad | Scratchpad& | Yes | Scratchpad used to allocate the contiguous buffer (host or device) |
| stream | cudaStream_t | Yes (GPU only) | CUDA stream for the async host-to-device memory copy |
| c / collections | const Collections&... | Yes | One or more collections of trivially copyable elements to pack into a contiguous buffer |
Outputs
| Name | Type | Description |
|---|---|---|
| (tuple of pointers) | std::tuple<T*...> | Tuple of typed pointers, one per input collection, pointing to the corresponding data within the contiguous buffer |
Usage Examples
Packing Multiple Arrays to GPU
#include "dali/kernels/context.h"
void MyKernel::Run(KernelContext &ctx,
                   const OutListGPU<float, 3> &out,
                   const InListGPU<float, 3> &in) {
  std::vector<int> offsets = {0, 100, 200, 300};
  std::vector<float> scales = {1.0f, 2.0f, 0.5f, 1.5f};
  std::vector<int> flags = {1, 0, 1, 0};
  // Pack all three arrays into one contiguous GPU buffer with a single transfer
  auto [gpu_offsets, gpu_scales, gpu_flags] =
      ctx.scratchpad->ToContiguousGPU(ctx.gpu.stream, offsets, scales, flags);
  // gpu_offsets, gpu_scales, gpu_flags are device pointers
  // ready to use in GPU kernels
}
Packing Arrays to Host Memory
#include "dali/kernels/context.h"
// The function templates are defined in this header (see Import above)
#include "dali/kernels/scratch_copy_impl.h"
void ProcessOnHost(Scratchpad &scratch) {
  std::vector<double> weights = {0.1, 0.2, 0.3};
  std::vector<int> indices = {5, 10, 15};
  // Allocate a single contiguous host buffer and copy both collections
  auto [host_weights, host_indices] =
      ToContiguousHostMem(scratch, weights, indices);
  // host_weights and host_indices point into the same contiguous buffer
}