Implementation:NVIDIA DALI ScratchCopyImpl

Knowledge Sources	NVIDIA_DALI
Domains	Kernels, GPU_Computing
Last Updated	2026-02-08 16:00 GMT

Overview

Implements the ToContiguousHostMem and ToContiguousGPUMem functions, which pack multiple collections into a single contiguous buffer allocated from a scratchpad, placing each collection at an offset that satisfies its element type's alignment.

Description

The scratch_copy_impl.h header provides the implementation of two key functions declared in context.h: ToContiguousHostMem and ToContiguousGPUMem. These functions take a variadic set of collections, compute aligned offsets for packing them into a single contiguous buffer, allocate that buffer from a Scratchpad, and copy the data. The result is a tuple of typed pointers to each collection's data within the contiguous buffer.

The internal detail namespace contains the building blocks: GetCollectionOffsets computes the byte offset for each collection within the buffer respecting each type's alignment requirements; copy_to_buffer copies collection contents into the buffer at the computed offsets; and GetCollectionPtrs reconstructs typed pointers from the base address and offsets. A variadic_max utility is used to determine the maximum alignment across all collection element types.
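The offset computation described above can be sketched in standalone C++. This is a minimal illustration, not the DALI source: the names AlignUp, ComputeOffsets and VariadicMax are hypothetical stand-ins, and the sketch assumes each collection exposes value_type and size() (as std::vector does) rather than using DALI's own element_t and size helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: rounds `offset` up to the next multiple of
// `alignment`, which must be a nonzero power of two.
inline std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Base case: no collections left. The final slot receives the end offset,
// which is the total number of bytes the packed buffer needs.
inline void ComputeOffsets(std::size_t base, std::size_t *offsets) {
  *offsets = base;
}

// Recursive case: align the running offset for this collection's element
// type, record it, then advance past the collection's bytes.
template <typename Collection, typename... Collections>
void ComputeOffsets(std::size_t base, std::size_t *offsets,
                    const Collection &c, const Collections &...tail) {
  using T = typename Collection::value_type;  // assumption: std-like container
  base = AlignUp(base, alignof(T));
  *offsets = base;
  ComputeOffsets(base + c.size() * sizeof(T), offsets + 1, tail...);
}

// Maximum of a variadic argument list, e.g. the strictest alignment
// among all element types.
template <typename T>
constexpr T VariadicMax(T t) {
  return t;
}

template <typename T0, typename... T>
constexpr auto VariadicMax(T0 t0, T... tail) {
  auto rest = VariadicMax(tail...);
  return t0 > rest ? t0 : rest;
}
```

Calling ComputeOffsets(0, offs, a, b, c) with a std::vector<char> of 3 elements, a std::vector<double> of 2 elements and a std::vector<int> of 1 element fills a four-entry array: one offset per collection plus the total size. On a platform where alignof(double) is 8 and sizeof(int) is 4, the second collection starts at byte 8 rather than byte 3 because of the padding inserted for double.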

ToContiguousHostMem allocates a single host buffer from the scratchpad and copies all collections into it using std::copy. ToContiguousGPUMem adds a staging step: it allocates a temporary pinned host buffer (using mm::alloc_raw_async_unique), copies the collections there, then allocates a device buffer from the scratchpad and issues a single cudaMemcpyAsync to transfer all data to the GPU at once. Because everything is packed into one contiguous block, the number of host-to-device transfers stays at one regardless of how many collections are passed. Both functions enforce at compile time, via static_assert, that all collection element types are trivially copyable.
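The stage-then-transfer pattern can be illustrated without CUDA. The following is a host-only sketch under stated assumptions: a std::vector<char> stands in for the device buffer, a single std::memcpy stands in for the single cudaMemcpyAsync, and every name in the sketch namespace is hypothetical. It is not the DALI implementation, which allocates from a Scratchpad and uses pinned staging memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <tuple>
#include <type_traits>
#include <vector>

namespace sketch {  // hypothetical names, for illustration only

inline std::size_t AlignUp(std::size_t x, std::size_t a) {
  assert(a != 0 && (a & (a - 1)) == 0);  // power-of-two alignment
  return (x + a - 1) & ~(a - 1);
}

// Aligned byte offset per collection; the last slot gets the total size.
inline void Offsets(std::size_t base, std::size_t *out) { *out = base; }

template <typename C, typename... Cs>
void Offsets(std::size_t base, std::size_t *out, const C &c, const Cs &...tail) {
  using T = typename C::value_type;
  base = AlignUp(base, alignof(T));
  *out = base;
  Offsets(base + c.size() * sizeof(T), out + 1, tail...);
}

// Copies each collection's bytes into the buffer at its offset.
inline void CopyAll(char *, const std::size_t *) {}

template <typename C, typename... Cs>
void CopyAll(char *buf, const std::size_t *offs, const C &c, const Cs &...tail) {
  using T = typename C::value_type;
  if (!c.empty())
    std::memcpy(buf + *offs, c.data(), c.size() * sizeof(T));
  CopyAll(buf, offs + 1, tail...);
}

// Rebuilds typed pointers from the base address and the offsets.
inline std::tuple<> Ptrs(char *, const std::size_t *) { return {}; }

template <typename C, typename... Cs>
auto Ptrs(char *base, const std::size_t *offs, const C &, const Cs &...tail) {
  using T = typename C::value_type;
  return std::tuple_cat(std::make_tuple(reinterpret_cast<T *>(base + *offs)),
                        Ptrs(base, offs + 1, tail...));
}

// Packs all collections into a staging buffer, then moves them to `dst`
// with ONE copy. `dst` is a stand-in for device memory in this sketch.
template <typename... Cs>
auto PackWithSingleCopy(std::vector<char> &dst, const Cs &...cs) {
  static_assert((std::is_trivially_copyable_v<typename Cs::value_type> && ...),
                "elements must be trivially copyable to be copied as raw bytes");
  std::size_t offs[sizeof...(Cs) + 1];
  Offsets(0, offs, cs...);
  const std::size_t total = offs[sizeof...(Cs)];

  std::vector<char> staging(total);      // stand-in for the pinned host buffer
  CopyAll(staging.data(), offs, cs...);  // many small host-side copies

  // std::vector<char> storage comes from operator new, which is aligned
  // for fundamental types; that is enough for this sketch.
  dst.resize(total);
  if (total != 0)
    std::memcpy(dst.data(), staging.data(), total);  // the single "transfer"
  return Ptrs(dst.data(), offs, cs...);
}

}  // namespace sketch
```

The design point this illustrates is that the per-collection work happens entirely on the host side, so the cost of crossing to the destination is paid once for the whole packed block. The returned pointers alias dst, so they remain valid only while dst is neither resized nor destroyed.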

Usage

Use ToContiguousHostMem when a kernel needs multiple host-side arrays packed into a single contiguous allocation, avoiding fragmented scratchpad usage. Use ToContiguousGPUMem when multiple small host collections (such as shape arrays, offsets, or parameter vectors) need to be transferred to the GPU: packing them into a single transfer avoids paying the fixed per-call overhead of a separate cudaMemcpyAsync for each collection. These functions are accessible through the Scratchpad::ToContiguousHost and Scratchpad::ToContiguousGPU convenience methods.

Code Reference

Source Location

Repository: NVIDIA_DALI
File: dali/kernels/scratch_copy_impl.h
Lines: 1-152

Signature

namespace detail {

inline void copy_to_buffer(char *buffer, const size_t *offsets);

template <typename Collection, typename... Collections>
void copy_to_buffer(char *buffer,
                    const size_t *offsets,
                    const Collection &c,
                    const Collections &... tail);

inline void GetCollectionOffsets(size_t base, size_t *offsets);

template <typename Collection, typename... Collections>
void GetCollectionOffsets(size_t base, size_t *offsets,
                          const Collection &c,
                          const Collections &...tail);

constexpr std::tuple<> GetCollectionPtrs(void *base, const size_t *offsets);

template <typename Collection, typename... Collections>
auto GetCollectionPtrs(void *base, const size_t *offsets,
                       const Collection &c,
                       const Collections &...tail);

template <typename T>
T variadic_max(T t);

template <typename T0, typename... T>
auto variadic_max(T0 t0, T... tail);

}  // namespace detail

template <typename... Collections>
std::tuple<std::remove_cv_t<element_t<Collections>>*...>
ToContiguousHostMem(Scratchpad &scratchpad, const Collections &... c);

template <typename... Collections>
std::tuple<std::remove_cv_t<element_t<Collections>>*...>
ToContiguousGPUMem(Scratchpad &scratchpad, cudaStream_t stream, const Collections &... c);

Import

#include "dali/kernels/scratch_copy_impl.h"

I/O Contract

Inputs

Name	Type	Required	Description
scratchpad	`Scratchpad&`	Yes	Scratchpad used to allocate the contiguous buffer (host or device)
stream	`cudaStream_t`	Yes (GPU only)	CUDA stream for the async host-to-device memory copy
c / collections	`const Collections&...`	Yes	One or more collections of trivially copyable elements to pack into a contiguous buffer

Outputs

Name	Type	Description
(tuple of pointers)	`std::tuple<std::remove_cv_t<element_t<Collections>>*...>`	Tuple of typed pointers, one per input collection and in the same order, each pointing to that collection's data within the contiguous buffer

Usage Examples

Packing Multiple Arrays to GPU

#include "dali/kernels/context.h"

void MyKernel::Run(KernelContext &ctx,
                   const OutListGPU<float, 3> &out,
                   const InListGPU<float, 3> &in) {
  std::vector<int> offsets = {0, 100, 200, 300};
  std::vector<float> scales = {1.0f, 2.0f, 0.5f, 1.5f};
  std::vector<int> flags = {1, 0, 1, 0};

  // Pack all three arrays into one contiguous GPU buffer with a single transfer
  auto [gpu_offsets, gpu_scales, gpu_flags] =
      ctx.scratchpad->ToContiguousGPU(ctx.gpu.stream, offsets, scales, flags);

  // gpu_offsets, gpu_scales, gpu_flags are device pointers
  // ready to use in GPU kernels
}

Packing Arrays to Host Memory

#include "dali/kernels/context.h"

void ProcessOnHost(Scratchpad &scratch) {
  std::vector<double> weights = {0.1, 0.2, 0.3};
  std::vector<int> indices = {5, 10, 15};

  // Allocate a single contiguous host buffer and copy both collections
  auto [host_weights, host_indices] =
      ToContiguousHostMem(scratch, weights, indices);

  // host_weights and host_indices point into the same contiguous buffer
}

Related Pages

Environment:NVIDIA_DALI_CUDA_GPU_Environment
