Principle:LaurentMazare Tch rs NumPy Format IO

Knowledge Sources	LaurentMazare_Tch_rs NumPy .npy Format Specification
Domains	Data Serialization, Interoperability, File Formats
Last Updated	2026-02-08 00:00 GMT

Overview

The NumPy file format provides a standardized binary encoding for array data with self-describing headers, enabling efficient cross-language data exchange between scientific computing ecosystems.

Description

The NumPy file format is a widely-used binary format for persisting array (tensor) data. It comes in two forms:

.npy files store a single array with a self-describing header. The header contains all metadata needed to reconstruct the array:

Data type (dtype) -- The element type and byte order (e.g., float32 little-endian)
Shape -- The dimensions of the array as a tuple of integers
Memory order -- Whether the data is stored in C (row-major) or Fortran (column-major) order

The header is followed by the raw array data as a contiguous block of bytes. This simple structure makes the format easy to implement in any language, not just Python.

.npz files store multiple named arrays in a zip archive. Each entry in the archive is a .npy file with a name corresponding to the array's key. This enables saving and loading collections of arrays (e.g., all parameters of a neural network) in a single file.

The format's key advantages are:

Self-describing -- No external schema needed; the header encodes all metadata
Language-agnostic -- Simple enough to implement in any language with binary I/O
Efficient -- Data is stored in raw binary form with minimal overhead
Widely supported -- De facto standard for array interchange in scientific computing

Usage

Apply NumPy format I/O when:

Exchanging tensor data between different programming languages or frameworks
Persisting preprocessed datasets or model weights in a portable format
Loading data that was generated by Python/NumPy pipelines
Saving intermediate computation results for later analysis

Theoretical Basis

.npy File Structure

A .npy file consists of three sections:

Magic bytes -- 6 bytes identifying the format: Failed to parse (syntax error): {\displaystyle \texttt{\\x93NUMPY}}
Header -- A Python dictionary literal (as ASCII text) specifying:
- $descr$ : dtype string (e.g., '<f4' for little-endian float32)
- $fortran_order$ : boolean for memory layout
- $shape$ : tuple of dimension sizes
Data -- Raw binary array data, $\prod_{i} d_{i} \times sizeof (dtype)$ bytes

Dtype Encoding

The dtype string encodes byte order, type character, and byte size:

Byte order: '<' (little-endian), '>' (big-endian), '=' (native)
Type character: 'f' (float), 'i' (signed int), 'u' (unsigned int), 'b' (bool)
Byte count: number of bytes per element

Example: '<f4' means little-endian 32-bit float (4 bytes).

.npz Archive Structure

An .npz file is a standard zip archive where:

Each entry has a filename like $array_name . npy$
Each entry's content is a complete .npy file
Arrays can be accessed by name without loading the entire archive

Memory Layout

For an array with shape $(d_{1}, d_{2}, \dots, d_{n})$ :

C order (row-major): The last index varies fastest. Element at index $(i_{1}, \dots, i_{n})$ has byte offset:

$offset = (\sum_{k = 1}^{n} i_{k} \prod_{j = k + 1}^{n} d_{j}) \times sizeof (dtype)$

Fortran order (column-major): The first index varies fastest. Element byte offset:

$offset = (\sum_{k = 1}^{n} i_{k} \prod_{j = 1}^{k - 1} d_{j}) \times sizeof (dtype)$

Related Pages

Implementation:LaurentMazare_Tch_rs_Tensor_Npy

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment