Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Huggingface Datasets Immutable Format Setting

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Creating a new dataset view with a different output format without modifying the original dataset object.

Description

Immutable Format Setting is the practice of creating a formatted copy of a dataset rather than modifying the original in place. This approach preserves the original dataset's format configuration while producing a new dataset object that returns data in the desired framework format (e.g., PyTorch tensors, NumPy arrays). The original and formatted datasets share the same underlying Arrow data, so the memory overhead of the copy is minimal.

This immutable approach is preferred in functional-style code and when multiple format views of the same data are needed simultaneously. It avoids the pitfalls of in-place mutation, such as unexpected side effects when a dataset reference is shared across multiple parts of a codebase.

Usage

Use Immutable Format Setting when:

  • You need multiple format views of the same dataset (e.g., one for PyTorch training and one for NumPy analysis).
  • You want to preserve the original dataset's format configuration for later use.
  • You are writing library code that should not mutate datasets passed to it.
  • You prefer a functional programming style where operations return new objects rather than modifying existing ones.
  • You need to temporarily access data in a different format and then continue with the original format.

Theoretical Basis

Immutable Format Setting implements the copy-on-write pattern combined with view semantics. In functional programming, immutability is a core principle that enables safe sharing and compositional reasoning. By returning a new dataset object with different format settings but shared underlying data, this pattern achieves the safety of immutability without the memory cost of full data duplication. This is analogous to database views that present the same underlying data in different formats without duplicating storage.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment