Heuristic:Huggingface Datasets Batch Size Optimization

From Leeroopedia
Knowledge Sources
Domains: Optimization, Data_Processing
Last Updated: 2026-02-14 19:00 GMT

Overview

The HuggingFace Datasets library implements an automatic record batch sizing optimization that selects different Arrow write batch sizes based on the feature types present in a dataset. Video datasets use a batch size of 10, image/audio/binary datasets use 100, and text/numeric datasets default to 1000. This tiered approach prevents Apache Arrow buffer overflows during dataset writing while keeping throughput high for lightweight data types.

Description

When the ArrowWriter is instantiated and no explicit writer_batch_size is provided, the library inspects the dataset's feature schema and selects an appropriate batch size through a three-step fallback chain:

  1. Use the caller-supplied writer_batch_size if one was provided.
  2. Call get_arrow_writer_batch_size_from_features() which walks the feature tree and returns the smallest batch size that matches any detected media type.
  3. Fall back to config.DEFAULT_MAX_BATCH_SIZE (1000) if no special feature type was found.
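
As a minimal sketch of the three steps above, the chain can be modeled with Python's `or` operator, which is also how the ArrowWriter excerpt in the Code Evidence section expresses it. The helper name `resolve_writer_batch_size` and its parameters are hypothetical and exist only for illustration; they are not part of the library.

```python
from typing import Optional

DEFAULT_MAX_BATCH_SIZE = 1000  # same value as config.DEFAULT_MAX_BATCH_SIZE


def resolve_writer_batch_size(
    explicit: Optional[int],
    from_features: Optional[int],
) -> int:
    """Hypothetical helper: return the first truthy candidate, else the default."""
    return explicit or from_features or DEFAULT_MAX_BATCH_SIZE


print(resolve_writer_batch_size(None, None))  # no hint at all: 1000
print(resolve_writer_batch_size(None, 10))    # feature-derived size: 10
print(resolve_writer_batch_size(500, 10))     # explicit value wins: 500
print(resolve_writer_batch_size(0, 100))      # 0 is falsy, so it is skipped: 100
```

Because `or` tests truthiness rather than `is None`, an explicit value of 0 behaves the same as not passing a value.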

The get_arrow_writer_batch_size_from_features() function uses a visitor pattern (_visit) to traverse all nested features. It starts with batch_size = np.inf and, for every feature that matches a media or binary type, applies min() against the corresponding constant. If a dataset contains both images and video, the smallest applicable value (10 for video) therefore wins. If no feature matches, the function returns None and the fallback in step 3 applies. The constants are:

  • Video: ARROW_RECORD_BATCH_SIZE_FOR_VIDEO_DATASETS = 10
  • Image: ARROW_RECORD_BATCH_SIZE_FOR_IMAGE_DATASETS = 100
  • Audio: ARROW_RECORD_BATCH_SIZE_FOR_AUDIO_DATASETS = 100
  • Binary: ARROW_RECORD_BATCH_SIZE_FOR_BINARY_DATASETS = 100
  • Text/Numeric (default): DEFAULT_MAX_BATCH_SIZE = 1000
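
The min() selection can be illustrated without the library. The sketch below is an assumption-laden stand-in: the schema is a plain nested dict/list whose leaves are kind names (a hypothetical representation, not the real Features class), and `smallest_batch_size` is an illustrative name.

```python
import math
from typing import Optional, Union

# Values copied from the constants listed above.
BATCH_SIZE_BY_KIND = {"video": 10, "image": 100, "audio": 100, "binary": 100}

# Hypothetical stand-in for a feature schema: nested dicts/lists with string leaves.
Schema = Union[str, list, dict]


def smallest_batch_size(schema: Schema) -> Optional[int]:
    """Walk the schema and keep the smallest batch size of any matching leaf."""
    best = math.inf

    def visit(node: Schema) -> None:
        nonlocal best
        if isinstance(node, dict):
            for child in node.values():
                visit(child)
        elif isinstance(node, list):
            for child in node:
                visit(child)
        elif node in BATCH_SIZE_BY_KIND:
            best = min(best, BATCH_SIZE_BY_KIND[node])

    visit(schema)
    # No media or binary leaf found: signal "no preference" like the real function.
    return None if best == math.inf else best


print(smallest_batch_size({"text": "string", "label": "int64"}))
print(smallest_batch_size({"img": "image", "caption": "string"}))
print(smallest_batch_size({"frames": ["image"], "clip": "video"}))
```

A text-only schema yields None, so the caller falls through to the default of 1000.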

Beyond Arrow record batches, the library also controls Parquet row group sizing through get_writer_batch_size_from_data_size(), which targets a maximum of 100MB uncompressed per row group. Row group size matters because reading a single row from a Parquet file requires loading its entire row group into memory. Similarly, Parquet shard files target a maximum of 500MB each.
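
To make the row group arithmetic concrete, here is a small sketch of the same formula. It assumes that "100MB" means 100 * 10**6 bytes (decimal megabytes); the name `rows_per_row_group` is hypothetical and the input sizes are made-up examples.

```python
MAX_ROW_GROUP_BYTES = 100 * 10**6  # assumption: "100MB" read as decimal megabytes


def rows_per_row_group(num_rows: int, num_bytes: int) -> int:
    """Scale the row count so one row group holds at most ~100MB uncompressed."""
    if num_bytes <= 0:
        return 1
    return max(1, num_rows * MAX_ROW_GROUP_BYTES // num_bytes)


# 10,000 rows totalling 5GB (about 500KB per row): 200 rows per group.
print(rows_per_row_group(10_000, 5 * 10**9))
# 1,000,000 rows totalling 200MB (about 200 bytes per row): 500,000 rows per group.
print(rows_per_row_group(1_000_000, 2 * 10**8))
# 3 rows totalling 3GB: each row alone exceeds the target, so the floor of 1 applies.
print(rows_per_row_group(3, 3 * 10**9))
```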

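Additionally, the OptimizedTypedSequence class narrows the integer type of well-known tokenizer columns when the caller specifies neither type nor try_type. attention_mask and special_tokens_mask are cast to int8 (binary tensors), input_ids to int32 (vocabulary sizes are typically below 50k and never exceed 1M), and token_type_ids to int8 (a binary mask, although some models such as XLNetModel use an additional value of 2). If the data does not fit the narrower type, the writer falls back to int64.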
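
The narrowing-with-fallback idea can be sketched with a plain range check. This is not how the library performs the cast; it only models the decision, and `choose_int_type`, `INT_RANGES`, and `PREFERRED_INT_TYPE` are hypothetical names.

```python
from typing import List

# Inclusive value ranges of the signed integer widths used in this sketch.
INT_RANGES = {
    "int8": (-(2**7), 2**7 - 1),
    "int32": (-(2**31), 2**31 - 1),
}

# Column-to-type hints, matching the mapping shown in the Code Evidence section.
PREFERRED_INT_TYPE = {
    "attention_mask": "int8",
    "special_tokens_mask": "int8",
    "input_ids": "int32",
    "token_type_ids": "int8",
}


def choose_int_type(col: str, values: List[int]) -> str:
    """Pick the narrow type for a known column if every value fits, else int64."""
    preferred = PREFERRED_INT_TYPE.get(col)
    if preferred is None:
        return "int64"  # unknown column: no narrowing attempted
    low, high = INT_RANGES[preferred]
    if all(low <= v <= high for v in values):
        return preferred
    return "int64"  # out-of-range data: fall back to the wide type


print(choose_int_type("attention_mask", [0, 1, 1]))
print(choose_int_type("input_ids", [101, 30521, 102]))
print(choose_int_type("input_ids", [2**31]))
```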

Usage

This optimization activates automatically during any operation that writes Arrow or Parquet files:

  • download_and_prepare(): Dataset builders set DEFAULT_WRITER_BATCH_SIZE = None, which delegates batch size selection to ArrowWriter.
  • Dataset.map(): When mapping over a dataset produces new Arrow files, the writer inherits the adaptive batch size.
  • Dataset.push_to_hub(): Parquet shard files are written with row group sizes controlled by get_writer_batch_size_from_data_size() and feature-type row group overrides.
  • Dataset.save_to_disk(): Arrow files written to local storage use the same batch size logic.

Users can override the behavior by explicitly setting writer_batch_size in builder configs or as a parameter to map().

The Insight (Rule of Thumb)

  • Action: Let the library auto-select batch sizes by leaving writer_batch_size=None (the default).
  • Value: Video=10, Image/Audio/Binary=100, Text/Numeric=1000. Parquet row groups target 100MB, shards target 500MB.
  • Trade-off: Smaller batch sizes for media data types reduce peak memory usage at the cost of slightly more I/O operations. Larger row groups improve sequential read throughput but increase memory cost for random access.

Code Evidence

Batch Size Selection by Feature Type

From src/datasets/arrow_writer.py lines 54-90:

def get_arrow_writer_batch_size_from_features(features: Optional[Features]) -> Optional[int]:
    """
    Get the writer_batch_size that defines the maximum record batch size in the arrow files based on configuration values.
    The default value is 100 for image/audio datasets and 10 for videos.
    This allows to avoid overflows in arrow buffers.
    """
    if not features:
        return None

    batch_size = np.inf

    def set_batch_size(feature: FeatureType) -> None:
        nonlocal batch_size
        if isinstance(feature, Image) and config.ARROW_RECORD_BATCH_SIZE_FOR_IMAGE_DATASETS is not None:
            batch_size = min(batch_size, config.ARROW_RECORD_BATCH_SIZE_FOR_IMAGE_DATASETS)
        elif isinstance(feature, Audio) and config.ARROW_RECORD_BATCH_SIZE_FOR_AUDIO_DATASETS is not None:
            batch_size = min(batch_size, config.ARROW_RECORD_BATCH_SIZE_FOR_AUDIO_DATASETS)
        elif isinstance(feature, Video) and config.ARROW_RECORD_BATCH_SIZE_FOR_VIDEO_DATASETS is not None:
            batch_size = min(batch_size, config.ARROW_RECORD_BATCH_SIZE_FOR_VIDEO_DATASETS)
        elif (
            isinstance(feature, Value)
            and feature.dtype == "binary"
            and config.ARROW_RECORD_BATCH_SIZE_FOR_BINARY_DATASETS is not None
        ):
            batch_size = min(batch_size, config.ARROW_RECORD_BATCH_SIZE_FOR_BINARY_DATASETS)

    _visit(features, set_batch_size)

    return None if batch_size is np.inf else batch_size

Configuration Constants

From src/datasets/config.py lines 186-209:

# Batch size constants. For more info, see:
# https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations)
DEFAULT_MAX_BATCH_SIZE = 1000

# Max uncompressed shard size in bytes (e.g. to shard parquet datasets in push_to_hub or download_and_prepare)
MAX_SHARD_SIZE = "500MB"

# Max uncompressed row group size in bytes (e.g. for parquet files in push_to_hub or download_and_prepare)
MAX_ROW_GROUP_SIZE = "100MB"

ARROW_RECORD_BATCH_SIZE_FOR_AUDIO_DATASETS = 100
ARROW_RECORD_BATCH_SIZE_FOR_IMAGE_DATASETS = 100
ARROW_RECORD_BATCH_SIZE_FOR_BINARY_DATASETS = 100
ARROW_RECORD_BATCH_SIZE_FOR_VIDEO_DATASETS = 10

ArrowWriter Fallback Chain

From src/datasets/arrow_writer.py lines 452-456:

self.writer_batch_size = (
    writer_batch_size
    or get_arrow_writer_batch_size_from_features(self._features)
    or config.DEFAULT_MAX_BATCH_SIZE
)

Builder Default

From src/datasets/builder.py lines 294-298:

# Default batch size used by the ArrowWriter
# It defines the number of samples that are kept in memory before writing them
# and also the length of the arrow chunks
# None means that the ArrowWriter will use its default value
DEFAULT_WRITER_BATCH_SIZE = None

Parquet Row Group Sizing

From src/datasets/arrow_writer.py lines 133-154:

def get_writer_batch_size_from_data_size(num_rows: int, num_bytes: int) -> int:
    """
    Get the writer_batch_size that defines the maximum row group size in the parquet files.
    The default in `datasets` is aiming for row groups of maximum 100MB uncompressed.
    This allows to optimize random access to parquet file, since accessing 1 row requires
    to read its entire row group.

    This can be improved to get optimized size for querying/iterating
    but at least it matches the dataset viewer expectations on HF.
    """
    return max(1, num_rows * convert_file_size_to_int(config.MAX_ROW_GROUP_SIZE) // num_bytes) if num_bytes > 0 else 1

Integer Type Optimization

From src/datasets/arrow_writer.py lines 384-403:

class OptimizedTypedSequence(TypedSequence):
    def __init__(
        self,
        data,
        type: Optional[FeatureType] = None,
        try_type: Optional[FeatureType] = None,
        col: Optional[str] = None,
        optimized_int_type: Optional[FeatureType] = None,
    ):
        optimized_int_type_by_col = {
            "attention_mask": Value("int8"),  # binary tensor
            "special_tokens_mask": Value("int8"),
            "input_ids": Value("int32"),  # typical vocab size: 0-50k (max ~500k, never > 1M)
            "token_type_ids": Value(
                "int8"
            ),  # binary mask; some (XLNetModel) use an additional token represented by a 2
        }
        if type is None and try_type is None:
            optimized_int_type = optimized_int_type_by_col.get(col, None)
        super().__init__(data, type=type, try_type=try_type, optimized_int_type=optimized_int_type)

Reasoning

The tiered batch size approach exists because media data (images, audio, video) consumes dramatically more memory per example than text or numeric data. A single high-resolution image can be several megabytes, and a video clip can be tens or hundreds of megabytes. Apache Arrow stores each column of a record batch in contiguous buffers, and binary and string columns index their data with 32-bit offsets, which caps a single column buffer at about 2GB. Writing 1000 videos into a single record batch would easily exceed this limit and cause an overflow error such as ArrowInvalid.

By reducing the batch size to 10 for video and 100 for images/audio, the library keeps each record batch well under the 2GB Arrow buffer ceiling. The min() selection logic ensures that mixed-type datasets (e.g., a dataset with both text and image columns) use the most conservative batch size, since the largest column dictates peak memory usage.
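
A back-of-the-envelope check shows why the tiers are spaced this way. The 2GB ceiling comes from the text above; the per-example sizes (50MB video, 3MB image, 2KB text) are illustrative assumptions, and the helper name `fits_in_one_buffer` is hypothetical.

```python
ARROW_OFFSET_LIMIT_BYTES = 2**31 - 1  # largest value a signed 32-bit offset can hold


def fits_in_one_buffer(batch_size: int, bytes_per_example: int) -> bool:
    """Return True if a batch's raw payload stays within the ~2GB ceiling."""
    return batch_size * bytes_per_example <= ARROW_OFFSET_LIMIT_BYTES


MB = 10**6

# Assumed 50MB per video: 1000 examples need 50GB, 10 examples need 500MB.
print(fits_in_one_buffer(1000, 50 * MB), fits_in_one_buffer(10, 50 * MB))
# Assumed 3MB per image: 1000 examples need 3GB, 100 examples need 300MB.
print(fits_in_one_buffer(1000, 3 * MB), fits_in_one_buffer(100, 3 * MB))
# Assumed 2KB per text example: 1000 examples need only 2MB.
print(fits_in_one_buffer(1000, 2_000))
```

Under these assumptions, the default of 1000 is only safe for the lightweight case, while 100 and 10 leave ample headroom for images and video respectively.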

The Parquet row group sizing targets 100MB uncompressed because row groups are the unit of random access in Parquet files. When any single row is read, its entire row group must be loaded. Smaller row groups reduce the I/O amplification of random access reads, which is critical for the HuggingFace Dataset Viewer that serves individual examples on demand. The 500MB shard size limit controls the maximum size of individual Parquet files pushed to the Hub, balancing download granularity against file count.

The integer type optimization in OptimizedTypedSequence is a complementary memory optimization targeting NLP workloads. Columns like attention_mask only contain 0s and 1s, so storing them as int8 instead of int64 reduces memory by 8x. The fallback to int64 on "not in range" errors ensures correctness is never sacrificed for efficiency.
