Heuristic: Hugging Face Optimum Dummy Input Shape Defaults
| Knowledge Sources | |
|---|---|
| Domains | Model_Export, Optimization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Default dummy input shapes for model export: batch_size=2, sequence_length=16, 64x64 RGB images, and 16000 audio samples with 80 mel bins, plus model-specific overrides for SpeechT5 (batch_size forced to 1) and Musicgen (sequence_length reused for the guidance-scale input because pad tokens are filtered out upstream).
Description
The Optimum model export system generates dummy inputs to trace model computation graphs. These inputs use carefully chosen default shapes defined in `DEFAULT_DUMMY_SHAPES`. The shapes are intentionally small (e.g., sequence_length=16 instead of 512) to minimize memory during tracing while still exercising all dynamic dimensions. Several model architectures require special overrides due to framework limitations or architectural peculiarities.
Usage
Apply this heuristic when debugging model export failures or shape mismatch errors. If a model export fails with unexpected shapes, check whether the model requires custom dummy input shapes. Also use these defaults when writing new exporter configs or testing export compatibility.
The Insight (Rule of Thumb)
- Action: Use the `DEFAULT_DUMMY_SHAPES` dictionary as the baseline for all export dummy inputs.
- Value:
- `batch_size`: 2 (tests batching behavior)
- `sequence_length`: 16 (minimal but exercises dynamic dims)
- `num_choices`: 4 (for multiple-choice tasks)
- `width` / `height`: 64 (small images for fast tracing)
- `num_channels`: 3 (RGB only; grayscale not yet supported)
- `feature_size`: 80 (mel spectrogram bins)
- `audio_sequence_length`: 16000 (1 second at 16kHz)
- `nb_max_frames`: 3000 (audio frame limit)
- Trade-off: Smaller shapes trace faster but may miss shape-dependent edge cases. Larger shapes are more thorough but use more memory.
Model-specific overrides:
- SpeechT5: `batch_size` hardcoded to 1 because Transformers does not support batch inference for SpeechT5. The spectrogram first axis length (20) is arbitrary and dynamic.
- Musicgen: Reuses `sequence_length` as a hack for the guidance-scale input, because pad tokens are filtered out upstream.
- Vision models: Both `image_size` and `input_size` attributes are checked, with fallback handling for tuple vs list formats.
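The layering of defaults and overrides can be sketched as a plain dict merge. This is illustrative only: in Optimum the real dummy input generators set attributes like `self.batch_size = 1` in their constructors rather than merging dicts, and `MODEL_OVERRIDES` / `resolve_shapes` below are hypothetical names.

```python
# Baseline shapes, copied from optimum/utils/input_generators.py.
DEFAULT_DUMMY_SHAPES = {
    "batch_size": 2,
    "sequence_length": 16,
    "num_choices": 4,
    "width": 64,
    "height": 64,
    "num_channels": 3,
    "feature_size": 80,
    "nb_max_frames": 3000,
    "audio_sequence_length": 16000,
}

# Hypothetical per-model override table; Optimum applies these inside
# each generator subclass, not via a lookup table like this.
MODEL_OVERRIDES = {
    "speecht5": {"batch_size": 1},  # no batch inference in Transformers
}

def resolve_shapes(model_type: str) -> dict:
    """Merge model-specific overrides on top of the defaults."""
    return {**DEFAULT_DUMMY_SHAPES, **MODEL_OVERRIDES.get(model_type, {})}

print(resolve_shapes("speecht5")["batch_size"])  # forced to 1
print(resolve_shapes("bert")["batch_size"])      # default of 2
```

The merge order matters: overrides must come last so they win over the defaults.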
Reasoning
The default values represent a practical balance between:
- Memory efficiency: Small shapes minimize VRAM during tracing (important for CI/CD and resource-constrained environments).
- Dynamic axis coverage: Non-trivial values (e.g., batch_size=2 instead of 1) ensure the tracer correctly identifies dynamic dimensions.
- Standard conventions: Audio at 16kHz (16000 samples/sec), 80 mel bins, and 3-channel RGB are industry standards.
The model-specific overrides exist because certain models have framework-level limitations that cannot be worked around generically.
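As a back-of-the-envelope check on the memory argument, compare the size of a default-sized text dummy input against a production-sized one (pure arithmetic, assuming int64 token ids; the production shape of 32x512 is an illustrative assumption, not from the source):

```python
BYTES_PER_TOKEN = 8  # int64 input_ids

def dummy_bytes(batch_size: int, sequence_length: int) -> int:
    """Size in bytes of a (batch_size, sequence_length) int64 tensor."""
    return batch_size * sequence_length * BYTES_PER_TOKEN

default_cost = dummy_bytes(2, 16)       # Optimum's defaults
production_cost = dummy_bytes(32, 512)  # a typical serving shape

print(default_cost, production_cost, production_cost // default_cost)
```

The input alone is 512x smaller at the defaults, and activation memory (which scales with sequence length, quadratically for attention) shrinks even more.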
Code evidence from `optimum/utils/input_generators.py:44-59`:

```python
DEFAULT_DUMMY_SHAPES = {
    "batch_size": 2,
    "sequence_length": 16,
    "num_choices": 4,
    # image
    "width": 64,
    "height": 64,
    "num_channels": 3,
    "point_batch_size": 3,
    "nb_points_per_image": 2,
    "visual_seq_length": 16,
    # audio
    "feature_size": 80,
    "nb_max_frames": 3000,
    "audio_sequence_length": 16000,
}
```
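For illustration, the dictionary above can be turned into concrete dummy tensors. This is a simplified sketch of what the dummy input generators produce; `make_dummy_inputs` is a hypothetical helper, and the real generators additionally handle frameworks, dtypes, and many more input kinds.

```python
import numpy as np

# Subset of DEFAULT_DUMMY_SHAPES (copied from the dict above).
shapes = {
    "batch_size": 2,
    "sequence_length": 16,
    "num_channels": 3,
    "height": 64,
    "width": 64,
    "feature_size": 80,
    "nb_max_frames": 3000,
}

def make_dummy_inputs(shapes: dict, vocab_size: int = 32) -> dict:
    """Build random dummy tensors from a shape dict (hypothetical helper)."""
    rng = np.random.default_rng(0)
    return {
        # text: (batch_size, sequence_length) token ids
        "input_ids": rng.integers(
            0, vocab_size,
            size=(shapes["batch_size"], shapes["sequence_length"]),
            dtype=np.int64,
        ),
        # vision: (batch_size, num_channels, height, width) pixels in [0, 1)
        "pixel_values": rng.random(
            (shapes["batch_size"], shapes["num_channels"],
             shapes["height"], shapes["width"]),
            dtype=np.float32,
        ),
        # audio: (batch_size, feature_size, nb_max_frames) mel features
        "input_features": rng.random(
            (shapes["batch_size"], shapes["feature_size"],
             shapes["nb_max_frames"]),
            dtype=np.float32,
        ),
    }

dummy = make_dummy_inputs(shapes)
print({name: arr.shape for name, arr in dummy.items()})
```

Note that batch_size=2 shows up in the leading axis of every tensor, which is what lets a tracer mark that axis as dynamic.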
SpeechT5 batch limitation from `optimum/utils/input_generators.py:1405`:

```python
self.batch_size = 1  # TODO: SpeechT5 does not support batch inference in Transformers for now.
```

Musicgen pad token hack from `optimum/utils/input_generators.py:1558`:

```python
# Kind of a hack to use `self.sequence_length` here, for Musicgen pad tokens are filtered out
```

Spectrogram magic number from `optimum/utils/input_generators.py:1417`:

```python
shape = [20, self.num_mel_bins]  # NOTE: the first axis length is arbitrary and dynamic
```