Heuristic: Hugging Face Optimum Dummy Input Shape Defaults
| Knowledge Sources | |
|---|---|
| Domains | Model_Export, Optimization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Default dummy input shapes for model export: batch_size=2, sequence_length=16, 64x64 RGB images, and 16000 audio samples with 80 mel bins, plus model-specific overrides for SpeechT5 (batch_size forced to 1) and Musicgen (sequence_length reused for the guidance-scale input because pad tokens are filtered out upstream).
Description
The Optimum model export system generates dummy inputs to trace model computation graphs. These inputs use carefully chosen default shapes defined in `DEFAULT_DUMMY_SHAPES`. The shapes are intentionally small (e.g., sequence_length=16 instead of 512) to minimize memory during tracing while still exercising all dynamic dimensions. Several model architectures require special overrides due to framework limitations or architectural peculiarities.
Usage
Apply this heuristic when debugging model export failures or shape mismatch errors. If a model export fails with unexpected shapes, check whether the model requires custom dummy input shapes. Also use these defaults when writing new exporter configs or testing export compatibility.
The Insight (Rule of Thumb)
- Action: Use the `DEFAULT_DUMMY_SHAPES` dictionary as the baseline for all export dummy inputs.
- Value:
- `batch_size`: 2 (tests batching behavior)
- `sequence_length`: 16 (minimal but exercises dynamic dims)
- `num_choices`: 4 (for multiple-choice tasks)
- `width` / `height`: 64 (small images for fast tracing)
- `num_channels`: 3 (RGB only; grayscale not yet supported)
- `feature_size`: 80 (mel spectrogram bins)
- `audio_sequence_length`: 16000 (1 second at 16kHz)
- `nb_max_frames`: 3000 (audio frame limit)
- Trade-off: Smaller shapes trace faster but may miss shape-dependent edge cases. Larger shapes are more thorough but use more memory.
Model-specific overrides:
- SpeechT5: `batch_size` hardcoded to 1 because Transformers does not support batch inference for SpeechT5. The spectrogram first axis length (20) is arbitrary and dynamic.
- Musicgen: Reuses `sequence_length` as a hack for the guidance-scale input, because pad tokens are filtered out upstream.
- Vision models: Both `image_size` and `input_size` attributes are checked, with fallback handling for tuple vs list formats.
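The layering of defaults and overrides can be sketched as a plain dict merge. This is illustrative only: in Optimum the real dummy input generators set attributes like `self.batch_size = 1` in their constructors rather than merging dicts, and `MODEL_OVERRIDES` / `resolve_shapes` below are hypothetical names.

```python
# Baseline shapes, copied from optimum/utils/input_generators.py.
DEFAULT_DUMMY_SHAPES = {
    "batch_size": 2,
    "sequence_length": 16,
    "num_choices": 4,
    "width": 64,
    "height": 64,
    "num_channels": 3,
    "feature_size": 80,
    "nb_max_frames": 3000,
    "audio_sequence_length": 16000,
}

# Hypothetical per-model override table; Optimum applies these inside
# each generator subclass, not via a lookup table like this.
MODEL_OVERRIDES = {
    "speecht5": {"batch_size": 1},  # no batch inference in Transformers
}

def resolve_shapes(model_type: str) -> dict:
    """Merge model-specific overrides on top of the defaults."""
    return {**DEFAULT_DUMMY_SHAPES, **MODEL_OVERRIDES.get(model_type, {})}

print(resolve_shapes("speecht5")["batch_size"])  # forced to 1
print(resolve_shapes("bert")["batch_size"])      # default of 2
```

The merge order matters: overrides must come last so they win over the defaults.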
Reasoning
The default values represent a practical balance between:
- Memory efficiency: Small shapes minimize VRAM during tracing (important for CI/CD and resource-constrained environments).
- Dynamic axis coverage: Non-trivial values (e.g., batch_size=2 instead of 1) ensure the tracer correctly identifies dynamic dimensions.
- Standard conventions: Audio at 16kHz (16000 samples/sec), 80 mel bins, and 3-channel RGB are industry standards.
The model-specific overrides exist because certain models have framework-level limitations that cannot be worked around generically.
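As a back-of-the-envelope check on the memory argument, compare the size of a default-sized text dummy input against a production-sized one (pure arithmetic, assuming int64 token ids; the production shape of 32x512 is an illustrative assumption, not from the source):

```python
BYTES_PER_TOKEN = 8  # int64 input_ids

def dummy_bytes(batch_size: int, sequence_length: int) -> int:
    """Size in bytes of a (batch_size, sequence_length) int64 tensor."""
    return batch_size * sequence_length * BYTES_PER_TOKEN

default_cost = dummy_bytes(2, 16)       # Optimum's defaults
production_cost = dummy_bytes(32, 512)  # a typical serving shape

print(default_cost, production_cost, production_cost // default_cost)
```

The input alone is 512x smaller at the defaults, and activation memory (which scales with sequence length, quadratically for attention) shrinks even more.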
Code evidence from `optimum/utils/input_generators.py:44-59`:

```python
DEFAULT_DUMMY_SHAPES = {
    "batch_size": 2,
    "sequence_length": 16,
    "num_choices": 4,
    # image
    "width": 64,
    "height": 64,
    "num_channels": 3,
    "point_batch_size": 3,
    "nb_points_per_image": 2,
    "visual_seq_length": 16,
    # audio
    "feature_size": 80,
    "nb_max_frames": 3000,
    "audio_sequence_length": 16000,
}
```
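For illustration, the dictionary above can be turned into concrete dummy tensors. This is a simplified sketch of what the dummy input generators produce; `make_dummy_inputs` is a hypothetical helper, and the real generators additionally handle frameworks, dtypes, and many more input kinds.

```python
import numpy as np

# Subset of DEFAULT_DUMMY_SHAPES (copied from the dict above).
shapes = {
    "batch_size": 2,
    "sequence_length": 16,
    "num_channels": 3,
    "height": 64,
    "width": 64,
    "feature_size": 80,
    "nb_max_frames": 3000,
}

def make_dummy_inputs(shapes: dict, vocab_size: int = 32) -> dict:
    """Build random dummy tensors from a shape dict (hypothetical helper)."""
    rng = np.random.default_rng(0)
    return {
        # text: (batch_size, sequence_length) token ids
        "input_ids": rng.integers(
            0, vocab_size,
            size=(shapes["batch_size"], shapes["sequence_length"]),
            dtype=np.int64,
        ),
        # vision: (batch_size, num_channels, height, width) pixels in [0, 1)
        "pixel_values": rng.random(
            (shapes["batch_size"], shapes["num_channels"],
             shapes["height"], shapes["width"]),
            dtype=np.float32,
        ),
        # audio: (batch_size, feature_size, nb_max_frames) mel features
        "input_features": rng.random(
            (shapes["batch_size"], shapes["feature_size"],
             shapes["nb_max_frames"]),
            dtype=np.float32,
        ),
    }

dummy = make_dummy_inputs(shapes)
print({name: arr.shape for name, arr in dummy.items()})
```

Note that batch_size=2 shows up in the leading axis of every tensor, which is what lets a tracer mark that axis as dynamic.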
SpeechT5 batch limitation from `optimum/utils/input_generators.py:1405`:

```python
self.batch_size = 1  # TODO: SpeechT5 does not support batch inference in Transformers for now.
```

Musicgen pad token hack from `optimum/utils/input_generators.py:1558`:

```python
# Kind of a hack to use `self.sequence_length` here, for Musicgen pad tokens are filtered out
```

Spectrogram magic number from `optimum/utils/input_generators.py:1417`:

```python
shape = [20, self.num_mel_bins]  # NOTE: the first axis length is arbitrary and dynamic
```