Implementation:Datajuicer Data juicer ImageShapeFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on image dimensions (width and height).
Description
ImageShapeFilter is a filter operator that keeps samples whose image shapes (width, height) fall within configured ranges. It checks both the width and the height of each image against configurable minimum and maximum thresholds. The computed dimensions are stored in the sample's stats under the image_width and image_height keys. The operator supports two strategies for samples with multiple images: 'any' (keep the sample if at least one image meets both the width and height criteria) and 'all' (keep the sample only if every image meets both criteria). It extends the Filter base class and implements the two-phase compute_stats/process pattern: statistics are computed first, and the keep-or-drop decision is then made from those statistics.
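The 'any' and 'all' strategies described above can be illustrated with a minimal, standalone sketch. This is not the Data-Juicer implementation: the function name `keep_sample` is hypothetical, and two behaviours are assumptions made for illustration only (bounds are treated as inclusive, and a sample with no images is kept).

```python
import sys
from typing import Sequence


def keep_sample(
    widths: Sequence[int],
    heights: Sequence[int],
    min_width: int = 1,
    max_width: int = sys.maxsize,
    min_height: int = 1,
    max_height: int = sys.maxsize,
    any_or_all: str = "any",
) -> bool:
    """Decide whether a sample passes, given per-image widths and heights."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"unsupported strategy: {any_or_all!r}")
    # Assumption: a sample without images has nothing to reject, so keep it.
    if len(widths) == 0:
        return True
    # Assumption: bounds are inclusive on both ends.
    # An image passes only if BOTH its width and its height are in range.
    per_image = [
        min_width <= w <= max_width and min_height <= h <= max_height
        for w, h in zip(widths, heights)
    ]
    return any(per_image) if any_or_all == "any" else all(per_image)
```

For a sample holding a 64x64 image and an 800x600 image, `keep_sample([64, 800], [64, 600], min_width=100, min_height=100)` keeps the sample under 'any' because the second image qualifies, while the same call with `any_or_all="all"` drops it because the first image is too small.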
Usage
Import this operator when you need to filter dataset samples based on the pixel dimensions of images. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/image_shape_filter.py
- Lines: 1-101
Signature
@OPERATORS.register_module("image_shape_filter")
@LOADED_IMAGES.register_module("image_shape_filter")
class ImageShapeFilter(Filter):
    def __init__(
        self,
        min_width: int = 1,
        max_width: int = sys.maxsize,
        min_height: int = 1,
        max_height: int = sys.maxsize,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.image_shape_filter import ImageShapeFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_width | int | No | Minimum image width, in pixels, for an image to pass. Default: 1 |
| max_width | int | No | Maximum image width, in pixels, for an image to pass. Default: sys.maxsize |
| min_height | int | No | Minimum image height, in pixels, for an image to pass. Default: 1 |
| max_height | int | No | Maximum image height, in pixels, for an image to pass. Default: sys.maxsize |
| any_or_all | str | No | Keep strategy across the images of a sample: 'any' or 'all'. Default: "any" |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (image_width, image_height) |
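To make the "stats field updated" output concrete, here is a hedged sketch of the two-phase pattern using plain dictionaries. Everything in it is an assumption for illustration: the `"stats"` field name, the helper names `compute_stats` and `process`, and the explicit `sizes` argument (the real operator reads the dimensions from the images themselves). Only the `image_width` and `image_height` key names come from the description above.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical field name for this sketch; Data-Juicer defines its own constants.
STATS = "stats"
WIDTH_KEY = "image_width"
HEIGHT_KEY = "image_height"


def compute_stats(
    sample: Dict[str, Any], sizes: List[Tuple[int, int]]
) -> Dict[str, Any]:
    """Phase 1: record the (width, height) of every image in the sample's stats."""
    stats = sample.setdefault(STATS, {})
    # Skip the work if an earlier pass already recorded the dimensions.
    if WIDTH_KEY in stats and HEIGHT_KEY in stats:
        return sample
    stats[WIDTH_KEY] = [w for w, _ in sizes]
    stats[HEIGHT_KEY] = [h for _, h in sizes]
    return sample


def process(sample: Dict[str, Any], min_side: int = 100, max_side: int = 4096) -> bool:
    """Phase 2: decide keep/drop from the cached stats only ('any' strategy)."""
    widths = sample[STATS][WIDTH_KEY]
    heights = sample[STATS][HEIGHT_KEY]
    # Assumption: a sample without images is kept.
    if not widths:
        return True
    return any(
        min_side <= w <= max_side and min_side <= h <= max_side
        for w, h in zip(widths, heights)
    )


sample = compute_stats({"images": ["a.jpg", "b.jpg"]}, [(64, 64), (800, 600)])
keep = process(sample)
```

Separating the two phases means the dimensions are measured once and stay attached to the sample, so the thresholds can be changed and re-applied without touching the images again.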
Usage Examples
YAML Configuration
process:
  - image_shape_filter:
      min_width: 100
      max_width: 4096
      min_height: 100
      max_height: 4096
      any_or_all: "any"
Python API
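from data_juicer.ops.filter.image_shape_filter import ImageShapeFilter

# Keep samples where at least one image is 100-4096 px wide and 100-4096 px tall
# (any_or_all defaults to "any").
op = ImageShapeFilter(min_width=100, max_width=4096, min_height=100, max_height=4096)
# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)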