Implementation:Datajuicer Data juicer ImageShapeFilter

From Leeroopedia
Knowledge Sources
Domains: Data_Quality, Filtering
Last Updated: 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for filtering data samples by image dimensions (width and height).

Description

ImageShapeFilter is a filter operator that keeps samples whose images have a width and height within configured ranges. For each image, it checks the width against min_width and max_width and the height against min_height and max_height. The computed dimensions are stored under the image_width and image_height stats keys. When a sample contains multiple images, the any_or_all parameter sets the keep strategy: 'any' keeps the sample if at least one image satisfies both the width and height criteria, and 'all' keeps it only if every image does. The operator extends the Filter base class and implements the two-phase compute_stats/process pattern.
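
The two-phase pattern and the 'any'/'all' strategies described above can be illustrated with a small standalone sketch. This is not the Data-Juicer source: the functions compute_stats and keep_sample, the plain "stats" dictionary, and the sizes argument are hypothetical stand-ins. The sketch also assumes that the bounds are inclusive and that a sample without images is kept; check both against the actual operator before relying on them.

```python
import sys
from typing import Dict, List, Tuple

# Stats key names taken from the description; the "stats" container is an assumption.
WIDTH_KEY = "image_width"
HEIGHT_KEY = "image_height"


def compute_stats(sample: Dict, sizes: List[Tuple[int, int]]) -> Dict:
    """Phase 1: record each image's (width, height) in the sample's stats.

    `sizes` stands in for the dimensions that would be read from real images.
    """
    stats = sample.setdefault("stats", {})
    # Skip the work if the stats were already computed.
    if WIDTH_KEY not in stats or HEIGHT_KEY not in stats:
        stats[WIDTH_KEY] = [w for w, _ in sizes]
        stats[HEIGHT_KEY] = [h for _, h in sizes]
    return sample


def keep_sample(
    sample: Dict,
    min_width: int = 1,
    max_width: int = sys.maxsize,
    min_height: int = 1,
    max_height: int = sys.maxsize,
    any_or_all: str = "any",
) -> bool:
    """Phase 2: decide whether to keep the sample from its stored stats."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"Unsupported keep strategy: {any_or_all!r}")
    widths = sample["stats"][WIDTH_KEY]
    heights = sample["stats"][HEIGHT_KEY]
    # An image passes only if BOTH its width and its height are in range
    # (inclusive bounds are an assumption of this sketch).
    in_range = [
        min_width <= w <= max_width and min_height <= h <= max_height
        for w, h in zip(widths, heights)
    ]
    if not in_range:
        return True  # assumption: samples with no images are kept
    return any(in_range) if any_or_all == "any" else all(in_range)


# One large image and one thumbnail in the same sample.
sample = compute_stats({}, sizes=[(640, 480), (50, 50)])
print(keep_sample(sample, min_width=100, min_height=100, any_or_all="any"))  # True
print(keep_sample(sample, min_width=100, min_height=100, any_or_all="all"))  # False
```

With 'any', the 640x480 image is enough to keep the sample; with 'all', the 50x50 thumbnail causes it to be dropped.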

Usage

Import this operator when you need to filter dataset samples based on the pixel dimensions of images. Configure it in your Data-Juicer YAML config or instantiate directly.

Code Reference

Source Location

Signature

@OPERATORS.register_module("image_shape_filter")
@LOADED_IMAGES.register_module("image_shape_filter")
class ImageShapeFilter(Filter):
    def __init__(
        self,
        min_width: int = 1,
        max_width: int = sys.maxsize,
        min_height: int = 1,
        max_height: int = sys.maxsize,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.image_shape_filter import ImageShapeFilter

I/O Contract

Inputs

Name | Type | Required | Description
min_width | int | No | Minimum image width, in pixels, for a sample to be kept. Default: 1
max_width | int | No | Maximum image width, in pixels, for a sample to be kept. Default: sys.maxsize
min_height | int | No | Minimum image height, in pixels, for a sample to be kept. Default: 1
max_height | int | No | Maximum image height, in pixels, for a sample to be kept. Default: sys.maxsize
any_or_all | str | No | Keep strategy across the images of a sample: 'any' or 'all'. Default: "any"

Outputs

Name | Type | Description
samples | Dict | Filtered samples with the stats field updated (image_width, image_height)

Usage Examples

YAML Configuration

process:
  - image_shape_filter:
      min_width: 100
      max_width: 4096
      min_height: 100
      max_height: 4096
      any_or_all: "any"

Python API

from data_juicer.ops.filter.image_shape_filter import ImageShapeFilter

op = ImageShapeFilter(min_width=100, max_width=4096, min_height=100, max_height=4096)
# Apply the operator to a dataset (`dataset` is assumed to be an already-loaded Data-Juicer dataset)
result = dataset.process(op)
