Implementation:Datajuicer Data juicer ImageFaceCountFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer tool for filtering data samples by the number of faces detected in their images.
Description
ImageFaceCountFilter is a filter operator that keeps samples whose images contain a number of faces within a specified range. It detects faces with an OpenCV Haar cascade classifier (default: haarcascade_frontalface_alt.xml) and caches the per-image counts under the face_counts stats key. For samples with several images, the operator supports two strategies: 'any' (keep the sample if at least one image meets the condition) and 'all' (keep the sample only if every image meets the condition). It extends the Filter base class and implements the two-phase compute_stats/process pattern.
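The keep/drop decision described above can be sketched in plain Python. This is a minimal illustration and not the operator's source: `keep_sample` is a hypothetical helper, the bounds are assumed to be inclusive (the defaults of 1 and 1 would otherwise keep nothing), and keeping a sample that has no images is also an assumption.

```python
from typing import Sequence


def keep_sample(
    face_counts: Sequence[int],
    min_face_count: int = 1,
    max_face_count: int = 1,
    any_or_all: str = "any",
) -> bool:
    """Hypothetical helper mirroring the range check and the any/all strategy."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    # Assumption: a sample without images has nothing to check and is kept.
    if len(face_counts) == 0:
        return True
    # Assumption: both bounds are inclusive.
    in_range = [min_face_count <= n <= max_face_count for n in face_counts]
    return any(in_range) if any_or_all == "any" else all(in_range)
```

With the default range, a sample whose images contain 1 and 3 faces would be kept under 'any' and dropped under 'all'.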
Usage
Import this operator when you need to filter dataset samples based on the number of human faces detected in images. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/image_face_count_filter.py
- Lines: 1-126
Signature
@UNFORKABLE.register_module("image_face_count_filter")
@OPERATORS.register_module("image_face_count_filter")
@LOADED_IMAGES.register_module("image_face_count_filter")
class ImageFaceCountFilter(Filter):
    def __init__(
        self,
        cv_classifier: str = "",
        min_face_count: int = 1,
        max_face_count: int = 1,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.image_face_count_filter import ImageFaceCountFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cv_classifier | str | No | OpenCV classifier path for face detection. Default: "" (falls back to haarcascade_frontalface_alt.xml) |
| min_face_count | int | No | Minimum number of faces required for samples. Default: 1 |
| max_face_count | int | No | Maximum number of faces allowed for samples. Default: 1 |
| any_or_all | str | No | Keep strategy: 'any' or 'all' across images. Default: "any" |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (face_counts) |
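To make the contract more concrete, here is a rough sketch of how the two phases could interact with the face_counts stats key. The `"stats"` and `"images"` field names, the `count_faces` callable, and both function bodies are assumptions for illustration only; the real operator runs an OpenCV Haar cascade and uses Data-Juicer's own field constants.

```python
from typing import Any, Callable, Dict

STATS_FIELD = "stats"            # assumed name of the per-sample stats field
FACE_COUNTS_KEY = "face_counts"  # stats key named in this page


def compute_stats(
    sample: Dict[str, Any], count_faces: Callable[[Any], int]
) -> Dict[str, Any]:
    """Phase 1: count faces once per image and cache the result."""
    stats = sample.setdefault(STATS_FIELD, {})
    if FACE_COUNTS_KEY in stats:
        return sample  # counts already cached, skip detection
    images = sample.get("images", [])  # assumed image field name
    stats[FACE_COUNTS_KEY] = [count_faces(img) for img in images]
    return sample


def process(
    sample: Dict[str, Any], min_face_count: int = 1, max_face_count: int = 1
) -> bool:
    """Phase 2: decide from the cached counts (the 'any' strategy only)."""
    counts = sample[STATS_FIELD][FACE_COUNTS_KEY]
    if not counts:
        return True  # assumption: samples without images are kept
    return any(min_face_count <= n <= max_face_count for n in counts)
```

Separating the phases this way means the expensive detection step runs once, and the cached counts can be reused if the thresholds are later changed.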
Usage Examples
YAML Configuration
process:
  - image_face_count_filter:
      min_face_count: 1
      max_face_count: 1
      any_or_all: "any"
Python API
from data_juicer.ops.filter.image_face_count_filter import ImageFaceCountFilter
op = ImageFaceCountFilter(min_face_count=1, max_face_count=1)
# Apply to a dataset (assumes `dataset` is an already loaded Data-Juicer dataset)
result = dataset.process(op)