Implementation:Datajuicer Data juicer TagsSpecifiedFieldSelector
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Selection |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering samples by matching a field's value against a set of target tags.
Description
TagsSpecifiedFieldSelector extends Selector and keeps only those dataset samples whose value in a specified field matches one of a predefined set of target tags. It iterates over all samples, resolves the value at the dot-separated, multi-level field key, and tests that value for membership in the set of target tags. The indices of matching samples are collected and passed to dataset.select() to produce the filtered subset. Matching is case-sensitive, and the field value must be a string, a number, or None. If the dataset has fewer than two samples, or if field_key is empty, the dataset is returned unchanged.
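The selection logic described above can be illustrated with a minimal pure-Python sketch. It is not the Data-Juicer implementation: it works on a plain list of dicts rather than a Dataset, and the helper names `resolve_field` and `select_by_tags`, the sample data, and the error types are assumptions made for illustration only.

```python
import numbers
from typing import Any, Dict, Iterable, List


def resolve_field(sample: Dict[str, Any], field_key: str) -> Any:
    """Walk a dot-separated key such as 'meta.lang' through nested dicts."""
    value: Any = sample
    for key in field_key.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"'{key}' not found while resolving '{field_key}'")
        value = value[key]
    return value


def select_by_tags(
    samples: List[Dict[str, Any]],
    field_key: str,
    target_tags: Iterable[str],
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the selector, operating on a list of dicts."""
    # Early return: fewer than two samples or an empty field key.
    if len(samples) <= 1 or not field_key:
        return samples

    tags = set(target_tags)  # set membership, so matching is exact and case-sensitive
    selected_indices = []
    for index, sample in enumerate(samples):
        value = resolve_field(sample, field_key)
        # Only strings, numbers, and None are accepted as field values.
        if not (value is None or isinstance(value, (str, numbers.Number))):
            raise TypeError(f"unsupported field value type: {type(value).__name__}")
        if value in tags:
            selected_indices.append(index)

    # Stand-in for dataset.select(selected_indices) on a real Dataset.
    return [samples[i] for i in selected_indices]


# Illustrative data (not from Data-Juicer).
samples = [
    {"text": "hello", "meta": {"lang": "en"}},
    {"text": "bonjour", "meta": {"lang": "fr"}},
    {"text": "你好", "meta": {"lang": "zh"}},
    {"text": "HELLO", "meta": {"lang": "EN"}},
]

kept = select_by_tags(samples, "meta.lang", ["en", "zh"])
print([s["text"] for s in kept])  # ['hello', '你好']
```

The sample tagged "EN" is dropped because the comparison is case-sensitive; normalizing the field beforehand would be needed to keep it.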
Usage
Use this operator when a data processing pipeline needs tag-based or category-based filtering, that is, keeping only the samples that belong to specific categories, labels, or groups.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/selector/tags_specified_field_selector.py
Signature
@OPERATORS.register_module("tags_specified_field_selector")
class TagsSpecifiedFieldSelector(Selector):
    def __init__(self, field_key: str = "",
                 target_tags: List[str] = None,
                 *args, **kwargs):
Import
from data_juicer.ops.selector.tags_specified_field_selector import TagsSpecifiedFieldSelector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| field_key | str | No | Target field key. Multi-level fields separated by '.'. Default: "" |
| target_tags | List[str] | Yes | List of tags to match against the field value |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Filtered dataset containing only samples whose field values match a target tag |
Usage Examples
process:
  - tags_specified_field_selector:
      field_key: "__dj__stats__.lang"
      target_tags: ["en", "zh"]