Implementation:Datajuicer Data juicer MetaTagsAggregator
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Aggregation |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for merging similar meta tags into unified categories.
Description
MetaTagsAggregator collects tags from sample metadata, counts their frequencies, and presents the counts to an LLM (default: gpt-4o) as a markdown table. The LLM analyzes tag similarity and maps each original tag to a merged category. Two scenarios are supported: mapping to predefined target tags (with an optional "miscellaneous" category), or generating new categories from tag similarity and frequency. The LLM output is parsed with a regex pattern to extract the tag mappings, which are then applied to update each sample's meta tags. The default prompts are written in Chinese and include detailed examples of tag consolidation.
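The flow described above (count tags, render a markdown table, parse the reply with a regex, apply the mapping) can be illustrated with a minimal standalone sketch. It is not the operator's actual implementation: the helper names, the flat `sample[tag_key]` layout, and the `original -> merged` reply format are assumptions made for illustration, and the LLM call itself is omitted.

```python
import re
from collections import Counter


def count_tags(samples, tag_key):
    """Count how often each tag occurs; a tag field may be a str or a list."""
    counts = Counter()
    for sample in samples:
        tags = sample.get(tag_key, [])
        if isinstance(tags, str):
            tags = [tags]
        counts.update(tags)
    return counts


def to_markdown_table(counts):
    """Render tag frequencies as a markdown table, most frequent first."""
    lines = ["| tag | frequency |", "|---|---|"]
    for tag, freq in counts.most_common():
        lines.append(f"| {tag} | {freq} |")
    return "\n".join(lines)


# Hypothetical reply format (an assumption): one "original -> merged" pair per line.
MAPPING_PATTERN = re.compile(r"^\s*(.+?)\s*->\s*(.+?)\s*$", re.MULTILINE)


def parse_mapping(reply):
    """Extract {original_tag: merged_tag} pairs from the reply text."""
    return {src: dst for src, dst in MAPPING_PATTERN.findall(reply)}


def apply_mapping(samples, tag_key, mapping):
    """Rewrite each sample's tags in place; unmapped tags are kept unchanged."""
    for sample in samples:
        tags = sample.get(tag_key)
        if isinstance(tags, str):
            sample[tag_key] = mapping.get(tags, tags)
        elif isinstance(tags, list):
            sample[tag_key] = [mapping.get(t, t) for t in tags]
    return samples


# Toy data standing in for sample metadata and an LLM reply.
samples = [{"tags": ["happy", "joyful"]}, {"tags": "happy"}, {"tags": ["angry"]}]
table = to_markdown_table(count_tags(samples, "tags"))
reply = "happy -> positive\njoyful -> positive\nangry -> negative"
merged = apply_mapping(samples, "tags", parse_mapping(reply))
```

In this sketch, `table` is the text that would be placed into the prompt, and `merged` holds the samples after consolidation. The str-or-list handling mirrors the output type listed in the I/O contract.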
Usage
Use when you need to normalize and consolidate fragmented or semantically similar tags in your dataset metadata into coherent categories, reducing tag duplication and improving data organization.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/aggregator/meta_tags_aggregator.py
Signature
@OPERATORS.register_module("meta_tags_aggregator")
class MetaTagsAggregator(Aggregator):
    def __init__(
        self,
        api_model: str = "gpt-4o",
        meta_tag_key: str = MetaKeys.dialog_sentiment_labels,
        target_tags: Optional[List[str]] = None,
        *,
        api_endpoint: Optional[str] = None,
        response_path: Optional[str] = None,
        system_prompt: Optional[str] = None,
        input_template: Optional[str] = None,
        target_tag_template: Optional[str] = None,
        tag_template: Optional[str] = None,
        output_pattern: Optional[str] = None,
        try_num: PositiveInt = 3,
        model_params: Dict = {},
        sampling_params: Dict = {},
        **kwargs,
    ):
Import
from data_juicer.ops.aggregator.meta_tags_aggregator import MetaTagsAggregator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| meta_tag_key | str | No | Key of the meta tag to be mapped. Default: MetaKeys.dialog_sentiment_labels |
| target_tags | List[str] | No | Predefined tags to map to. Default: None (auto-generate categories) |
| api_endpoint | str | No | URL endpoint for the API. Default: None |
| try_num | PositiveInt | No | Number of retry attempts. Default: 3 |
Outputs
| Name | Type | Description |
|---|---|---|
| sample[Fields.meta][meta_tag_key] | str or list | Updated meta tags mapped to consolidated categories |
Usage Examples
process:
  - meta_tags_aggregator:
      api_model: "gpt-4o"
      target_tags: ["technology", "health", "other"]