Implementation:Datajuicer Data juicer MetaTagsAggregator

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Aggregation
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for merging and consolidating similar meta tags into unified categories.

Description

MetaTagsAggregator collects the tags stored under a given meta key across samples, counts how often each tag occurs, and presents the counts to an LLM (default: gpt-4o) as a markdown table. The LLM judges tag similarity and maps each original tag to a merged category. Two scenarios are supported: mapping onto predefined target tags (with an optional "miscellaneous" category), or generating new categories from tag similarity and frequency. The LLM output is parsed with a regex pattern to extract the tag mappings, which are then applied to update each sample's meta tags. The default prompts are written in Chinese and include detailed examples of tag consolidation.
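
The count-and-present step and the mapping-extraction step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: the helper names (count_tags, render_tag_table, parse_mapping), the table headers, and the "original -> merged" reply format are all hypothetical. The operator's real prompts and default output_pattern live in its source.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Union

TagValue = Union[str, List[str]]


def count_tags(tag_values: Iterable[TagValue]) -> Counter:
    """Count tag frequencies; each value may be a single tag or a list of tags."""
    counts: Counter = Counter()
    for value in tag_values:
        tags = [value] if isinstance(value, str) else value
        counts.update(tags)
    return counts


def render_tag_table(counts: Counter) -> str:
    """Render the counts as a markdown table, most frequent tag first."""
    lines = ["| tag | frequency |", "| --- | --- |"]
    for tag, freq in counts.most_common():
        lines.append(f"| {tag} | {freq} |")
    return "\n".join(lines)


# Hypothetical reply format: one "original -> merged" pair per line.
# The real operator uses its own output_pattern.
MAPPING_PATTERN = re.compile(r"^[ \t]*(.+?)[ \t]*->[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_mapping(llm_reply: str) -> Dict[str, str]:
    """Extract {original_tag: merged_tag} pairs; lines without '->' are ignored."""
    return {orig: merged for orig, merged in MAPPING_PATTERN.findall(llm_reply)}
```

In this sketch the table text would be embedded in the prompt, and parse_mapping would be run on the model's reply; any line that does not match the pattern is skipped rather than treated as an error.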

Usage

Use when you need to normalize and consolidate fragmented or semantically similar tags in your dataset metadata into coherent categories, reducing tag duplication and improving data organization.

Code Reference

Source Location

Signature

@OPERATORS.register_module("meta_tags_aggregator")
class MetaTagsAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 meta_tag_key: str = MetaKeys.dialog_sentiment_labels,
                 target_tags: Optional[List[str]] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt: Optional[str] = None,
                 input_template: Optional[str] = None,
                 target_tag_template: Optional[str] = None,
                 tag_template: Optional[str] = None,
                 output_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.meta_tags_aggregator import MetaTagsAggregator

I/O Contract

Inputs

Name | Type | Required | Description
api_model | str | No | API model name. Default: "gpt-4o"
meta_tag_key | str | No | Key of the meta tag to be mapped. Default: dialog_sentiment_labels
target_tags | List[str] | No | Predefined tags to map to. Default: None (auto-generate categories)
api_endpoint | str | No | URL endpoint for the API
try_num | PositiveInt | No | Number of retry attempts. Default: 3
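
As a rough illustration of what a retry budget such as try_num implies, the sketch below retries an LLM call until the reply yields a non-empty mapping. The function map_tags_with_retry and its call_llm and parse arguments are hypothetical stand-ins, and the retry policy shown (retry on any exception or empty result, return an empty mapping when the budget is exhausted) is an assumption rather than the operator's documented behavior.

```python
from typing import Callable, Dict


def map_tags_with_retry(
    call_llm: Callable[[str], str],          # hypothetical: sends a prompt, returns the reply text
    parse: Callable[[str], Dict[str, str]],  # hypothetical: extracts the tag mapping from a reply
    prompt: str,
    try_num: int = 3,
) -> Dict[str, str]:
    """Call the model up to try_num times and return the first non-empty mapping."""
    for _ in range(try_num):
        try:
            mapping = parse(call_llm(prompt))
        except Exception:
            # Assumption: any failure (network error, malformed reply) consumes one attempt.
            continue
        if mapping:
            return mapping
    # Assumption: with no usable reply, fall back to an empty mapping (tags left unchanged).
    return {}
```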

Outputs

Name | Type | Description
sample[Fields.meta][meta_tag_key] | str or list | Updated meta tags mapped to consolidated categories
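
To make the output contract concrete, the following sketch applies a tag mapping to one sample whose tag field is either a single string or a list. It is an illustration only: META_KEY is a plain-string stand-in for Fields.meta, apply_mapping is a hypothetical helper, and leaving unmapped tags unchanged is an assumption about the behavior.

```python
from typing import Any, Dict

META_KEY = "meta"  # stand-in for Fields.meta (assumption; the real key is defined by Data-Juicer)


def apply_mapping(sample: Dict[str, Any], meta_tag_key: str,
                  mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rewrite the sample's tags in place using mapping; unmapped tags are kept as-is."""
    value = sample[META_KEY][meta_tag_key]
    if isinstance(value, str):
        # Single tag stored as a string.
        sample[META_KEY][meta_tag_key] = mapping.get(value, value)
    else:
        # Several tags stored as a list: map element-wise, preserving order.
        sample[META_KEY][meta_tag_key] = [mapping.get(tag, tag) for tag in value]
    return sample


sample = {"text": "...", META_KEY: {"labels": ["happy", "sad", "curious"]}}
apply_mapping(sample, "labels", {"happy": "positive", "sad": "negative"})
print(sample[META_KEY]["labels"])  # ['positive', 'negative', 'curious']
```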

Usage Examples

process:
  - meta_tags_aggregator:
      api_model: "gpt-4o"
      target_tags: ["technology", "health", "other"]

Related Pages
