Implementation:Datajuicer Data juicer TextTaggingByPromptMapper

From Leeroopedia
Revision as of 12:23, 16 February 2026 by Admin (auto-imported from implementations/Datajuicer_Data_juicer_TextTaggingByPromptMapper.md)
Knowledge Sources
Domains: Data_Processing, Mapping
Last Updated: 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for LLM-based text tagging driven by prompts.

Description

TextTaggingByPromptMapper is a mapper operator that generates text tags (labels) using an LLM with a customizable prompt. It supports both classification and binary detection workflows. The operator loads a Hugging Face model (default: Qwen/Qwen2.5-7B-Instruct), optionally accelerated with vLLM for high-throughput inference. For each sample, it formats the text and the tag list into a prompt, collects the model's classification response, and stores that response in the sample's metadata fields.
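The prompt-formatting step described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the template text, the `{tag_list}` and `{text}` placeholder names, and the helpers `build_prompt` and `normalize_tag` are all assumptions made for the example, since the real DEFAULT_CLASSIFICATION_PROMPT is not reproduced on this page.

```python
from typing import List, Optional

# Hypothetical template; the real DEFAULT_CLASSIFICATION_PROMPT may differ
# in wording and in placeholder names.
EXAMPLE_PROMPT = (
    "Classify the text into exactly one of these tags: {tag_list}.\n"
    "Text: {text}\n"
    "Answer with the tag only."
)


def build_prompt(text: str, tag_list: List[str],
                 template: str = EXAMPLE_PROMPT) -> str:
    """Fill the template with the allowed tags and the sample text."""
    return template.format(tag_list=", ".join(tag_list), text=text)


def normalize_tag(response: str, tag_list: List[str]) -> Optional[str]:
    """Map a raw model reply onto an allowed tag, or None if nothing matches."""
    cleaned = response.strip().strip(".\"'").lower()
    # Prefer an exact (case-insensitive) match.
    for tag in tag_list:
        if cleaned == tag.lower():
            return tag
    # Otherwise accept a tag mentioned inside a longer reply, longest tag first.
    for tag in sorted(tag_list, key=len, reverse=True):
        if tag.lower() in cleaned:
            return tag
    return None


if __name__ == "__main__":
    tags = ["Math", "Coding", "Other"]
    print(build_prompt("Solve 2 + 2.", tags))
    print(normalize_tag(" coding. ", tags))
```

Whether the operator normalizes the model's reply or stores it verbatim is not stated here; `normalize_tag` is shown only as one possible post-processing step for replies that do not match a tag exactly.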

Usage

Use this operator when you need prompt-driven data annotation at scale, such as task category classification, AI self-identity detection, or any custom tagging task that can be defined through a prompt and a tag list.

Code Reference

Source Location

Signature

@OPERATORS.register_module("text_tagging_by_prompt_mapper")
class TextTaggingByPromptMapper(Mapper):
    def __init__(
        self,
        hf_model: str = "Qwen/Qwen2.5-7B-Instruct",
        trust_remote_code: bool = False,
        prompt: str = DEFAULT_CLASSIFICATION_PROMPT,
        tag_list: List[str] = DEFAULT_CLASSIFICATION_LIST,
        enable_vllm: bool = True,
        tensor_parallel_size: int = None,
        max_model_len: int = None,
        max_num_seqs: int = 256,
        sampling_params: Dict = None,
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.text_tagging_by_prompt_mapper import TextTaggingByPromptMapper
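A direct Python instantiation might look like the sketch below, which uses only the constructor parameters from the signature above. The helpers `build_mapper_kwargs` and `make_mapper` are hypothetical names introduced for illustration, and the two-option tag list is an assumed example of a binary detection setup. The import is deferred so that the keyword arguments can be assembled and inspected on a machine without Data-Juicer or a GPU.

```python
from typing import Any, Dict, List, Optional


def build_mapper_kwargs(tag_list: List[str],
                        prompt: Optional[str] = None,
                        enable_vllm: bool = True) -> Dict[str, Any]:
    """Assemble constructor arguments (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "hf_model": "Qwen/Qwen2.5-7B-Instruct",  # documented default
        "trust_remote_code": False,
        "tag_list": tag_list,
        "enable_vllm": enable_vllm,
        "max_num_seqs": 256,
    }
    # Leave the operator's default prompt in place unless one is supplied.
    if prompt is not None:
        kwargs["prompt"] = prompt
    return kwargs


def make_mapper(kwargs: Dict[str, Any]):
    """Create the operator; requires data_juicer and loads the model."""
    from data_juicer.ops.mapper.text_tagging_by_prompt_mapper import (
        TextTaggingByPromptMapper,
    )
    return TextTaggingByPromptMapper(**kwargs)


if __name__ == "__main__":
    # Assumed binary-detection tag list, for illustration only.
    print(build_mapper_kwargs(tag_list=["yes", "no"], enable_vllm=False))
```

Calling `make_mapper` is left out of the demo because it downloads and loads the model; setting `enable_vllm=False` is only needed where vLLM is unavailable.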

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| hf_model | str | No | Hugging Face model id (default: "Qwen/Qwen2.5-7B-Instruct") |
| trust_remote_code | bool | No | Whether to trust remote code of HF models (default: False) |
| prompt | str | No | Prompt template used to generate text tags |
| tag_list | List[str] | No | List of tagging output options |
| enable_vllm | bool | No | Whether to use vLLM for inference acceleration (default: True) |
| tensor_parallel_size | int | No | Number of GPUs for tensor parallelism (default: None) |
| max_model_len | int | No | Model context length (default: None, auto-derived) |
| max_num_seqs | int | No | Maximum sequences processed per iteration (default: 256) |
| sampling_params | Dict | No | Sampling parameters for text generation (default: None) |

Outputs

| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with text_tags field populated |
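Once samples carry a text_tags value, a quick tally of the tag distribution is a common sanity check. The sketch below assumes each sample is a dict whose metadata lives under a key named "meta" and that text_tags holds a single string. Both are assumptions for illustration: the actual metadata field name used by Data-Juicer and the stored value type are not given on this page, and `count_tags` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_tags(samples: Iterable[Dict[str, Any]],
               meta_key: str = "meta",        # assumed metadata field name
               tag_key: str = "text_tags") -> Counter:
    """Count how often each tag occurs; untagged samples go to '<missing>'."""
    counts: Counter = Counter()
    for sample in samples:
        tag = sample.get(meta_key, {}).get(tag_key)
        counts[tag if tag is not None else "<missing>"] += 1
    return counts


if __name__ == "__main__":
    demo = [
        {"text": "Solve 2 + 2.", "meta": {"text_tags": "Math"}},
        {"text": "Write a loop.", "meta": {"text_tags": "Coding"}},
        {"text": "Prove it.", "meta": {"text_tags": "Math"}},
        {"text": "Untagged sample."},
    ]
    print(count_tags(demo))
```

A large "<missing>" count, or many replies outside the tag list, usually suggests the prompt or sampling settings need adjusting.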

Usage Examples

process:
  - text_tagging_by_prompt_mapper:
      hf_model: "Qwen/Qwen2.5-7B-Instruct"
      enable_vllm: true
