Implementation:Datajuicer Data juicer TextTaggingByPromptMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for prompt-driven text tagging with an LLM.

Description

TextTaggingByPromptMapper is a mapper operator that generates text tags/labels using an LLM with customizable prompts, supporting both classification and binary detection workflows. It loads a Hugging Face model (default: Qwen/Qwen2.5-7B-Instruct), optionally using vLLM for high-throughput inference. For each text sample it formats a prompt containing the tag list and the text, collects the model's classification response, and stores it in the sample's metadata as the text_tags field.
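The format-query-store flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the operator's actual code: the `{tag_list}` and `{text}` placeholder names, the `fake_llm` stand-in for the model call, the `tag_sample` helper, and the `meta` key are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

# Hypothetical template; the placeholder names are assumptions for this sketch.
PROMPT_TEMPLATE = (
    "Assign exactly one tag from the list to the text.\n"
    "Tags: {tag_list}\n"
    "Text: {text}\n"
    "Tag:"
)


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model call; always answers 'Math'."""
    return "Math"


def tag_sample(
    sample: Dict, tag_list: List[str], generate: Callable[[str], str]
) -> Dict:
    """Format the prompt, query the model, and store the reply in metadata."""
    prompt = PROMPT_TEMPLATE.format(tag_list=tag_list, text=sample["text"])
    reply = generate(prompt).strip()
    # "meta" is an illustrative key name, not the confirmed Data-Juicer field.
    sample.setdefault("meta", {})["text_tags"] = [reply]
    return sample


tagged = tag_sample(
    {"text": "What is 2 + 2?"}, ["Math", "Coding", "Other"], fake_llm
)
print(tagged["meta"]["text_tags"])
```

Passing the generator in as a callable keeps the sketch independent of whether inference runs through vLLM or plain Hugging Face.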

Usage

Use when you need prompt-driven data annotation at scale, such as task category classification, AI self-identity detection, or any custom tagging task definable through prompts and tag lists.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/text_tagging_by_prompt_mapper.py

Signature

@OPERATORS.register_module("text_tagging_by_prompt_mapper")
class TextTaggingByPromptMapper(Mapper):
    def __init__(
        self,
        hf_model: str = "Qwen/Qwen2.5-7B-Instruct",
        trust_remote_code: bool = False,
        prompt: str = DEFAULT_CLASSIFICATION_PROMPT,
        tag_list: List[str] = DEFAULT_CLASSIFICATION_LIST,
        enable_vllm: bool = True,
        tensor_parallel_size: int = None,
        max_model_len: int = None,
        max_num_seqs: int = 256,
        sampling_params: Dict = None,
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.text_tagging_by_prompt_mapper import TextTaggingByPromptMapper

I/O Contract

Inputs

Name	Type	Required	Description
hf_model	str	No	HuggingFace model id (default: "Qwen/Qwen2.5-7B-Instruct")
trust_remote_code	bool	No	Whether to trust remote code of HF models (default: False)
prompt	str	No	Prompt template used to generate text tags
tag_list	List[str]	No	List of tagging output options
enable_vllm	bool	No	Whether to use vLLM for inference acceleration (default: True)
tensor_parallel_size	int	No	Number of GPUs for tensor parallelism (default: None)
max_model_len	int	No	Model context length (default: None, auto-derived)
max_num_seqs	int	No	Maximum sequences processed per iteration (default: 256)
sampling_params	Dict	No	Sampling parameters for text generation (default: None)
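As a rough aid for working with the parameters in the table above, the sketch below merges user overrides with the documented defaults and rejects unknown names. The `resolve_params` helper is hypothetical and not part of Data-Juicer, and the `temperature` entry in `sampling_params` is only an illustrative value.

```python
from typing import Any, Dict

# Defaults copied from the table above; prompt and tag_list are left out
# because their default values are not listed there.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "hf_model": "Qwen/Qwen2.5-7B-Instruct",
    "trust_remote_code": False,
    "enable_vllm": True,
    "tensor_parallel_size": None,
    "max_model_len": None,
    "max_num_seqs": 256,
    "sampling_params": None,
}
ALLOWED_KEYS = set(DOCUMENTED_DEFAULTS) | {"prompt", "tag_list"}


def resolve_params(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: overlay overrides on the documented defaults."""
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


params = resolve_params(
    {"enable_vllm": False, "sampling_params": {"temperature": 0.0}}
)
print(params["max_num_seqs"], params["enable_vllm"])
```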

Outputs

Name	Type	Description
samples	Dict	Transformed samples with text_tags field populated

Usage Examples

process:
  - text_tagging_by_prompt_mapper:
      hf_model: "Qwen/Qwen2.5-7B-Instruct"
      enable_vllm: true
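For a custom tagging task, the same recipe structure could also carry a prompt and tag list. The sketch below builds such an entry as a Python dictionary purely for illustration; the yes/no tags, the prompt wording, and the `{tag_list}` / `{text}` placeholders are assumptions, and a real recipe would be written in YAML as in the example above.

```python
import json

# Hypothetical binary-detection setup: tag names and prompt wording are
# invented for illustration, and the placeholder names are assumed.
recipe = {
    "process": [
        {
            "text_tagging_by_prompt_mapper": {
                "hf_model": "Qwen/Qwen2.5-7B-Instruct",
                "enable_vllm": True,
                "tag_list": ["yes", "no"],
                "prompt": (
                    "Does the text below contain a statement about the "
                    "assistant's own identity? Answer with one of {tag_list}.\n"
                    "{text}"
                ),
            }
        }
    ]
}

# Pull out the single operator entry to inspect its name and arguments.
op_name, op_args = next(iter(recipe["process"][0].items()))
print(op_name)
print(json.dumps(op_args["tag_list"]))
```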

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
