Implementation:Datajuicer Data juicer TextTaggingByPromptMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete Data-Juicer operator for prompt-driven, LLM-based text tagging.
Description
TextTaggingByPromptMapper is a mapper operator that generates text tags (labels) using an LLM with customizable prompts, supporting both classification and binary detection workflows. It loads a Hugging Face model (default: Qwen/Qwen2.5-7B-Instruct), optionally accelerated with vLLM for high-throughput inference. For each text sample, it formats the tag list and the text into a prompt, collects the model's classification response, and stores it in the sample's metadata fields.
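The prompt-formatting step described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the template text, the `{tag_list}` and `{text}` placeholder names, and the helpers `build_prompt` and `normalize_tag` are all assumptions made for the example. In particular, the description does not say that the operator normalizes the model's reply; `normalize_tag` only shows one way a caller might post-process it.

```python
from typing import List

# Hypothetical template: the wording and the placeholder names
# {tag_list} and {text} are assumptions, not copied from the operator.
EXAMPLE_PROMPT = (
    "Classify the text below into exactly one of these tags: {tag_list}.\n"
    "Reply with the tag only.\n"
    "Text:\n{text}\n"
)


def build_prompt(text: str, tag_list: List[str], template: str = EXAMPLE_PROMPT) -> str:
    """Fill the template with the candidate tags and the sample text."""
    return template.format(tag_list=", ".join(tag_list), text=text)


def normalize_tag(response: str, tag_list: List[str], fallback: str = "unknown") -> str:
    """Map a raw model reply onto one of the allowed tags (illustrative only)."""
    # Drop surrounding whitespace, quotes, and a trailing period, then
    # compare case-insensitively against the allowed tags.
    cleaned = response.strip().strip(".\"'").lower()
    for tag in tag_list:
        if cleaned == tag.lower():
            return tag
    return fallback
```

A binary detection workflow fits the same shape: pass a two-element tag list (for example `["yes", "no"]`) and a prompt that asks the detection question.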
Usage
Use this operator when you need prompt-driven data annotation at scale, such as task category classification, AI self-identity detection, or any custom tagging task that can be defined through a prompt and a tag list.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/text_tagging_by_prompt_mapper.py
Signature
@OPERATORS.register_module("text_tagging_by_prompt_mapper")
class TextTaggingByPromptMapper(Mapper):
    def __init__(
        self,
        hf_model: str = "Qwen/Qwen2.5-7B-Instruct",
        trust_remote_code: bool = False,
        prompt: str = DEFAULT_CLASSIFICATION_PROMPT,
        tag_list: List[str] = DEFAULT_CLASSIFICATION_LIST,
        enable_vllm: bool = True,
        tensor_parallel_size: int = None,
        max_model_len: int = None,
        max_num_seqs: int = 256,
        sampling_params: Dict = None,
        *args,
        **kwargs,
    ):
Import
from data_juicer.ops.mapper.text_tagging_by_prompt_mapper import TextTaggingByPromptMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_model | str | No | Hugging Face model ID (default: "Qwen/Qwen2.5-7B-Instruct") |
| trust_remote_code | bool | No | Whether to trust remote code of HF models (default: False) |
| prompt | str | No | Prompt template used to generate text tags |
| tag_list | List[str] | No | List of tagging output options |
| enable_vllm | bool | No | Whether to use vLLM for inference acceleration (default: True) |
| tensor_parallel_size | int | No | Number of GPUs for tensor parallelism (default: None) |
| max_model_len | int | No | Model context length (default: None, auto-derived) |
| max_num_seqs | int | No | Maximum sequences processed per iteration (default: 256) |
| sampling_params | Dict | No | Sampling parameters for text generation (default: None) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with text_tags field populated |
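Once samples are tagged, a common next step is to inspect the label distribution. The sketch below is a hedged illustration that works on plain dictionaries: it assumes the tag is reachable under a top-level `text_tags` key and may be either a single string or a list of strings. The exact location and type of the field in real Data-Juicer output may differ (for example, nested under a metadata field), so adjust `tag_key` and the lookup accordingly. The helper name `tag_distribution` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_distribution(samples: Iterable[Dict], tag_key: str = "text_tags") -> Counter:
    """Count how often each tag occurs across already-tagged samples."""
    counts: Counter = Counter()
    for sample in samples:
        value = sample.get(tag_key)
        if value is None:
            # Sample was never tagged; skip it rather than guessing.
            continue
        # Accept either one tag (str) or several tags (list/tuple).
        tags = [value] if isinstance(value, str) else list(value)
        counts.update(str(tag).strip() for tag in tags)
    return counts
```

Calling `tag_distribution(samples).most_common()` then lists tags from most to least frequent, which makes unexpected or free-form model replies easy to spot.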
Usage Examples
process:
  - text_tagging_by_prompt_mapper:
      hf_model: "Qwen/Qwen2.5-7B-Instruct"
      enable_vllm: true