Implementation:FlagOpen FlagEmbedding BGE Coder TripletGenerator

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Machine Learning, Data Generation, Large Language Models, Code Retrieval
Last Updated	2026-02-09 00:00 GMT

Overview

An LLM-powered triplet generator that creates query-positive-negative training examples for code embedding models through automated generation and quality control.

Description

TripletGenerator orchestrates the generation of training triplets (query, positive, negative) for code retrieval models. Depending on the task type, it uses a large language model either to synthesize a realistic query from a code snippet or to generate code from a textual description. Generation strategies are task-specific, with dedicated handling for multi-step tasks such as bug description retrieval and code modification retrieval.

The generator includes built-in quality control: an LLM-based judge validates each generated query-positive pair and filters out low-quality or mismatched ones. It also supports parallel processing with thread pools, per-document caching, optional hard negative generation to make training examples more challenging, and few-shot prompting with example pools.

Key features include handling of 63 different task types with specialized generation logic, support for both normal single-step and complex multi-step generation workflows, automatic arrangement of query-positive pairs based on task category (text2code, code2text, code2code, hybrid), and robust error handling with graceful degradation.

Usage

Use this class to generate synthetic training data for code embedding models by providing code snippets or text and letting the LLM create corresponding queries, positives, and optionally hard negatives with quality validation.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/BGE_Coder/data_generation/triplet_generator.py
Lines: 1-654

Signature

class TripletGenerator(LLM):
    def __init__(
        self,
        model: str = "Qwen2-5-Coder-32B-Instruct",
        model_type: str = "open-source",
        port: int = 8000,
        cache_dir: Optional[str] = None
    )

    def generate_triplets(
        self,
        data: dict,
        task: Task,
        examples_pool: Optional[List[dict]] = None,
        num_examples: int = 3,
        debug_mode: bool = False,
        **kwargs
    ) -> List[dict]

    def run(
        self,
        positives: List[dict],
        task_type: str,
        language: str = "en",
        code_language: str = "python",
        tgt_code_language: Optional[str] = None,
        examples_pool: Optional[List[dict]] = None,
        num_examples: int = 3,
        tqdm_desc: str = "Generating triplets",
        debug_mode: bool = False,
        gen_hard_neg: bool = False,
        num_negatives: int = 7,
        thread_count: int = 1,
        **kwargs
    ) -> List[dict]

    def gen_hard_negatives(
        self,
        result: dict,
        task: Task,
        num_negatives: int = 7,
        **kwargs
    ) -> dict

Import

from triplet_generator import TripletGenerator

I/O Contract

Inputs

Name	Type	Required	Description
positives	List[dict]	Yes	List of positive examples with 'text' field (code or text)
task_type	str	Yes	Task type identifier (e.g., "web_code_retrieval")
language	str	No	Natural language (default: "en")
code_language	str	No	Programming language (default: "python")
tgt_code_language	Optional[str]	No	Target code language for translation tasks
examples_pool	Optional[List[dict]]	No	Pool of few-shot examples
num_examples	int	No	Number of examples to use (default: 3)
debug_mode	bool	No	Include generation details in output (default: False)
gen_hard_neg	bool	No	Generate hard negatives (default: False)
num_negatives	int	No	Number of hard negatives (default: 7)
thread_count	int	No	Parallel threads (default: 1)

Outputs

Name	Type	Description
triplets	List[dict]	Generated triplets with prompt, query, pos, neg fields
prompt	str	Task instruction string
query	str	Generated query (text or code)
pos	List[str]	List with single positive example
neg	List[str]	List of hard negatives; empty unless gen_hard_neg=True

Generation Strategies

Normal Task Generation

For standard single-step tasks:

1. Generate the prompt using get_generation_prompt()
2. Call the LLM to generate output
3. Arrange query and positive based on task type
4. Apply quality control filtering
5. Return the validated triplet

# Text2Code: Generate query from code
# Input: code snippet
# Output: query = generated text, pos = input code

# Code2Text: Generate text from code
# Input: code snippet
# Output: query = input code, pos = generated text

# Code2Code: Generate code from code
# Input: code snippet
# Output: query = input code, pos = generated code
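
The arrangement rules above can be sketched as a small function. This is a minimal illustration, not the repository's _arrange_query_and_pos: the function name, the string category labels, and the argument names are assumptions made for this example, and the hybrid category is left out.

```python
from typing import Tuple


def arrange_query_and_pos(category: str, input_text: str, generated: str) -> Tuple[str, str]:
    """Return (query, positive) for one generated pair.

    The category labels are illustrative strings, not the repository's enum.
    """
    if category == "text2code":
        # The LLM wrote a text query for the input code.
        return generated, input_text
    if category in ("code2text", "code2code"):
        # The input code is the query; the LLM output is the positive.
        return input_text, generated
    raise ValueError(f"unsupported category: {category}")


code = "def add(a, b):\n    return a + b"
query, pos = arrange_query_and_pos("text2code", code, "how to add two numbers in python")
```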

Code Context Retrieval

Special handling for code continuation:

1. Split the code at an anchor point (40%-70%)
2. The former part becomes the query
3. The latter part becomes the positive
4. No LLM generation is needed
5. No quality control is applied
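
A minimal sketch of the split, assuming the anchor is chosen uniformly at random and falls on a line boundary. The source states only the 40%-70% range, so the line-based split, the function name, and the optional rng parameter are assumptions of this example.

```python
import random
from typing import Optional, Tuple


def split_code_at_anchor(code: str, rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Split code into (query, positive) at a random point 40%-70% through it."""
    rng = rng or random.Random()
    lines = code.splitlines(keepends=True)
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    ratio = rng.uniform(0.4, 0.7)
    # Clamp so that both halves contain at least one line.
    anchor = min(max(1, round(len(lines) * ratio)), len(lines) - 1)
    return "".join(lines[:anchor]), "".join(lines[anchor:])


snippet = "".join(f"line_{i} = {i}\n" for i in range(10))
query, pos = split_code_at_anchor(snippet, random.Random(0))
```

Because the two halves are taken from the input as-is, concatenating query and positive reproduces the original snippet, which is consistent with skipping quality control for this task.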

Bug Description Retrieval

Two-step process:

1. Generate a buggy version of the code
2. Generate a bug description from the buggy code
3. Query = bug description, Positive = original (correct) code
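
The flow can be sketched with a stand-in model call. The prompts, the function name bug_desc_triplet, and the fake_llm stub below are hypothetical and exist only to make the example runnable; the real class builds its prompts and calls the model through its own methods.

```python
from typing import Callable, Dict


def bug_desc_triplet(code: str, llm: Callable[[str], str]) -> Dict[str, object]:
    # Step 1: ask the model for a buggy variant of the correct code.
    buggy_code = llm(f"Introduce a subtle bug into this code:\n{code}")
    # Step 2: ask the model to describe the bug in the buggy variant.
    description = llm(f"Describe the bug in this code:\n{buggy_code}")
    # The description is the query; the original, correct code is the positive.
    return {"query": description, "pos": [code]}


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model call, used only in this sketch."""
    if prompt.startswith("Introduce"):
        return "def add(a, b):\n    return a - b"
    return "add() subtracts its arguments instead of adding them"


original = "def add(a, b):\n    return a + b"
triplet = bug_desc_triplet(original, fake_llm)
```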

Code Modification Retrieval

Two-step process requiring similar code:

1. Generate the differences between the input and the similar code
2. Generate a modification instruction from the differences
3. Query = modification instruction + input code, Positive = similar code

Code Comparison Retrieval

Two-step process:

1. Generate a comparison question from two code snippets
2. Generate the comparison answer
3. Query = question + both code snippets, Positive = answer

Hybrid Two-Step Tasks

For tasks like refactoring, style guidelines, and migration:

1. Generate a natural language description/question
2. Generate code output based on the description
3. Optionally reverse query and positive based on the task
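
A minimal sketch of this pattern, assuming the second prompt is built from the first step's output and that reversal is controlled by a boolean. The prompts, the reverse flag, and the stub model are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


def hybrid_two_step(code: str, llm: Callable[[str], str], reverse: bool = False) -> Dict[str, object]:
    # Step 1: produce a natural language description or question.
    description = llm(f"Describe the change needed for:\n{code}")
    # Step 2: produce code from that description.
    output_code = llm(f"Write code for this request:\n{description}")
    query, pos = description, output_code
    if reverse:
        # Some tasks want the code as the query and the text as the positive.
        query, pos = pos, query
    return {"query": query, "pos": [pos]}


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model call, used only in this sketch."""
    if prompt.startswith("Describe"):
        return "Rename variable x to total"
    return "total = 0"


forward = hybrid_two_step("x = 0", fake_llm)
backward = hybrid_two_step("x = 0", fake_llm, reverse=True)
```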

Quality Control

Every generated triplet, except those produced for code_context_retrieval, undergoes validation:

Validation Process

1. Generate a quality control prompt with the task, query, and positive
2. The LLM judges whether the query-positive pair matches the task requirements
3. Output: 0 (type mismatch), 1 (good match), 2 (bad match)
4. Only triplets with "1" in the judgment are kept
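
The filtering rule in the last step can be sketched as follows. This assumes the judge's raw verdict is stored under a "judge" key, as in the debug output shown later on this page; the helper names are hypothetical.

```python
from typing import Dict, List


def passes_quality_control(judgment: str) -> bool:
    """Keep a pair only when the judge's verdict contains "1" (good match)."""
    return "1" in judgment


def filter_triplets(triplets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # Each triplet is assumed to carry the judge's raw verdict under "judge".
    return [t for t in triplets if passes_quality_control(t["judge"])]


candidates = [
    {"query": "q0", "judge": "0"},  # type mismatch
    {"query": "q1", "judge": "1"},  # good match
    {"query": "q2", "judge": "2"},  # bad match
]
kept = filter_triplets(candidates)
```

A substring check like this is tolerant of extra whitespace or wording around the verdict, but it would also accept a response such as "10", so a stricter parser may be preferable when the judge's output format is not tightly constrained.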

Validation Criteria

Type Check: Verify query/positive match task category (text vs code)
Semantic Check: Verify query-positive relationship is correct
Task Alignment: Ensure pair aligns with specific task requirements

Hard Negative Generation

When gen_hard_neg=True:

1. Generate a prompt describing how to create a hard negative
2. Request the LLM to generate num_negatives examples
3. Hard negatives should appear relevant but not truly match
4. Deduplicate the generated negatives
5. Add them to the triplet['neg'] list
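
The deduplication and attachment steps might look like the sketch below. The function name is hypothetical, and dropping a negative that is identical to the positive is an assumption of this example rather than documented behavior.

```python
from typing import Any, Dict, List


def attach_hard_negatives(triplet: Dict[str, Any], generated: List[str]) -> Dict[str, Any]:
    """Deduplicate generated negatives and store them under triplet["neg"]."""
    # Assumption: a "negative" identical to the positive is dropped as well.
    seen = set(triplet["pos"])
    unique = []
    for neg in generated:
        text = neg.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)  # first-seen order is preserved
    triplet["neg"] = unique
    return triplet


triplet = {"query": "how to add two numbers", "pos": ["def add(a, b): return a + b"], "neg": []}
outputs = [
    "def sub(a, b): return a - b",
    "def sub(a, b): return a - b",  # exact duplicate
    "def add(a, b): return a + b",  # same as the positive
    "def mul(a, b): return a * b",
]
result = attach_hard_negatives(triplet, outputs)
```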

Caching Mechanism

Per-Document Caching

Cache key: MD5 hash of input text
Cache path: {cache_dir}/{md5_hash}.json
Cache content: List of generated triplets

Cache Behavior

Load from cache if exists
Generate only if cache miss
Update cache with hard negatives if needed
Persist immediately after generation
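
The cache key, path, and load-or-generate behavior described above can be sketched with the standard library. The helper names and the generate callback are hypothetical; only the MD5 key and the {cache_dir}/{md5_hash}.json layout come from this page.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable, Dict, List


def cache_path(cache_dir: str, text: str) -> str:
    # Cache key: MD5 hash of the input text.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.json")


def load_or_generate(cache_dir: str, text: str,
                     generate: Callable[[str], List[Dict]]) -> List[Dict]:
    path = cache_path(cache_dir, text)
    if os.path.exists(path):
        # Cache hit: reuse the stored triplets.
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    # Cache miss: generate, then persist immediately.
    triplets = generate(text)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(triplets, f, ensure_ascii=False)
    return triplets


calls = []


def fake_generate(text: str) -> List[Dict]:
    """Stand-in for the LLM-backed generator; records each invocation."""
    calls.append(text)
    return [{"query": "q", "pos": [text], "neg": []}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_generate(tmp, "def f(): pass", fake_generate)
    second = load_or_generate(tmp, "def f(): pass", fake_generate)
```

The second call is served from disk, so the generator stub runs only once for the same input text.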

Parallel Processing

Uses ThreadPoolExecutor for concurrent generation:

Each thread processes one positive example
Thread-safe caching per document
Progress bar with tqdm
Configurable thread_count
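
A minimal sketch of the fan-out, assuming one submitted job per positive and results collected as they complete. The run_parallel and fake_worker names are hypothetical, and the tqdm progress bar is omitted to keep the example dependency-free.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_parallel(positives: List[Dict], worker: Callable[[Dict], List[Dict]],
                 thread_count: int = 1) -> List[Dict]:
    results: List[Dict] = []
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # One job per positive example.
        futures = [pool.submit(worker, p) for p in positives]
        for future in as_completed(futures):
            # Each worker returns a (possibly empty) list of triplets.
            results.extend(future.result())
    return results


def fake_worker(data: Dict) -> List[Dict]:
    """Stand-in for per-document triplet generation."""
    return [{"query": f"about: {data['text']}", "pos": [data["text"]], "neg": []}]


positives = [{"text": f"snippet {i}"} for i in range(5)]
triplets = run_parallel(positives, fake_worker, thread_count=4)
```

With as_completed, results arrive in completion order rather than input order, so callers that need a stable order should sort or index the output themselves.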

Usage Examples

from triplet_generator import TripletGenerator

# Initialize generator
generator = TripletGenerator(
    model="Qwen2.5-72B-Instruct",
    model_type="open-source",
    port=8000,
    cache_dir="/tmp/cache"
)

# Basic generation
positives = [
    {"text": "def add(a, b):\n    return a + b"},
    {"text": "class User:\n    def __init__(self, name):\n        self.name = name"}
]

triplets = generator.run(
    positives=positives,
    task_type="web_code_retrieval",
    language="en",
    code_language="python",
    thread_count=4
)

# Output:
# [
#   {
#     "prompt": "Given a web search query, retrieve relevant code...",
#     "query": "how to add two numbers in python",
#     "pos": ["def add(a, b):\n    return a + b"],
#     "neg": []
#   },
#   ...
# ]

# With few-shot examples
examples = [
    {
        "input": "def multiply(x, y):\n    return x * y",
        "output": "python function to multiply two numbers"
    }
]

triplets = generator.run(
    positives=positives,
    task_type="code_summary_retrieval",
    examples_pool=examples,
    num_examples=2,
    language="en",
    code_language="python"
)

# With hard negatives
triplets = generator.run(
    positives=positives,
    task_type="api_usage_retrieval",
    gen_hard_neg=True,
    num_negatives=7,
    language="en",
    code_language="python"
)

# Output now includes neg field:
# {
#   "query": "how to use the add function",
#   "pos": ["def add(a, b):\n    return a + b"],
#   "neg": [
#     "def subtract(a, b): ...",
#     "def multiply(a, b): ...",
#     ...
#   ]
# }

# Code translation task
positives = [{"text": "def greet():\n    print('Hello')"}]

triplets = generator.run(
    positives=positives,
    task_type="code_translation_retrieval",
    language="en",
    code_language="python",
    tgt_code_language="java"
)

# Debug mode to see generation details
triplets = generator.run(
    positives=positives,
    task_type="bug_desc_retrieval",
    debug_mode=True,
    language="en",
    code_language="python"
)

# Output includes extra fields:
# {
#   "generation_prompt": "Given a piece of Python code...",
#   "query": "The code doesn't handle null input",
#   "pos": ["def add(a, b):\n    return a + b"],
#   "judge": "1",
#   "judge_response": "Yes, the query describes a bug..."
# }

# Single triplet generation
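data = {"text": "SELECT * FROM users WHERE age > 18"}
# get_task is not defined in this snippet and must be imported separately
# (assumed to come from the repository's task definitions)
task = get_task("text2sql_retrieval", "en", "sql")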

result = generator.generate_triplets(
    data=data,
    task=task,
    debug_mode=False
)

# Code modification task (requires similar code)
positives = [
    {
        "text": "def old_func(x):\n    return x * 2",
        "similar": ["def new_func(x):\n    return x * 3"]
    }
]

triplets = generator.run(
    positives=positives,
    task_type="code_modification_retrieval",
    language="en",
    code_language="python"
)

Task-Specific Methods

_gen_for_normal_task

Handles standard single-step generation for most tasks.

_gen_for_code_context_retrieval

Splits code into prefix and suffix without LLM generation.

_gen_for_bug_desc_retrieval

Generates buggy code first, then bug description.

_gen_for_code_modification_retrieval

Compares two code versions and generates modification instructions.

_gen_for_code_comparison_retrieval

Generates comparison question and answer for two code snippets.

_gen_for_two_step_not_use_last

Two-step generation where second step doesn't include first step output.

_gen_for_two_step_use_last

Two-step generation where second step includes first step output.

_arrange_query_and_pos

Determines which generated text is query vs positive based on task type.

Error Handling

Graceful handling of LLM generation failures
Warning messages for exceptions
Returns empty list if generation fails
Continues processing other positives even if one fails
Validates response format before processing
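
The behavior listed above can be sketched as a wrapper around a single generation call. The safe_generate name, the list-type format check, and the flaky_generate stub are assumptions made for this example.

```python
import warnings
from typing import Callable, Dict, List


def safe_generate(data: Dict, generate: Callable[[Dict], List[Dict]]) -> List[Dict]:
    """Run one generation call; warn and return [] instead of raising."""
    try:
        triplets = generate(data)
    except Exception as exc:
        warnings.warn(f"generation failed for one input: {exc}")
        return []
    if not isinstance(triplets, list):
        # Validate the response format before further processing.
        warnings.warn("unexpected response format; skipping input")
        return []
    return triplets


def flaky_generate(data: Dict) -> List[Dict]:
    """Stand-in generator that fails on inputs containing 'bad'."""
    if "bad" in data["text"]:
        raise RuntimeError("LLM request timed out")
    return [{"query": "q", "pos": [data["text"]], "neg": []}]


inputs = [{"text": "good one"}, {"text": "bad one"}, {"text": "good two"}]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = [t for item in inputs for t in safe_generate(item, flaky_generate)]
```

The failing input produces a warning and contributes no triplets, while the remaining inputs are processed normally.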

Performance Optimization

Per-document caching avoids redundant generation
Thread pool for parallel processing
Batch processing with progress tracking
Configurable LLM parameters for speed/quality tradeoff
Optional removal of LLM reasoning traces

Integration with LLM Base Class

Inherits from LLM class:

chat() method for LLM interaction
Model management (open-source or proprietary)
Token management and generation parameters
Support for multiple response generation (for hard negatives)
