Implementation:FlagOpen FlagEmbedding BGE Coder TripletGenerator

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Data Generation, Large Language Models, Code Retrieval
Last Updated 2026-02-09 00:00 GMT

Overview

An LLM-powered triplet generator that creates query-positive-negative training examples for code embedding models through automated generation and quality control.

Description

TripletGenerator is a class that orchestrates the generation of training triplets (query, positive, negative) for code retrieval models. Depending on the task type, it uses a large language model either to synthesize realistic queries from code snippets or to generate code from textual descriptions. Task-specific generation strategies cover multi-step tasks such as bug description retrieval and code modification.

The generator includes built-in quality control mechanisms that validate each generated triplet using LLM-based judgment, filtering out low-quality or mismatched pairs. It supports parallel processing with thread pools, per-document caching for efficiency, optional hard negative generation for improved training difficulty, and flexible few-shot learning with example pools.

Key features include specialized generation logic for 63 task types, support for both single-step and multi-step generation workflows, automatic arrangement of query-positive pairs by task category (text2code, code2text, code2code, hybrid), and error handling that skips failed generations instead of aborting the run.

Usage

Use this class to generate synthetic training data for code embedding models by providing code snippets or text and letting the LLM create corresponding queries, positives, and optionally hard negatives with quality validation.

Code Reference

Source Location

Signature

class TripletGenerator(LLM):
    def __init__(
        self,
        model: str = "Qwen2-5-Coder-32B-Instruct",
        model_type: str = "open-source",
        port: int = 8000,
        cache_dir: Optional[str] = None
    )

    def generate_triplets(
        self,
        data: dict,
        task: Task,
        examples_pool: Optional[List[dict]] = None,
        num_examples: int = 3,
        debug_mode: bool = False,
        **kwargs
    ) -> List[dict]

    def run(
        self,
        positives: List[dict],
        task_type: str,
        language: str = "en",
        code_language: str = "python",
        tgt_code_language: Optional[str] = None,
        examples_pool: Optional[List[dict]] = None,
        num_examples: int = 3,
        tqdm_desc: str = "Generating triplets",
        debug_mode: bool = False,
        gen_hard_neg: bool = False,
        num_negatives: int = 7,
        thread_count: int = 1,
        **kwargs
    ) -> List[dict]

    def gen_hard_negatives(
        self,
        result: dict,
        task: Task,
        num_negatives: int = 7,
        **kwargs
    ) -> dict

Import

from triplet_generator import TripletGenerator

I/O Contract

Inputs

Name Type Required Description
positives List[dict] Yes List of positive examples with 'text' field (code or text)
task_type str Yes Task type identifier (e.g., "web_code_retrieval")
language str No Natural language (default: "en")
code_language str No Programming language (default: "python")
tgt_code_language Optional[str] No Target code language for translation tasks
examples_pool Optional[List[dict]] No Pool of few-shot examples
num_examples int No Number of examples to use (default: 3)
debug_mode bool No Include generation details in output (default: False)
gen_hard_neg bool No Generate hard negatives (default: False)
num_negatives int No Number of hard negatives (default: 7)
thread_count int No Parallel threads (default: 1)

Outputs

Name Type Description
triplets List[dict] Generated triplets with prompt, query, pos, neg fields
prompt str Task instruction string
query str Generated query (text or code)
pos List[str] List with single positive example
neg List[str] List of negative examples (if gen_hard_neg=True)

Generation Strategies

Normal Task Generation

For standard single-step tasks:

  1. Generate prompt using get_generation_prompt()
  2. Call LLM to generate output
  3. Arrange query and positive based on task type
  4. Apply quality control filtering
  5. Return validated triplet

# Text2Code: Generate query from code
# Input: code snippet
# Output: query = generated text, pos = input code

# Code2Text: Generate text from code
# Input: code snippet
# Output: query = input code, pos = generated text

# Code2Code: Generate code from code
# Input: code snippet
# Output: query = input code, pos = generated code
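
The arrangement rules above can be sketched as a small function. This is an illustration only: `arrange_query_and_pos` here is a hypothetical stand-in, not the class's private `_arrange_query_and_pos` method, and the hybrid category is left out because its arrangement depends on the specific task.

```python
from typing import Tuple


def arrange_query_and_pos(category: str, input_text: str, generated: str) -> Tuple[str, str]:
    """Return (query, positive) for one input/generated pair.

    Hypothetical helper: the category names follow this page, the
    function itself is an assumption made for illustration.
    """
    if category == "text2code":
        # The LLM wrote a text query for the given code.
        return generated, input_text
    if category in ("code2text", "code2code"):
        # The given code is the query; the LLM output is the positive.
        return input_text, generated
    raise ValueError(f"unsupported category: {category}")


query, pos = arrange_query_and_pos(
    "text2code",
    "def add(a, b):\n    return a + b",
    "add two numbers in python",
)
print(query)  # add two numbers in python
```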

Code Context Retrieval

Special handling for code continuation:

  1. Split code at anchor point (40%-70%)
  2. Former part becomes query
  3. Latter part becomes positive
  4. No LLM generation needed
  5. No quality control applied
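
A minimal sketch of the split, under stated assumptions: the page only gives the 40%-70% range, so the line-based cut, the uniform sampling, and the helper name `split_code_at_anchor` are all hypothetical choices made for illustration.

```python
import random
from typing import Optional, Tuple


def split_code_at_anchor(code: str, rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Split code into (former, latter) at a random anchor between 40% and 70%.

    Assumption: the anchor is chosen on line boundaries; the real
    implementation may split differently.
    """
    rng = rng or random.Random()
    lines = code.splitlines(keepends=True)
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    ratio = rng.uniform(0.4, 0.7)
    # Clamp the cut so that both parts are non-empty.
    cut = min(max(int(len(lines) * ratio), 1), len(lines) - 1)
    return "".join(lines[:cut]), "".join(lines[cut:])


code = "".join(f"line_{i} = {i}\n" for i in range(10))
former, latter = split_code_at_anchor(code, random.Random(0))
# former is used as the query, latter as the positive
print(former + latter == code)  # True
```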

Bug Description Retrieval

Two-step process:

  1. Generate buggy version of code
  2. Generate bug description from buggy code
  3. Query = bug description, Positive = original (correct) code
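
The two steps can be sketched with a stub in place of the model. Everything below is hypothetical: `gen_bug_desc_triplet`, the prompt wording, and `fake_llm` are stand-ins for illustration, not the prompts or methods the class actually uses.

```python
from typing import Callable, Dict


def gen_bug_desc_triplet(code: str, llm: Callable[[str], str]) -> Dict[str, object]:
    """Build one bug-description triplet from correct code (illustrative only)."""
    # Step 1: ask the model for a buggy variant of the original code.
    buggy_code = llm(f"Introduce a realistic bug into this code:\n{code}")
    # Step 2: ask the model to describe the bug in the buggy variant.
    bug_description = llm(f"Describe the bug in this code:\n{buggy_code}")
    # The description is the query; the original, correct code is the positive.
    return {"query": bug_description, "pos": [code], "neg": []}


def fake_llm(prompt: str) -> str:
    """Stub model that returns canned answers so the sketch runs offline."""
    if prompt.startswith("Introduce"):
        return "def add(a, b):\n    return a - b"
    return "add() subtracts instead of adding"


original = "def add(a, b):\n    return a + b"
triplet = gen_bug_desc_triplet(original, fake_llm)
print(triplet["query"])  # add() subtracts instead of adding
```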

Code Modification Retrieval

Two-step process requiring similar code:

  1. Generate differences between input and similar code
  2. Generate modification instruction from differences
  3. Query = modification instruction + input code, Positive = similar code

Code Comparison Retrieval

Two-step process:

  1. Generate comparison question from two code snippets
  2. Generate comparison answer
  3. Query = question + both code snippets, Positive = answer

Hybrid Two-Step Tasks

For tasks like refactoring, style guidelines, migration:

  1. Generate natural language description/question
  2. Generate code output based on description
  3. Optionally reverse query and positive based on task

Quality Control

Every generated triplet (except code_context_retrieval) undergoes validation:

Validation Process

  1. Generate quality control prompt with task, query, and positive
  2. LLM judges if query-positive pair matches task requirements
  3. Output: 0 (type mismatch), 1 (good match), 2 (bad match)
  4. Only triplets with "1" in judgment are kept
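
The keep/drop rule in the last step can be sketched as a filter over judged candidates. The helper names and the `judge` key are assumptions for illustration; the substring test mirrors the "1 in judgment" wording above, and a stricter parser may be preferable if the judge can return longer text.

```python
from typing import Dict, List


def is_accepted(judgment: str) -> bool:
    """Accept a pair only when the judge's verdict contains "1" (good match)."""
    return "1" in judgment


def filter_triplets(candidates: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the candidates whose judgment marks a good match."""
    return [c for c in candidates if is_accepted(c["judge"])]


candidates = [
    {"query": "q0", "judge": "0"},  # type mismatch
    {"query": "q1", "judge": "1"},  # good match
    {"query": "q2", "judge": "2"},  # bad match
]
kept = filter_triplets(candidates)
print([c["query"] for c in kept])  # ['q1']
```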

Validation Criteria

  • Type Check: Verify query/positive match task category (text vs code)
  • Semantic Check: Verify query-positive relationship is correct
  • Task Alignment: Ensure pair aligns with specific task requirements

Hard Negative Generation

When gen_hard_neg=True:

  1. Generate prompt describing how to create hard negative
  2. Request LLM to generate num_negatives examples
  3. Hard negatives should appear relevant but not truly match
  4. Deduplicate generated negatives
  5. Add to triplet['neg'] list
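
The deduplication and merge steps might look like the sketch below. `add_hard_negatives` is a hypothetical helper, not the class's `gen_hard_negatives` method, and dropping candidates that equal the positive is an extra assumption that the page does not state.

```python
from typing import List


def add_hard_negatives(triplet: dict, generated: List[str]) -> dict:
    """Return a copy of the triplet with deduplicated negatives appended."""
    # Assumption: a candidate identical to a positive is not a usable negative.
    seen = set(triplet.get("pos", [])) | set(triplet.get("neg", []))
    negs = list(triplet.get("neg", []))
    for candidate in generated:
        text = candidate.strip()
        if text and text not in seen:
            seen.add(text)
            negs.append(text)  # first occurrence wins, order is preserved
    return {**triplet, "neg": negs}


triplet = {
    "query": "how to add two numbers",
    "pos": ["def add(a, b): return a + b"],
    "neg": [],
}
generated = [
    "def sub(a, b): return a - b",
    "def sub(a, b): return a - b",  # duplicate
    "def add(a, b): return a + b",  # same as the positive
    "def mul(a, b): return a * b",
]
result = add_hard_negatives(triplet, generated)
print(len(result["neg"]))  # 2
```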

Caching Mechanism

Per-Document Caching

  • Cache key: MD5 hash of input text
  • Cache path: {cache_dir}/{md5_hash}.json
  • Cache content: List of generated triplets

Cache Behavior

  • Load from cache if exists
  • Generate only if cache miss
  • Update cache with hard negatives if needed
  • Persist immediately after generation
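
A minimal sketch of the load-or-generate behavior, using only the standard library. The function names and the `generate` callback are assumptions for illustration; updating a cached entry with hard negatives is not shown.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable, List


def cache_path(cache_dir: str, text: str) -> str:
    """Build {cache_dir}/{md5_hash}.json from the input text."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.json")


def load_or_generate(cache_dir: str, text: str, generate: Callable[[str], List[dict]]) -> List[dict]:
    """Return cached triplets if present, otherwise generate and persist them."""
    path = cache_path(cache_dir, text)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    triplets = generate(text)
    os.makedirs(cache_dir, exist_ok=True)
    # Persist immediately so a later run can skip this document.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(triplets, f, ensure_ascii=False)
    return triplets


calls = []


def fake_generate(text: str) -> List[dict]:
    """Stub generator that records how often it is invoked."""
    calls.append(text)
    return [{"query": "q", "pos": [text], "neg": []}]


with tempfile.TemporaryDirectory() as tmp_dir:
    first = load_or_generate(tmp_dir, "def f(): pass", fake_generate)
    second = load_or_generate(tmp_dir, "def f(): pass", fake_generate)

print(len(calls))  # 1
```

The second call is served from disk, so the stub generator runs only once.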

Parallel Processing

Uses ThreadPoolExecutor for concurrent generation:

  • Each thread processes one positive example
  • Thread-safe caching per document
  • Progress bar with tqdm
  • Configurable thread_count
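
The thread-pool pattern can be sketched as follows. `run_parallel` and `fake_worker` are hypothetical names; the tqdm progress bar is omitted to keep the example dependency-free, and `map()` is used here for ordered results, which may differ from how the class collects futures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_parallel(
    positives: List[dict],
    worker: Callable[[dict], List[dict]],
    thread_count: int = 1,
) -> List[dict]:
    """Process each positive in a thread pool and flatten the results."""
    results: List[dict] = []
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # map() preserves input order; each call handles one positive example.
        for triplets in pool.map(worker, positives):
            results.extend(triplets)
    return results


def fake_worker(item: dict) -> List[dict]:
    """Stub standing in for per-document triplet generation."""
    return [{"query": "q", "pos": [item["text"]], "neg": []}]


positives = [{"text": "def a(): pass"}, {"text": "def b(): pass"}]
results = run_parallel(positives, fake_worker, thread_count=4)
print(len(results))  # 2
```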

Usage Examples

from triplet_generator import TripletGenerator

# Initialize generator
generator = TripletGenerator(
    model="Qwen2.5-72B-Instruct",
    model_type="open-source",
    port=8000,
    cache_dir="/tmp/cache"
)

# Basic generation
positives = [
    {"text": "def add(a, b):\n    return a + b"},
    {"text": "class User:\n    def __init__(self, name):\n        self.name = name"}
]

triplets = generator.run(
    positives=positives,
    task_type="web_code_retrieval",
    language="en",
    code_language="python",
    thread_count=4
)

# Output:
# [
#   {
#     "prompt": "Given a web search query, retrieve relevant code...",
#     "query": "how to add two numbers in python",
#     "pos": ["def add(a, b):\n    return a + b"],
#     "neg": []
#   },
#   ...
# ]

# With few-shot examples
examples = [
    {
        "input": "def multiply(x, y):\n    return x * y",
        "output": "python function to multiply two numbers"
    }
]

triplets = generator.run(
    positives=positives,
    task_type="code_summary_retrieval",
    examples_pool=examples,
    num_examples=2,
    language="en",
    code_language="python"
)

# With hard negatives
triplets = generator.run(
    positives=positives,
    task_type="api_usage_retrieval",
    gen_hard_neg=True,
    num_negatives=7,
    language="en",
    code_language="python"
)

# Output now has a populated neg field:
# {
#   "query": "how to use the add function",
#   "pos": ["def add(a, b):\n    return a + b"],
#   "neg": [
#     "def subtract(a, b): ...",
#     "def multiply(a, b): ...",
#     ...
#   ]
# }

# Code translation task
positives = [{"text": "def greet():\n    print('Hello')"}]

triplets = generator.run(
    positives=positives,
    task_type="code_translation_retrieval",
    language="en",
    code_language="python",
    tgt_code_language="java"
)

# Debug mode to see generation details
triplets = generator.run(
    positives=positives,
    task_type="bug_desc_retrieval",
    debug_mode=True,
    language="en",
    code_language="python"
)

# Output includes extra fields:
# {
#   "generation_prompt": "Given a piece of Python code...",
#   "query": "The code doesn't handle null input",
#   "pos": ["def add(a, b):\n    return a + b"],
#   "judge": "1",
#   "judge_response": "Yes, the query describes a bug..."
# }

# Single triplet generation
data = {"text": "SELECT * FROM users WHERE age > 18"}
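# get_task builds the Task object; import it from the module that defines Task (not shown on this page)
task = get_task("text2sql_retrieval", "en", "sql")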

result = generator.generate_triplets(
    data=data,
    task=task,
    debug_mode=False
)

# Code modification task (requires similar code)
positives = [
    {
        "text": "def old_func(x):\n    return x * 2",
        "similar": ["def new_func(x):\n    return x * 3"]
    }
]

triplets = generator.run(
    positives=positives,
    task_type="code_modification_retrieval",
    language="en",
    code_language="python"
)

Task-Specific Methods

_gen_for_normal_task

Handles standard single-step generation for most tasks.

_gen_for_code_context_retrieval

Splits code into prefix and suffix without LLM generation.

_gen_for_bug_desc_retrieval

Generates buggy code first, then bug description.

_gen_for_code_modification_retrieval

Compares two code versions and generates modification instructions.

_gen_for_code_comparison_retrieval

Generates comparison question and answer for two code snippets.

_gen_for_two_step_not_use_last

Two-step generation in which the second step does not include the first step's output.

_gen_for_two_step_use_last

Two-step generation in which the second step includes the first step's output.

_arrange_query_and_pos

Determines which generated text is query vs positive based on task type.

Error Handling

  • Graceful handling of LLM generation failures
  • Warning messages for exceptions
  • Returns empty list if generation fails
  • Continues processing other positives even if one fails
  • Validates response format before processing
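
The behavior listed above, warning on failure, returning an empty list, and continuing with the remaining inputs, can be sketched as follows. `safe_generate`, `process_all`, and `flaky_generate` are hypothetical names used only for this illustration.

```python
import warnings
from typing import Callable, List


def safe_generate(item: dict, generate: Callable[[dict], List[dict]]) -> List[dict]:
    """Run one generation; on failure, warn and return an empty list."""
    try:
        return generate(item)
    except Exception as exc:
        warnings.warn(f"generation failed: {exc}")
        return []


def process_all(positives: List[dict], generate: Callable[[dict], List[dict]]) -> List[dict]:
    """Process every positive, skipping the ones that fail."""
    results: List[dict] = []
    for item in positives:
        results.extend(safe_generate(item, generate))
    return results


def flaky_generate(item: dict) -> List[dict]:
    """Stub generator that fails on empty input."""
    if not item["text"]:
        raise ValueError("empty input")
    return [{"query": "q", "pos": [item["text"]], "neg": []}]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = process_all([{"text": "a"}, {"text": ""}, {"text": "b"}], flaky_generate)

print(len(results), len(caught))  # 2 1
```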

Performance Optimization

  • Per-document caching avoids redundant generation
  • Thread pool for parallel processing
  • Batch processing with progress tracking
  • Configurable LLM parameters for speed/quality tradeoff
  • Optional removal of LLM reasoning traces

Integration with LLM Base Class

Inherits from LLM class:

  • chat() method for LLM interaction
  • Model management (open-source or proprietary)
  • Token management and generation parameters
  • Support for multiple response generation (for hard negatives)
