Implementation:FlagOpen FlagEmbedding BGE Coder TripletGenerator
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Generation, Large Language Models, Code Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An LLM-powered triplet generator that creates query-positive-negative training examples for code embedding models through automated generation and quality control.
Description
TripletGenerator is a class that generates training triplets (query, positive, negative) for code retrieval models. Depending on the task type, it uses a large language model either to synthesize a query from a code snippet or to generate code from a textual description. Multi-step tasks such as bug description retrieval and code modification have dedicated generation routines.
Each generated pair is validated by an LLM-based judge, and low-quality or mismatched pairs are filtered out. The generator also supports parallel processing with a thread pool, per-document caching, optional hard negative generation to make training examples harder, and few-shot prompting from an example pool.
It handles 63 task types with task-specific generation logic, covers both single-step and multi-step workflows, arranges each query-positive pair according to the task category (text2code, code2text, code2code, hybrid), and continues past generation failures instead of aborting the run.
Usage
Use this class to generate synthetic training data for code embedding models by providing code snippets or text and letting the LLM create corresponding queries, positives, and optionally hard negatives with quality validation.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/data_generation/triplet_generator.py
- Lines: 1-654
Signature
class TripletGenerator(LLM):
    def __init__(
        self,
        model: str = "Qwen2-5-Coder-32B-Instruct",
        model_type: str = "open-source",
        port: int = 8000,
        cache_dir: Optional[str] = None
    )
    def generate_triplets(
        self,
        data: dict,
        task: Task,
        examples_pool: Optional[List[dict]] = None,
        num_examples: int = 3,
        debug_mode: bool = False,
        **kwargs
    ) -> List[dict]
    def run(
        self,
        positives: List[dict],
        task_type: str,
        language: str = "en",
        code_language: str = "python",
        tgt_code_language: Optional[str] = None,
        examples_pool: Optional[List[dict]] = None,
        num_examples: int = 3,
        tqdm_desc: str = "Generating triplets",
        debug_mode: bool = False,
        gen_hard_neg: bool = False,
        num_negatives: int = 7,
        thread_count: int = 1,
        **kwargs
    ) -> List[dict]
    def gen_hard_negatives(
        self,
        result: dict,
        task: Task,
        num_negatives: int = 7,
        **kwargs
    ) -> dict
Import
from triplet_generator import TripletGenerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| positives | List[dict] | Yes | List of input documents, each a dict with a 'text' field (code or text); code modification tasks also expect a 'similar' field |
| task_type | str | Yes | Task type identifier (e.g., "web_code_retrieval") |
| language | str | No | Natural language (default: "en") |
| code_language | str | No | Programming language (default: "python") |
| tgt_code_language | Optional[str] | No | Target code language for translation tasks |
| examples_pool | Optional[List[dict]] | No | Pool of few-shot examples |
| num_examples | int | No | Number of examples to use (default: 3) |
| debug_mode | bool | No | Include generation details in output (default: False) |
| gen_hard_neg | bool | No | Generate hard negatives (default: False) |
| num_negatives | int | No | Number of hard negatives (default: 7) |
| thread_count | int | No | Parallel threads (default: 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| triplets | List[dict] | Generated triplets with prompt, query, pos, neg fields |
| prompt | str | Task instruction string |
| query | str | Generated query (text or code) |
| pos | List[str] | List with single positive example |
| neg | List[str] | List of negative examples (if gen_hard_neg=True) |
Generation Strategies
Normal Task Generation
For standard single-step tasks:
1. Generate prompt using get_generation_prompt()
2. Call LLM to generate output
3. Arrange query and positive based on task type
4. Apply quality control filtering
5. Return validated triplet
# Text2Code: Generate query from code
# Input: code snippet
# Output: query = generated text, pos = input code
# Code2Text: Generate text from code
# Input: code snippet
# Output: query = input code, pos = generated text
# Code2Code: Generate code from code
# Input: code snippet
# Output: query = input code, pos = generated code
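The arrangement rules in the comments above can be sketched as a small helper. This is an illustration only: the function name `arrange_query_and_pos`, its signature, and the category strings are assumptions for this example and are not taken from the class's private `_arrange_query_and_pos` method. Hybrid tasks are arranged per task, so the sketch refuses to guess for them.

```python
from typing import Tuple


def arrange_query_and_pos(category: str, input_text: str, generated: str) -> Tuple[str, str]:
    """Return (query, positive) for a single-step task (hypothetical helper)."""
    if category == "text2code":
        # The LLM wrote a text query for the given code.
        return generated, input_text
    if category in ("code2text", "code2code"):
        # The input code is the query; the LLM output is the positive.
        return input_text, generated
    # Hybrid tasks depend on the specific task, so this sketch does not guess.
    raise ValueError(f"no fixed arrangement for category: {category}")


query, pos = arrange_query_and_pos(
    "text2code",
    "def add(a, b):\n    return a + b",
    "how to add two numbers in python",
)
triplet = {"query": query, "pos": [pos], "neg": []}
```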
Code Context Retrieval
Special handling for code continuation:
1. Split code at anchor point (40%-70%)
2. Former part becomes query
3. Latter part becomes positive
4. No LLM generation needed
5. No quality control applied
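A minimal sketch of the split described above. It assumes the anchor is chosen uniformly at random and that the split happens on line boundaries; the source only states the 40%-70% range, so both choices, along with the name `split_code_at_anchor`, are assumptions made for illustration.

```python
import random
from typing import Optional, Tuple


def split_code_at_anchor(code: str, rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Split `code` into (former, latter) at a point 40%-70% of the way in."""
    rng = rng or random.Random()
    lines = code.splitlines(keepends=True)
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    fraction = rng.uniform(0.4, 0.7)
    # Clamp so that both parts are non-empty (assumption: split on lines).
    anchor = min(max(1, round(len(lines) * fraction)), len(lines) - 1)
    return "".join(lines[:anchor]), "".join(lines[anchor:])


code = "".join(f"line_{i} = {i}\n" for i in range(10))
former, latter = split_code_at_anchor(code, random.Random(0))
# The former part is the query and the latter part is the positive.
triplet = {"query": former, "pos": [latter], "neg": []}
```

Passing a seeded `random.Random` keeps the split reproducible, which helps when comparing cached outputs across runs.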
Bug Description Retrieval
Two-step process:
1. Generate buggy version of code
2. Generate bug description from buggy code
3. Query = bug description, Positive = original (correct) code
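The chaining of the two LLM calls can be sketched as follows. The `llm` argument is a hypothetical callable that maps a prompt string to a response string, and the prompt wording is invented for this example; the real class builds its prompts elsewhere and calls the inherited `chat()` method.

```python
from typing import Callable, Optional


def gen_bug_desc_triplet(code: str, llm: Callable[[str], str]) -> Optional[dict]:
    """Two-step sketch: buggy code first, then a description of the bug."""
    # Step 1: ask for a buggy variant of the original code (hypothetical prompt).
    buggy_code = llm(f"Introduce a realistic bug into this code:\n{code}")
    if not buggy_code:
        return None
    # Step 2: describe the bug, using the output of step 1 as input.
    bug_description = llm(f"Describe the bug in this code:\n{buggy_code}")
    if not bug_description:
        return None
    # The description is the query; the original, correct code is the positive.
    return {"query": bug_description, "pos": [code], "neg": []}


prompts_seen = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    prompts_seen.append(prompt)
    if prompt.startswith("Introduce"):
        return "def add(a, b):\n    return a - b"
    return "add() subtracts instead of adding"


original = "def add(a, b):\n    return a + b"
triplet = gen_bug_desc_triplet(original, fake_llm)
```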
Code Modification Retrieval
Two-step process requiring similar code:
1. Generate differences between input and similar code
2. Generate modification instruction from differences
3. Query = modification instruction + input code, Positive = similar code
Code Comparison Retrieval
Two-step process:
1. Generate comparison question from two code snippets
2. Generate comparison answer
3. Query = question + both code snippets, Positive = answer
Hybrid Two-Step Tasks
For tasks like refactoring, style guidelines, migration:
1. Generate natural language description/question
2. Generate code output based on description
3. Optionally reverse query and positive based on task
Quality Control
Every generated triplet (except code_context_retrieval) undergoes validation:
Validation Process
1. Generate quality control prompt with task, query, and positive
2. LLM judges if query-positive pair matches task requirements
3. Output: 0 (type mismatch), 1 (good match), 2 (bad match)
4. Only triplets with "1" in judgment are kept
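The acceptance rule above can be sketched as a filter. The helper names and the `judge` callable are hypothetical; in the real class the judgment comes from an LLM call on a quality control prompt. Note that a plain substring check is loose: it would also accept a response such as "10", so a stricter parser may be preferable if the judge is not constrained to a single digit.

```python
from typing import Callable, List


def passes_quality_control(judge_output: str) -> bool:
    """Keep a pair only when the judgment contains "1" (good match)."""
    return "1" in judge_output


def filter_triplets(triplets: List[dict], judge: Callable[[dict], str]) -> List[dict]:
    """Return only the triplets whose judgment passes (hypothetical helper)."""
    return [t for t in triplets if passes_quality_control(judge(t))]


# Stubbed judgments standing in for LLM responses: 0, 1, or 2.
judgments = {"good": "1", "wrong type": "0", "bad": "2"}
candidates = [{"query": q, "pos": ["..."]} for q in judgments]
kept = filter_triplets(candidates, lambda t: judgments[t["query"]])
```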
Validation Criteria
- Type Check: Verify query/positive match task category (text vs code)
- Semantic Check: Verify query-positive relationship is correct
- Task Alignment: Ensure pair aligns with specific task requirements
Hard Negative Generation
When gen_hard_neg=True:
1. Generate prompt describing how to create hard negative
2. Request LLM to generate num_negatives examples
3. Hard negatives should appear relevant but not truly match
4. Deduplicate generated negatives
5. Add to triplet['neg'] list
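The deduplication and merge steps might look like the sketch below. The helper name `add_hard_negatives` is hypothetical, and stripping whitespace and dropping empty responses are assumptions of this example rather than documented behavior.

```python
from typing import Iterable


def add_hard_negatives(triplet: dict, candidates: Iterable[str]) -> dict:
    """Merge LLM-generated negatives into triplet['neg'], skipping duplicates."""
    merged = list(triplet.get("neg", []))
    seen = set(merged)
    for candidate in candidates:
        text = candidate.strip()  # assumption: surrounding whitespace is noise
        if text and text not in seen:
            seen.add(text)
            merged.append(text)
    # Return a new dict so the input triplet is left unchanged.
    return {**triplet, "neg": merged}


triplet = {"query": "how to use the add function", "pos": ["def add(a, b): ..."], "neg": []}
responses = ["def subtract(a, b): ...", "def multiply(a, b): ...", "def subtract(a, b): ...", ""]
with_negs = add_hard_negatives(triplet, responses)
```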
Caching Mechanism
Per-Document Caching
- Cache key: MD5 hash of input text
- Cache path: {cache_dir}/{md5_hash}.json
- Cache content: List of generated triplets
Cache Behavior
- Load from cache if exists
- Generate only if cache miss
- Update cache with hard negatives if needed
- Persist immediately after generation
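A minimal sketch of the load-or-generate pattern described above, using only the standard library. The function names and the `generate` callable are hypothetical; the key and path layout follow the bullets above.

```python
import hashlib
import json
import os
from typing import Callable, List


def cache_path_for(text: str, cache_dir: str) -> str:
    """Build {cache_dir}/{md5_hash}.json from the input text."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.json")


def load_or_generate(text: str, cache_dir: str,
                     generate: Callable[[str], List[dict]]) -> List[dict]:
    """Return cached triplets if present; otherwise generate and persist them."""
    path = cache_path_for(text, cache_dir)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    triplets = generate(text)
    os.makedirs(cache_dir, exist_ok=True)
    # Persist immediately so an interrupted run does not lose finished work.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(triplets, f, ensure_ascii=False)
    return triplets
```

Because the key depends only on the input text, rerunning with the same documents in this sketch reuses earlier results even if other settings change; whether that is desirable depends on the workflow.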
Parallel Processing
Uses ThreadPoolExecutor for concurrent generation:
- Each thread processes one positive example
- Thread-safe caching per document
- Progress bar with tqdm
- Configurable thread_count
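The thread pool pattern can be sketched with `concurrent.futures`. The names `run_parallel` and `generate_one` are hypothetical, and the progress bar is omitted to keep the example dependency-free; wrapping the `as_completed(...)` iterator in `tqdm` is one way to add it. A failure for one input is reported and skipped so the remaining inputs still complete.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List


def run_parallel(positives: List[dict],
                 generate_one: Callable[[dict], List[dict]],
                 thread_count: int = 1) -> List[dict]:
    """Run generate_one on each positive in a thread pool and flatten results."""
    results: List[dict] = []
    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        futures = [executor.submit(generate_one, p) for p in positives]
        # as_completed yields in completion order, not submission order.
        for future in as_completed(futures):
            try:
                results.extend(future.result())
            except Exception as exc:
                # One failed document should not abort the whole run.
                print(f"warning: generation failed: {exc}")
    return results


def fake_generate(positive: dict) -> List[dict]:
    """Stand-in for the real generation call."""
    if not positive["text"]:
        raise ValueError("empty input")
    return [{"query": f"query for {positive['text']}", "pos": [positive["text"]], "neg": []}]


inputs = [{"text": "a"}, {"text": ""}, {"text": "b"}]
triplets = run_parallel(inputs, fake_generate, thread_count=2)
```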
Usage Examples
from triplet_generator import TripletGenerator
# Initialize generator
generator = TripletGenerator(
model="Qwen2.5-72B-Instruct",
model_type="open-source",
port=8000,
cache_dir="/tmp/cache"
)
# Basic generation
positives = [
{"text": "def add(a, b):\n return a + b"},
{"text": "class User:\n def __init__(self, name):\n self.name = name"}
]
triplets = generator.run(
positives=positives,
task_type="web_code_retrieval",
language="en",
code_language="python",
thread_count=4
)
# Output:
# [
# {
# "prompt": "Given a web search query, retrieve relevant code...",
# "query": "how to add two numbers in python",
# "pos": ["def add(a, b):\n return a + b"],
# "neg": []
# },
# ...
# ]
# With few-shot examples
examples = [
{
"input": "def multiply(x, y):\n return x * y",
"output": "python function to multiply two numbers"
}
]
triplets = generator.run(
positives=positives,
task_type="code_summary_retrieval",
examples_pool=examples,
num_examples=2,
language="en",
code_language="python"
)
# With hard negatives
triplets = generator.run(
positives=positives,
task_type="api_usage_retrieval",
gen_hard_neg=True,
num_negatives=7,
language="en",
code_language="python"
)
# Output now includes neg field:
# {
# "query": "how to use the add function",
# "pos": ["def add(a, b):\n return a + b"],
# "neg": [
# "def subtract(a, b): ...",
# "def multiply(a, b): ...",
# ...
# ]
# }
# Code translation task
positives = [{"text": "def greet():\n print('Hello')"}]
triplets = generator.run(
positives=positives,
task_type="code_translation_retrieval",
language="en",
code_language="python",
tgt_code_language="java"
)
# Debug mode to see generation details
triplets = generator.run(
positives=positives,
task_type="bug_desc_retrieval",
debug_mode=True,
language="en",
code_language="python"
)
# Output includes extra fields:
# {
# "generation_prompt": "Given a piece of Python code...",
# "query": "The code doesn't handle null input",
# "pos": ["def add(a, b):\n return a + b"],
# "judge": "1",
# "judge_response": "Yes, the query describes a bug..."
# }
# Single triplet generation
data = {"text": "SELECT * FROM users WHERE age > 18"}
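# get_task is not imported in this snippet; it is assumed to come from the module that defines Task
task = get_task("text2sql_retrieval", "en", "sql")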
result = generator.generate_triplets(
data=data,
task=task,
debug_mode=False
)
# Code modification task (requires similar code)
positives = [
{
"text": "def old_func(x):\n return x * 2",
"similar": ["def new_func(x):\n return x * 3"]
}
]
triplets = generator.run(
positives=positives,
task_type="code_modification_retrieval",
language="en",
code_language="python"
)
Task-Specific Methods
_gen_for_normal_task
Handles standard single-step generation for most tasks.
_gen_for_code_context_retrieval
Splits code into prefix and suffix without LLM generation.
_gen_for_bug_desc_retrieval
Generates buggy code first, then bug description.
_gen_for_code_modification_retrieval
Compares two code versions and generates modification instructions.
_gen_for_code_comparison_retrieval
Generates comparison question and answer for two code snippets.
_gen_for_two_step_not_use_last
Two-step generation where second step doesn't include first step output.
_gen_for_two_step_use_last
Two-step generation where second step includes first step output.
_arrange_query_and_pos
Determines which generated text is query vs positive based on task type.
Error Handling
- Graceful handling of LLM generation failures
- Warning messages for exceptions
- Returns empty list if generation fails
- Continues processing other positives even if one fails
- Validates response format before processing
Performance Optimization
- Per-document caching avoids redundant generation
- Thread pool for parallel processing
- Batch processing with progress tracking
- Configurable LLM parameters for speed/quality tradeoff
- Optional removal of LLM reasoning traces
Integration with LLM Base Class
Inherits from LLM class:
- chat() method for LLM interaction
- Model management (open-source or proprietary)
- Token management and generation parameters
- Support for multiple response generation (for hard negatives)