Implementation:FlagOpen FlagEmbedding BGE Coder Run Generation
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Generation, Code Retrieval, Natural Language Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The main orchestration script for BGE Coder's training data generation pipeline, coordinating corpus loading, triplet generation, and data persistence for code embedding models.
Description
This script is the entry point for generating synthetic training data for the BGE Coder model. It orchestrates the full workflow: loading code corpora, generating query-positive-negative triplets with LLMs, applying quality control, and saving the results. It supports multi-processing for parallel generation, caching to avoid repeated work, and configuration by task type, natural language, and programming language.
Key capabilities include 63 code retrieval tasks across 22 natural languages and 20 programming languages, support for both open-source LLMs (served via vllm) and proprietary LLMs, corpus sampling by document length, optional hard negative generation to improve training quality, and MD5-based deduplication.
The script implements special handling for tasks requiring similarity search (code modification and comparison tasks) and provides comprehensive argument parsing for fine-grained control over the generation process.
Usage
Use this script to generate training data for code embedding models by specifying task type, languages, corpus paths, and model configurations. It's designed to be run from the command line with extensive configuration options.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/data_generation/run_generation.py
- Lines: 1-368
Signature
from typing import List, Optional

def main(args):
    """Main orchestration function for data generation."""
    ...

def gen_triplets(
    model: str,
    model_type: str,
    port: int,
    positives: List[dict],
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: str,
    examples_pool: Optional[List[dict]] = None,
    num_examples: int = 3,
    tqdm_desc: str = "Generating triplets",
    thread_count: int = 1,
    gen_cache_dir: Optional[str] = None,
    debug_mode: bool = False,
    gen_hard_neg: bool = False,
) -> list:
    ...

def save_triplets(
    triplets: list,
    save_dir: str,
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: Optional[str] = None,
):
    ...

def get_save_path(
    save_dir: str,
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: Optional[str] = None,
) -> str:
    ...
Import
from run_generation import main, gen_triplets, save_triplets
# Typically run as a script: python run_generation.py --task_type web_code_retrieval ...
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| task_type | str | Yes | Task type from TaskType enum (e.g., "web_code_retrieval") |
| code_language | str | Yes | Programming language for corpus (e.g., "python") |
| corpus_root | str | Yes | Root directory containing code corpus files |
| save_dir | str | Yes | Directory to save generated triplets |
| language | str | No | Natural language (default: "en") |
| tgt_code_language | str | No | Target code language for translation tasks |
| model | str | No | LLM model name (default: "Qwen2.5-72B-Instruct") |
| model_type | str | No | Model type: "open-source" or proprietary (default: "open-source") |
| port | int | No | Port for vllm server (default: 8000) |
| num_samples | int | No | Number of samples to generate (default: -1 for all) |
| num_processes | int | No | Parallel processes (default: 1) |
| doc_length | str | No | Document length filter (default: "len_0_500") |
| examples_dir | str | No | Directory with few-shot examples |
| num_examples | int | No | Number of few-shot examples (default: 3) |
| max_corpus | int | No | Maximum corpus size (default: 500000) |
| sim_model_name | str | No | Model for similarity search in special tasks |
| gen_hard_neg | bool | No | Generate hard negatives (default: False) |
| seed | int | No | Random seed for reproducibility |
| overwrite | bool | No | Overwrite existing data (default: False) |
| debug_mode | bool | No | Enable debug output (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| triplets | List[dict] | Generated query-positive-negative triplets |
| output_file | JSONL | File at {save_dir}/{language}/{task_type}/{language}-{code_language}-triplets.jsonl (for translation tasks, the file name also includes tgt_code_language) |
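The path template in the table can be sketched as a small helper. This is an illustration only, not the script's actual `get_save_path`: the function name `sketch_save_path` is hypothetical, and the file-name pattern used when a target code language is present is an assumption, since the source states only that both languages appear in the name.

```python
import os
from typing import Optional


def sketch_save_path(
    save_dir: str,
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: Optional[str] = None,
) -> str:
    """Build an output path that follows the documented template."""
    if tgt_code_language is None:
        file_name = f"{language}-{code_language}-triplets.jsonl"
    else:
        # Assumed naming for translation tasks; the real pattern may differ.
        file_name = (
            f"{language}-{code_language}-to-{tgt_code_language}-triplets.jsonl"
        )
    return os.path.join(save_dir, language, task_type, file_name)


if __name__ == "__main__":
    print(sketch_save_path("/data/output", "web_code_retrieval", "en", "python"))
```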
Output File Format
Each line in the output JSONL file contains:
{
  "prompt": "Task instruction string",
  "query": "Generated query (text or code)",
  "pos": ["Positive example (code or text)"],
  "neg": ["Negative example 1", "Negative example 2", ...]
}
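A downstream consumer might read the output file line by line and check that each record carries the four documented fields. The sketch below assumes nothing beyond the record layout shown above; `iter_triplets` is a hypothetical helper name, not part of the script.

```python
import json
from typing import Iterator

REQUIRED_KEYS = ("prompt", "query", "pos", "neg")


def iter_triplets(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line, checking the documented fields."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_KEYS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            if not isinstance(record["pos"], list) or not isinstance(record["neg"], list):
                raise ValueError(f"line {line_no}: 'pos' and 'neg' must be lists")
            yield record
```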
Workflow
Step 1: Argument Parsing
Parse command-line arguments including task configuration, model settings, corpus paths, and generation parameters.
Step 2: Corpus Loading
Use CorpusGenerator to:
- Load code files from corpus_root/code_language directory
- Filter by document length (e.g., len_0_500, len_500_1000)
- Sample up to max_corpus documents
- Optionally load external corpus from external_path
- Split into positives (primary corpus) and large_positives (for similarity search)
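The length filter and the max_corpus cap can be illustrated with a short sketch. It assumes that bucket names follow the `len_<min>_<max>` pattern seen in the examples, that each document is a dict with a precomputed `length` field, and that ranges are half-open. None of these details is confirmed by the source, and the real CorpusGenerator may instead rely on the directory layout.

```python
import random
from typing import List, Optional, Tuple


def parse_bucket(name: str) -> Tuple[int, int]:
    """Turn a bucket name such as 'len_0_500' into (0, 500)."""
    prefix, low, high = name.split("_")
    if prefix != "len":
        raise ValueError(f"unexpected bucket name: {name}")
    return int(low), int(high)


def filter_and_sample(
    docs: List[dict],
    buckets: List[str],
    max_corpus: int,
    seed: Optional[int] = None,
) -> List[dict]:
    """Keep documents whose length falls in any bucket, then cap the count."""
    ranges = [parse_bucket(b) for b in buckets]
    # Assumption: each doc has a precomputed "length"; ranges are [low, high).
    kept = [d for d in docs if any(lo <= d["length"] < hi for lo, hi in ranges)]
    if len(kept) > max_corpus:
        kept = random.Random(seed).sample(kept, max_corpus)
    return kept
```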
Step 3: Similarity Search (Special Tasks)
For code_modification_retrieval and code_comparison_retrieval:
- Use similarity model to find top-1 most similar code for each positive
- Store similar code in positives[i]['similar']
- Clear GPU cache after similarity computation
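The top-1 lookup can be sketched without any embedding library by working on precomputed vectors. The sketch assumes cosine similarity and a `text` field on each document, and it skips candidates identical to the query document. The script's search module may use a different metric, a different field name, or an approximate index.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def attach_top1_similar(
    positives: List[dict],
    pos_vecs: List[Sequence[float]],
    candidates: List[dict],
    cand_vecs: List[Sequence[float]],
) -> None:
    """Store the most similar candidate text in positives[i]['similar']."""
    for doc, vec in zip(positives, pos_vecs):
        best_text, best_score = None, float("-inf")
        for cand, cvec in zip(candidates, cand_vecs):
            if cand["text"] == doc["text"]:
                continue  # do not pair a document with itself
            score = cosine(vec, cvec)
            if score > best_score:
                best_text, best_score = cand["text"], score
        doc["similar"] = best_text
```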
Step 4: Example Loading (Optional)
If examples_dir is provided:
- Load task-specific few-shot examples
- Sample up to 30 examples from the pool
- Examples guide LLM generation with expected format
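One way to picture the two sampling levels (a pool of up to 30 examples, then num_examples per prompt) is the sketch below. The helper names are hypothetical, and whether the script draws fresh few-shot examples for every document is an assumption.

```python
import random
from typing import List, Optional


def build_examples_pool(
    examples: List[dict], pool_size: int = 30, seed: Optional[int] = None
) -> List[dict]:
    """Sample up to pool_size examples from everything that was loaded."""
    if len(examples) <= pool_size:
        return list(examples)
    return random.Random(seed).sample(examples, pool_size)


def pick_few_shot(
    pool: List[dict], num_examples: int = 3, rng: Optional[random.Random] = None
) -> List[dict]:
    """Pick the few-shot examples for one prompt; empty pool means zero-shot."""
    if not pool:
        return []
    rng = rng if rng is not None else random.Random()
    return rng.sample(pool, min(num_examples, len(pool)))
```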
Step 5: Triplet Generation
Use TripletGenerator to:
- Generate queries from positives using LLM
- Apply quality control filtering
- Generate hard negatives (if gen_hard_neg=True)
- Cache results per document (using MD5 hash)
- Support multi-processing for parallel generation
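Per-document caching keyed by an MD5 hash can be sketched as follows. The one-file-per-document layout and the JSON file format are assumptions for illustration; `generate` stands in for the LLM call and is supplied by the caller.

```python
import hashlib
import json
import os
from typing import Callable, Optional


def md5_key(text: str) -> str:
    """Hex MD5 digest of the document text, used as the cache key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def generate_with_cache(
    doc_text: str,
    generate: Callable[[str], dict],
    cache_dir: Optional[str] = None,
) -> dict:
    """Return the cached result for doc_text, generating it only on a miss."""
    if cache_dir is None:
        return generate(doc_text)
    os.makedirs(cache_dir, exist_ok=True)
    # Assumed layout: one JSON file per document, named after its hash.
    cache_path = os.path.join(cache_dir, md5_key(doc_text) + ".json")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = generate(doc_text)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False)
    return result
```

Because the key depends only on the document text, rerunning the script with the same corpus and cache directory skips documents that were already processed.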
Step 6: Data Persistence
Save generated triplets:
- Deduplicate based on query and positive MD5 hashes
- Merge with existing data (if any)
- Save to JSONL format
- Organize by language and task type
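The deduplicate-and-merge step might look like the sketch below. It assumes that the deduplication key combines the MD5 of the query with the MD5 of the first positive. The source says only that both hashes are used, so the exact key construction is a guess.

```python
import hashlib
import json
import os
from typing import List, Tuple


def triplet_key(triplet: dict) -> Tuple[str, str]:
    """Assumed key: (md5 of query, md5 of first positive)."""
    def md5(s: str) -> str:
        return hashlib.md5(s.encode("utf-8")).hexdigest()
    return md5(triplet["query"]), md5(triplet["pos"][0])


def merge_and_save(new_triplets: List[dict], path: str) -> int:
    """Merge new triplets with any existing file, drop duplicates, rewrite."""
    existing = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = [json.loads(line) for line in f if line.strip()]
    merged, seen = [], set()
    for triplet in existing + new_triplets:
        key = triplet_key(triplet)
        if key in seen:
            continue
        seen.add(key)
        merged.append(triplet)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for triplet in merged:
            f.write(json.dumps(triplet, ensure_ascii=False) + "\n")
    return len(merged)
```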
Usage Examples
# Basic usage: Generate Python web code retrieval data
python run_generation.py \
--task_type web_code_retrieval \
--code_language python \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--language en \
--model Qwen2.5-72B-Instruct \
--num_samples 1000
# With few-shot examples and hard negatives
python run_generation.py \
--task_type code_summary_retrieval \
--code_language java \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--examples_dir /data/examples \
--num_examples 3 \
--gen_hard_neg \
--num_processes 8
# Code translation task
python run_generation.py \
--task_type code_translation_retrieval \
--code_language python \
--tgt_code_language java \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--language en
# With document length filtering and corpus limits
python run_generation.py \
--task_type bug_desc_retrieval \
--code_language python \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--doc_length "len_0_500 len_500_1000" \
--max_corpus 100000 \
--num_samples 5000
# Code modification task (requires similarity model)
python run_generation.py \
--task_type code_modification_retrieval \
--code_language python \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--sim_model_name BAAI/bge-base-en-v1.5 \
--num_processes 4
# Multi-lingual generation
python run_generation.py \
--task_type web_code_retrieval \
--code_language python \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--language zh \
--model Qwen2.5-72B-Instruct
# Debug mode with seed for reproducibility
python run_generation.py \
--task_type api_usage_retrieval \
--code_language javascript \
--corpus_root /data/code_corpus \
--save_dir /data/output \
--debug_mode \
--seed 42 \
--num_samples 100
Command-Line Arguments
Required Arguments
- --task_type: One of 63 task types from TaskType enum
- --code_language: Programming language (python, java, javascript, etc.)
- --corpus_root: Root directory of code corpus
- --save_dir: Output directory for generated data
Model Configuration
- --model: LLM model name (default: Qwen2.5-72B-Instruct)
- --model_type: "open-source" or proprietary (default: open-source)
- --port: vllm server port (default: 8000)
- --sim_model_name: Similarity model for special tasks
Data Configuration
- --language: Natural language ISO code (default: en)
- --tgt_code_language: Target language for translation
- --num_samples: Number of samples (default: -1 for all)
- --doc_length: Document length filter (default: len_0_500)
- --max_corpus: Maximum corpus size (default: 500000)
- --external_path: Additional corpus paths
Generation Configuration
- --num_processes: Parallel workers (default: 1)
- --examples_dir: Few-shot examples directory
- --num_examples: Number of examples (default: 3)
- --gen_hard_neg: Enable hard negative generation
- --cache_dir: Cache directory for generation results
Other Options
- --seed: Random seed for reproducibility
- --overwrite: Overwrite existing output
- --debug_mode: Enable verbose debugging
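For orientation, a subset of these options could be declared with argparse roughly as follows. This is a sketch built from the names and defaults listed above, not a copy of the script's parser: it omits --doc_length, --external_path, and several other options, does not validate --task_type against the TaskType enum, and treats the boolean options as store_true flags, which matches how they appear in the usage examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented options with their listed defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the generation CLI")
    # Required arguments
    parser.add_argument("--task_type", type=str, required=True)
    parser.add_argument("--code_language", type=str, required=True)
    parser.add_argument("--corpus_root", type=str, required=True)
    parser.add_argument("--save_dir", type=str, required=True)
    # Model configuration
    parser.add_argument("--model", type=str, default="Qwen2.5-72B-Instruct")
    parser.add_argument("--model_type", type=str, default="open-source")
    parser.add_argument("--port", type=int, default=8000)
    # Data and generation configuration
    parser.add_argument("--language", type=str, default="en")
    parser.add_argument("--num_samples", type=int, default=-1)
    parser.add_argument("--num_processes", type=int, default=1)
    parser.add_argument("--num_examples", type=int, default=3)
    parser.add_argument("--max_corpus", type=int, default=500000)
    # Flags (assumed store_true) and seed (no documented default)
    parser.add_argument("--gen_hard_neg", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--debug_mode", action="store_true")
    parser.add_argument("--seed", type=int, default=None)
    return parser
```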
Performance Considerations
Multi-Processing
- The worker count is automatically capped at 80% of the available CPU cores
- Each process handles a portion of the corpus
- Thread-safe caching per document
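The 80% cap and the per-process split can be sketched like this. Rounding down and contiguous, near-equal chunks are assumptions; the script may round or partition differently.

```python
import math
import os
from typing import List, Optional


def effective_workers(requested: int, cpu_count: Optional[int] = None) -> int:
    """Cap the requested worker count at 80% of the CPU cores (at least 1)."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    cap = max(1, int(cores * 0.8))  # assumption: round down
    return max(1, min(requested, cap))


def split_evenly(items: List[dict], n_parts: int) -> List[List[dict]]:
    """Split items into at most n_parts contiguous chunks of similar size."""
    if not items:
        return []
    size = math.ceil(len(items) / n_parts)
    return [items[i:i + size] for i in range(0, len(items), size)]
```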
Caching Strategy
- Per-document caching using MD5 hash as key
- Avoids regenerating existing triplets
- Cache stored in {save_dir}/{language}/{task_type}/gen_cache_dir/
Memory Management
- GPU cache cleared after similarity search
- Corpus loaded in batches if max_corpus is set
- Separate handling of large corpus for similarity tasks
Deduplication
- MD5-based deduplication for queries and positives
- Merges with existing output file
- Prevents duplicate training examples
Special Task Handling
Code Modification & Comparison
These tasks require finding similar code:
1. Load a larger corpus for similarity candidates
2. Use an embedding model to find the top-1 similar code
3. Pass both the original and the similar code to the generator
4. Generate modification instructions or comparisons
Multi-Step Tasks
Tasks like bug_desc_retrieval and code_modification_retrieval:
1. Generate an intermediate output (e.g., buggy code)
2. Use the intermediate output for a second generation step
3. Combine the results into the final query-positive pair
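The two-step pattern can be illustrated with a generic helper that chains two LLM calls. The `llm` callable and both prompt templates are placeholders supplied by the caller; the real prompts come from the constant module. How the intermediate and final outputs map onto query and positive is task-specific and not shown here.

```python
from typing import Callable, Dict


def run_two_step(
    source_code: str,
    llm: Callable[[str], str],
    first_template: str,
    second_template: str,
) -> Dict[str, str]:
    """Run two chained generations; the second sees the first one's output."""
    # Step 1: produce the intermediate artifact (e.g., a buggy variant).
    intermediate = llm(first_template.format(code=source_code))
    # Step 2: generate from the intermediate artifact (and the original code).
    final = llm(
        second_template.format(code=source_code, intermediate=intermediate)
    )
    return {"intermediate": intermediate, "final": final}
```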
Translation Tasks
Code translation requires tgt_code_language:
- Examples loaded from specific translation pair directory
- Generation prompt includes both source and target languages
- Output file name includes both languages
Error Handling
- Validates task_type against TaskType enum
- Checks corpus directory existence
- Handles missing examples gracefully (falls back to zero-shot)
- Continues generation even if individual samples fail
- Skips saving if no triplets generated
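The last two behaviors (continuing past individual failures and skipping the save when nothing was generated) can be sketched as a simple loop. The helper names are hypothetical, and `generate_one` and `save` are caller-supplied stand-ins for the LLM call and the persistence step.

```python
from typing import Callable, List, Optional, Tuple


def generate_all(
    docs: List[dict],
    generate_one: Callable[[dict], Optional[dict]],
) -> Tuple[List[dict], int]:
    """Generate a triplet per document, counting failures instead of aborting."""
    triplets, failures = [], 0
    for doc in docs:
        try:
            result = generate_one(doc)
        except Exception:
            failures += 1  # one bad sample should not stop the whole run
            continue
        if result is not None:
            triplets.append(result)
    return triplets, failures


def save_if_any(triplets: List[dict], save: Callable[[List[dict]], None]) -> bool:
    """Call save only when there is something to write."""
    if not triplets:
        return False
    save(triplets)
    return True
```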
Integration Points
- CorpusGenerator: Loads and samples code corpus
- TripletGenerator: Generates query-positive pairs with LLM
- Search Module: Provides similarity search for special tasks
- Constant Module: Provides task definitions and prompts