Implementation:FlagOpen FlagEmbedding BGE Coder Run Generation

From Leeroopedia
Revision as of 14:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FlagOpen_FlagEmbedding_BGE_Coder_Run_Generation.md)


Knowledge Sources
Domains Machine Learning, Data Generation, Code Retrieval, Natural Language Processing
Last Updated 2026-02-09 00:00 GMT

Overview

The main orchestration script for BGE Coder's training data generation pipeline, coordinating corpus loading, triplet generation, and data persistence for code embedding models.

Description

This script is the entry point for generating synthetic training data for the BGE Coder model. It orchestrates the full workflow: loading code corpora, generating query-positive-negative triplets with LLMs, applying quality control, and saving the results. It supports multi-processing for parallel generation, caching to avoid repeated work, and configuration by task type, natural language, and programming language.

Key capabilities include coverage of 63 code retrieval tasks across 22 natural languages and 20 programming languages, support for both open-source and proprietary LLMs (via vLLM), corpus sampling based on document length, optional hard negative generation to improve training quality, and MD5-based deduplication.

The script implements special handling for tasks requiring similarity search (code modification and comparison tasks) and provides comprehensive argument parsing for fine-grained control over the generation process.

Usage

Use this script to generate training data for code embedding models by specifying task type, languages, corpus paths, and model configurations. It's designed to be run from the command line with extensive configuration options.

Code Reference

Source Location

Signature

from typing import List, Optional

# Signatures only; function bodies are elided.

def main(args):
    """Main orchestration function for data generation"""
    ...

def gen_triplets(
    model: str,
    model_type: str,
    port: int,
    positives: List[dict],
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: str,
    examples_pool: Optional[List[dict]] = None,
    num_examples: int = 3,
    tqdm_desc: str = "Generating triplets",
    thread_count: int = 1,
    gen_cache_dir: Optional[str] = None,
    debug_mode: bool = False,
    gen_hard_neg: bool = False,
) -> list: ...

def save_triplets(
    triplets: list,
    save_dir: str,
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: Optional[str] = None
): ...

def get_save_path(
    save_dir: str,
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: Optional[str] = None
) -> str: ...

Import

from run_generation import main, gen_triplets, save_triplets
# Typically run as a script: python run_generation.py --task_type web_code_retrieval ...

I/O Contract

Inputs

Name Type Required Description
task_type str Yes Task type from TaskType enum (e.g., "web_code_retrieval")
code_language str Yes Programming language for corpus (e.g., "python")
corpus_root str Yes Root directory containing code corpus files
save_dir str Yes Directory to save generated triplets
language str No Natural language (default: "en")
tgt_code_language str No Target code language for translation tasks
model str No LLM model name (default: "Qwen2.5-72B-Instruct")
model_type str No Model type: "open-source" or proprietary (default: "open-source")
port int No Port for vllm server (default: 8000)
num_samples int No Number of samples to generate (default: -1 for all)
num_processes int No Parallel processes (default: 1)
doc_length str No Document length filter (default: "len_0_500")
examples_dir str No Directory with few-shot examples
num_examples int No Number of few-shot examples (default: 3)
max_corpus int No Maximum corpus size (default: 500000)
sim_model_name str No Model for similarity search in special tasks
gen_hard_neg bool No Generate hard negatives (default: False)
seed int No Random seed for reproducibility
overwrite bool No Overwrite existing data (default: False)
debug_mode bool No Enable debug output (default: False)

Outputs

Name Type Description
triplets List[dict] Generated query-positive-negative triplets
output_file JSONL File at {save_dir}/{language}/{task_type}/{language}-{code_language}-triplets.jsonl
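
The output path template above can be sketched as a small helper. This is a minimal sketch, not the script's actual get_save_path: the name build_save_path is hypothetical, and the file name used when tgt_code_language is set is an assumption, since this page only says that the translation output name includes both languages.

```python
import os
from typing import Optional


def build_save_path(
    save_dir: str,
    task_type: str,
    language: str,
    code_language: str,
    tgt_code_language: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the documented output path template."""
    if tgt_code_language is None:
        file_name = f"{language}-{code_language}-triplets.jsonl"
    else:
        # Assumption: the real naming scheme for translation pairs may differ.
        file_name = f"{language}-{code_language}-to-{tgt_code_language}-triplets.jsonl"
    return os.path.join(save_dir, language, task_type, file_name)


print(build_save_path("/data/output", "web_code_retrieval", "en", "python"))
```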

Output File Format

Each line in the output JSONL file contains:

{
    "prompt": "Task instruction string",
    "query": "Generated query (text or code)",
    "pos": ["Positive example (code or text)"],
    "neg": ["Negative example 1", "Negative example 2", ...]
}
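
A consumer of the output file might check each record against the four documented fields before using it. The sketch below is illustrative only; parse_triplet_lines and REQUIRED_KEYS are hypothetical names, and the sample records are made up for the demonstration.

```python
import json
from typing import Dict, Iterable, List

REQUIRED_KEYS = ("prompt", "query", "pos", "neg")


def parse_triplet_lines(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL lines and keep only records with the documented fields."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        has_keys = all(key in record for key in REQUIRED_KEYS)
        # "pos" and "neg" are documented as lists of strings.
        if has_keys and isinstance(record["pos"], list) and isinstance(record["neg"], list):
            records.append(record)
    return records


sample_lines = [
    json.dumps({
        "prompt": "Given a question, retrieve relevant code.",
        "query": "reverse a list",
        "pos": ["xs[::-1]"],
        "neg": ["sorted(xs)"],
    }),
    json.dumps({"query": "record with missing fields"}),
    "",
]
valid = parse_triplet_lines(sample_lines)
print(len(valid))
```

With a real output file, the same function can be fed an open file handle, since iterating a text file yields one line at a time.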

Workflow

Step 1: Argument Parsing

Parse command-line arguments including task configuration, model settings, corpus paths, and generation parameters.
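
As a rough illustration of this step, a subset of the documented flags and defaults could be declared with argparse as follows. This is a sketch under assumptions: the real script's parser may declare types, choices, and help text differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser covering a subset of the documented flags."""
    parser = argparse.ArgumentParser(description="Triplet generation (sketch)")
    # Required arguments
    parser.add_argument("--task_type", type=str, required=True)
    parser.add_argument("--code_language", type=str, required=True)
    parser.add_argument("--corpus_root", type=str, required=True)
    parser.add_argument("--save_dir", type=str, required=True)
    # Optional arguments with the defaults listed on this page
    parser.add_argument("--language", type=str, default="en")
    parser.add_argument("--model", type=str, default="Qwen2.5-72B-Instruct")
    parser.add_argument("--model_type", type=str, default="open-source")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--num_samples", type=int, default=-1)
    parser.add_argument("--num_processes", type=int, default=1)
    parser.add_argument("--num_examples", type=int, default=3)
    parser.add_argument("--max_corpus", type=int, default=500000)
    # Boolean switches, passed without a value as in the usage examples
    parser.add_argument("--gen_hard_neg", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--debug_mode", action="store_true")
    return parser


args = build_parser().parse_args([
    "--task_type", "web_code_retrieval",
    "--code_language", "python",
    "--corpus_root", "/data/code_corpus",
    "--save_dir", "/data/output",
])
print(args.language, args.port, args.gen_hard_neg)
```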

Step 2: Corpus Loading

Use CorpusGenerator to:

  • Load code files from corpus_root/code_language directory
  • Filter by document length (e.g., len_0_500, len_500_1000)
  • Sample up to max_corpus documents
  • Optionally load external corpus from external_path
  • Split into positives (primary corpus) and large_positives (for similarity search)
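
The length filter and the max_corpus cap can be pictured with the sketch below. It is not CorpusGenerator itself: parse_length_bucket and filter_and_sample are hypothetical names, and measuring length in characters is an assumption, because this page does not state the unit or how the bucket label is applied to the corpus files.

```python
import random
import re
from typing import List, Tuple


def parse_length_bucket(name: str) -> Tuple[int, int]:
    """Turn a bucket label such as 'len_0_500' into the pair (0, 500)."""
    match = re.fullmatch(r"len_(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"Unrecognized bucket label: {name}")
    return int(match.group(1)), int(match.group(2))


def filter_and_sample(docs: List[str], bucket: str, max_corpus: int, seed: int = 42) -> List[str]:
    """Keep docs whose length falls in [low, high), then sample up to max_corpus."""
    low, high = parse_length_bucket(bucket)
    # Assumption: length is counted in characters for this illustration.
    kept = [doc for doc in docs if low <= len(doc) < high]
    if len(kept) > max_corpus:
        # A seeded generator keeps the sample reproducible.
        kept = random.Random(seed).sample(kept, max_corpus)
    return kept


docs = ["a" * 10, "b" * 600, "c" * 499]
print(len(filter_and_sample(docs, "len_0_500", max_corpus=10)))
```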

Step 3: Similarity Search (Special Tasks)

For code_modification_retrieval and code_comparison_retrieval:

  • Use similarity model to find top-1 most similar code for each positive
  • Store similar code in positives[i]['similar']
  • Clear GPU cache after similarity computation
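
The top-1 lookup can be illustrated with plain cosine similarity over precomputed vectors. This sketch assumes the embeddings already exist (the toy vectors below are invented) and that each positive keeps its code under a "text" key, which is a hypothetical field name; only the "similar" key comes from this page. The real search module may use a different similarity measure or index.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def attach_top1_similar(
    positives: List[Dict],
    pos_vecs: List[Sequence[float]],
    candidates: List[str],
    cand_vecs: List[Sequence[float]],
) -> None:
    """Store the most similar candidate for each positive under 'similar'."""
    for item, vec in zip(positives, pos_vecs):
        best_text, best_score = None, -math.inf
        for cand, cand_vec in zip(candidates, cand_vecs):
            if cand == item["text"]:
                continue  # never pair a snippet with itself
            score = cosine(vec, cand_vec)
            if score > best_score:
                best_text, best_score = cand, score
        item["similar"] = best_text


positives = [{"text": "def add(a, b): return a + b"}]
pos_vecs = [[1.0, 0.0]]
candidates = [
    "def add(a, b): return a + b",
    "def plus(x, y): return x + y",
    "print('hi')",
]
cand_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attach_top1_similar(positives, pos_vecs, candidates, cand_vecs)
print(positives[0]["similar"])
```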

Step 4: Example Loading (Optional)

If examples_dir is provided:

  • Load task-specific few-shot examples
  • Sample up to 30 examples from the pool
  • Examples guide LLM generation with expected format
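
The pool sampling described above might look like the sketch below. The function names are hypothetical, and drawing num_examples from the pool for each prompt is an assumption suggested by the examples_pool and num_examples parameters of gen_triplets, not confirmed behavior.

```python
import random
from typing import List, Optional


def sample_examples_pool(
    examples: Optional[List[dict]], pool_size: int = 30, seed: int = 0
) -> Optional[List[dict]]:
    """Return up to pool_size examples, or None to signal zero-shot generation."""
    if not examples:
        return None
    rng = random.Random(seed)
    return rng.sample(examples, min(pool_size, len(examples)))


def pick_few_shot(pool: Optional[List[dict]], num_examples: int, rng: random.Random) -> List[dict]:
    """Assumed per-prompt step: draw num_examples from the pool, if there is one."""
    if not pool:
        return []
    return rng.sample(pool, min(num_examples, len(pool)))


all_examples = [{"id": i} for i in range(50)]
pool = sample_examples_pool(all_examples)
shots = pick_few_shot(pool, num_examples=3, rng=random.Random(1))
print(len(pool), len(shots))
```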

Step 5: Triplet Generation

Use TripletGenerator to:

  • Generate queries from positives using LLM
  • Apply quality control filtering
  • Generate hard negatives (if gen_hard_neg=True)
  • Cache results per document (using MD5 hash)
  • Support multi-processing for parallel generation
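
Per-document caching keyed by an MD5 hash can be sketched as follows. The one-JSON-file-per-document layout, the function names, and the stand-in generator are assumptions for illustration; the actual cache format used by TripletGenerator is not described on this page.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable, List


def md5_key(text: str) -> str:
    """Hex MD5 digest of the document text, used as the cache key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def generate_with_cache(doc_text: str, cache_dir: str, generate: Callable[[str], dict]) -> dict:
    """Return the cached result for doc_text, generating and storing it on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, md5_key(doc_text) + ".json")
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = generate(doc_text)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False)
    return result


calls: List[str] = []


def fake_generate(text: str) -> dict:
    """Stand-in for an LLM call; records how often it is invoked."""
    calls.append(text)
    return {"query": "what does this code do?", "pos": [text]}


with tempfile.TemporaryDirectory() as tmp:
    first = generate_with_cache("print('hi')", tmp, fake_generate)
    second = generate_with_cache("print('hi')", tmp, fake_generate)

print(len(calls))
```

The second call is served from disk, so the stand-in generator runs only once for the same document.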

Step 6: Data Persistence

Save generated triplets:

  • Deduplicate based on query and positive MD5 hashes
  • Merge with existing data (if any)
  • Save to JSONL format
  • Organize by language and task type
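
The deduplicate-and-merge step can be illustrated as below. Treating a record as a duplicate when both the query hash and the first-positive hash match is an assumption; the page only says deduplication is based on query and positive MD5 hashes. triplet_key and merge_dedup are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def triplet_key(triplet: Dict) -> Tuple[str, str]:
    """Assumed identity of a record: hashes of its query and first positive."""
    return md5_hex(triplet["query"]), md5_hex(triplet["pos"][0])


def merge_dedup(existing: List[Dict], new: List[Dict]) -> List[Dict]:
    """Merge new triplets into existing ones, keeping the first copy of each key."""
    seen = set()
    merged = []
    for triplet in list(existing) + list(new):
        key = triplet_key(triplet)
        if key in seen:
            continue
        seen.add(key)
        merged.append(triplet)
    return merged


existing = [{"query": "reverse a list", "pos": ["xs[::-1]"], "neg": []}]
new = [
    {"query": "reverse a list", "pos": ["xs[::-1]"], "neg": ["sorted(xs)"]},
    {"query": "sort a list", "pos": ["sorted(xs)"], "neg": []},
]
merged = merge_dedup(existing, new)
print(len(merged))
```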

Usage Examples

# Basic usage: Generate Python web code retrieval data
python run_generation.py \
    --task_type web_code_retrieval \
    --code_language python \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --language en \
    --model Qwen2.5-72B-Instruct \
    --num_samples 1000

# With few-shot examples and hard negatives
python run_generation.py \
    --task_type code_summary_retrieval \
    --code_language java \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --examples_dir /data/examples \
    --num_examples 3 \
    --gen_hard_neg \
    --num_processes 8

# Code translation task
python run_generation.py \
    --task_type code_translation_retrieval \
    --code_language python \
    --tgt_code_language java \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --language en

# With document length filtering and corpus limits
python run_generation.py \
    --task_type bug_desc_retrieval \
    --code_language python \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --doc_length "len_0_500 len_500_1000" \
    --max_corpus 100000 \
    --num_samples 5000

# Code modification task (requires similarity model)
python run_generation.py \
    --task_type code_modification_retrieval \
    --code_language python \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --sim_model_name BAAI/bge-base-en-v1.5 \
    --num_processes 4

# Multi-lingual generation
python run_generation.py \
    --task_type web_code_retrieval \
    --code_language python \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --language zh \
    --model Qwen2.5-72B-Instruct

# Debug mode with seed for reproducibility
python run_generation.py \
    --task_type api_usage_retrieval \
    --code_language javascript \
    --corpus_root /data/code_corpus \
    --save_dir /data/output \
    --debug_mode \
    --seed 42 \
    --num_samples 100

Command-Line Arguments

Required Arguments

  • --task_type: One of 63 task types from TaskType enum
  • --code_language: Programming language (python, java, javascript, etc.)
  • --corpus_root: Root directory of code corpus
  • --save_dir: Output directory for generated data

Model Configuration

  • --model: LLM model name (default: Qwen2.5-72B-Instruct)
  • --model_type: "open-source" or proprietary (default: open-source)
  • --port: vllm server port (default: 8000)
  • --sim_model_name: Similarity model for special tasks

Data Configuration

  • --language: Natural language ISO code (default: en)
  • --tgt_code_language: Target language for translation
  • --num_samples: Number of samples (default: -1 for all)
  • --doc_length: Document length filter (default: len_0_500)
  • --max_corpus: Maximum corpus size (default: 500000)
  • --external_path: Additional corpus paths

Generation Configuration

  • --num_processes: Parallel workers (default: 1)
  • --examples_dir: Few-shot examples directory
  • --num_examples: Number of examples (default: 3)
  • --gen_hard_neg: Enable hard negative generation
  • --cache_dir: Cache directory for generation results

Other Options

  • --seed: Random seed for reproducibility
  • --overwrite: Overwrite existing output
  • --debug_mode: Enable verbose debugging

Performance Considerations

Multi-Processing

  • Automatically limited to 80% of CPU cores
  • Each process handles a portion of the corpus
  • Thread-safe caching per document
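
The 80% cap and the per-process split can be sketched like this. The rounding rule (floor, with a minimum of one process) and the strided split are assumptions, and both function names are hypothetical.

```python
import os
from typing import List, Optional


def effective_num_processes(requested: int, cpu_count: Optional[int] = None, fraction: float = 0.8) -> int:
    """Cap the requested worker count at a fraction of the available cores."""
    cores = cpu_count or os.cpu_count() or 1
    cap = max(1, int(cores * fraction))  # assumption: round down, never below 1
    return max(1, min(requested, cap))


def split_corpus(items: List[str], num_parts: int) -> List[List[str]]:
    """Give each process a strided slice so the parts are near-equal in size."""
    return [items[i::num_parts] for i in range(num_parts)]


workers = effective_num_processes(requested=8, cpu_count=4)
parts = split_corpus([f"doc{i}" for i in range(7)], workers)
print(workers, [len(part) for part in parts])
```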

Caching Strategy

  • Per-document caching using MD5 hash as key
  • Avoids regenerating existing triplets
  • Cache stored in {save_dir}/{language}/{task_type}/gen_cache_dir/

Memory Management

  • GPU cache cleared after similarity search
  • Corpus loaded in batches if max_corpus is set
  • Separate handling of large corpus for similarity tasks

Deduplication

  • MD5-based deduplication for queries and positives
  • Merges with existing output file
  • Prevents duplicate training examples

Special Task Handling

Code Modification & Comparison

These tasks require finding similar code:

  1. Load larger corpus for similarity candidates
  2. Use embedding model to find top-1 similar code
  3. Pass both original and similar code to generator
  4. Generate modification instructions or comparisons

Multi-Step Tasks

Tasks like bug_desc_retrieval and code_modification_retrieval:

  1. Generate intermediate output (e.g., buggy code)
  2. Use intermediate output for second generation step
  3. Combine results into final query-positive pair
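
A two-step chain of this kind might be wired up as in the sketch below. Everything specific here is an assumption made for illustration: the prompt wording, the stand-in fake_llm, and the choice of which output becomes the query and which the positive, none of which are stated on this page.

```python
from typing import Callable, Dict


def two_step_generate(code: str, llm: Callable[[str], str]) -> Dict[str, object]:
    """Chain two LLM calls: create an intermediate artifact, then describe it."""
    # Step 1: intermediate output (hypothetical prompt wording).
    buggy_code = llm("Introduce one subtle bug into this code:\n" + code)
    # Step 2: feed the intermediate output into the second prompt.
    bug_description = llm("Describe the bug in this code:\n" + buggy_code)
    # Step 3: combine. Assumption: the description is the query and the
    # buggy code is the positive; the real pairing may differ.
    return {"query": bug_description, "pos": [buggy_code]}


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM call, for illustration only."""
    if prompt.startswith("Introduce"):
        return "def add(a, b): return a - b"
    return "The function subtracts instead of adding."


pair = two_step_generate("def add(a, b): return a + b", fake_llm)
print(pair["query"])
```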

Translation Tasks

Code translation requires tgt_code_language:

  • Examples loaded from specific translation pair directory
  • Generation prompt includes both source and target languages
  • Output file name includes both languages

Error Handling

  • Validates task_type against TaskType enum
  • Checks corpus directory existence
  • Handles missing examples gracefully (falls back to zero-shot)
  • Continues generation even if individual samples fail
  • Skips saving if no triplets generated
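
The "continue on failure" and "skip saving when empty" behaviors can be pictured with a loop like the one below. It is a sketch with hypothetical names (generate_all, flaky_generate), not the script's actual control flow.

```python
from typing import Callable, Iterable, List, Tuple


def generate_all(docs: Iterable[str], generate_one: Callable[[str], dict]) -> Tuple[List[dict], int]:
    """Run generate_one on every document, skipping the ones that raise."""
    triplets: List[dict] = []
    failures = 0
    for doc in docs:
        try:
            triplets.append(generate_one(doc))
        except Exception as exc:  # broad on purpose: one bad sample must not stop the run
            failures += 1
            print(f"skipped one sample: {exc}")
    return triplets, failures


def flaky_generate(doc: str) -> dict:
    """Stand-in generator that fails on empty input."""
    if not doc:
        raise ValueError("empty document")
    return {"query": "what does this code do?", "pos": [doc], "neg": []}


triplets, failures = generate_all(["print(1)", "", "print(2)"], flaky_generate)
if triplets:
    print(f"would save {len(triplets)} triplets ({failures} failed)")
else:
    print("nothing generated, skipping save")
```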

Integration Points

  • CorpusGenerator: Loads and samples code corpus
  • TripletGenerator: Generates query-positive pairs with LLM
  • Search Module: Provides similarity search for special tasks
  • Constant Module: Provides task definitions and prompts
