
Implementation:FlagOpen FlagEmbedding BGE Coder Constants

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Natural Language Processing, Code Retrieval, Information Retrieval
Last Updated: 2026-02-09 00:00 GMT

Overview

A constants module that defines the task types, natural languages, and programming languages used by the BGE Coder data generation framework for multilingual code retrieval.

Description

This module serves as the central configuration for the BGE Coder project. It defines 47 code retrieval task types organized into four main categories: text-to-code (10 tasks), code-to-text (10 tasks), code-to-code (18 tasks), and hybrid (9 tasks). It covers 22 natural languages and 20 programming languages, and also provides batch size management, task-specific instruction templates, and quality control prompts for data generation.

The module also provides functions that build prompts for LLM-based data synthesis, quality control prompts for validating generated query-document pairs, and hard-negative generation prompts, and it manages code translation pairs. Multi-step tasks that require sequential processing, such as bug description retrieval and code modification, receive dedicated handling.

Usage

This module is used as the foundational configuration layer for BGE Coder's data generation pipeline, providing task definitions, prompt templates, and language specifications that drive the creation of training data for code embedding models.

Code Reference

Source Location

Signature

from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple, Union

class TaskType(Enum):
    # Task type definitions for code retrieval
    web_code_retrieval = "Web Query to Code Retrieval"
    # ... (47 total task types; see Task Categories)

class Language(Enum):
    # Natural language definitions
    en = 'English'
    zh = 'Simplified Chinese'
    # ... (22 total languages)

class CodeLanguage(Enum):
    # Programming language definitions
    java = "Java"
    python = "Python"
    # ... (20 total code languages; the `null` member used as a default below is also defined)

@dataclass
class Task:
    task_type: TaskType
    language: Language
    code_language: CodeLanguage = CodeLanguage.null
    task_instruction: str = None
    tgt_code_language: CodeLanguage = CodeLanguage.null
    main_task_type: str = None

def get_task_def_by_task_type(task_type: Union[str, TaskType]) -> Tuple[str, TaskType, str]
def get_task(task_type: str, language: str, code_language: str, tgt_code_language: Optional[str] = None) -> Task
def get_generation_prompt(task: Task, text: str, text_b: Optional[str] = None, examples: Optional[List[dict]] = None, idx: Optional[int] = None) -> str
def get_quality_control_prompt(task: Task, query: str, pos: str) -> str
def get_gen_hard_neg_prompt(task: Task, query: str, pos: str) -> str

Import

from constant import TaskType, Language, CodeLanguage, Task
from constant import get_task_def_by_task_type, get_task
from constant import get_generation_prompt, get_quality_control_prompt

I/O Contract

Inputs

Name | Type | Required | Description
task_type | Union[str, TaskType] | Yes | The task type identifier (e.g., "web_code_retrieval")
language | str | Yes | Name of a Language member, usually an ISO 639-1 code (e.g., "en", "zh"); Traditional Chinese uses "zh_tw"
code_language | str | Yes | Programming language identifier (e.g., "python", "java")
tgt_code_language | Optional[str] | No | Target code language for translation tasks
text | str | Yes | Input text/code for prompt generation
text_b | Optional[str] | No | Second input text for comparison/modification tasks
examples | Optional[List[dict]] | No | Few-shot examples for generation
idx | Optional[int] | No | Step index for multi-step generation tasks

Outputs

Name | Type | Description
task_instruction | str | Natural language instruction describing the retrieval task
main_task_type | str | Category of task (text2code, code2text, code2code, hybrid)
generation_prompt | str | LLM prompt for generating training data
quality_control_prompt | str | Prompt for validating generated data quality
Task object | Task | Dataclass containing all task configuration
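
The string inputs above name enum members. The sketch below shows one way such names could be resolved. It uses cut-down stand-in enums rather than the module's real definitions, and the helper `resolve_member` is hypothetical, not part of the module.

```python
from enum import Enum
from typing import Type, TypeVar

E = TypeVar("E", bound=Enum)


class Language(Enum):
    # Stand-in holding only the two members shown in the signature above.
    en = "English"
    zh = "Simplified Chinese"


class CodeLanguage(Enum):
    # Stand-in holding only the two members shown in the signature above.
    java = "Java"
    python = "Python"


def resolve_member(enum_cls: Type[E], name: str) -> E:
    """Hypothetical helper: look up an enum member by its name."""
    try:
        return enum_cls[name]
    except KeyError:
        valid = ", ".join(member.name for member in enum_cls)
        raise ValueError(
            f"Unknown {enum_cls.__name__} {name!r}; expected one of: {valid}"
        ) from None


language = resolve_member(Language, "zh")
code_language = resolve_member(CodeLanguage, "python")
print(language.value, "/", code_language.value)
```

Failing early with the list of valid names makes typos in configuration files easier to spot. Whether the real `get_task` reports errors this way is not stated in the source.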

Task Categories

Text-to-Code Tasks

  • Web Code Retrieval: Given a web search query, retrieve relevant code
  • Code Contest Retrieval: Match competitive programming problems to solutions
  • Text2SQL Retrieval: Convert natural language questions to SQL queries
  • Error Message Retrieval: Find code solutions for error messages
  • Code Explanation Retrieval: Match textual descriptions to code implementations
  • API Usage Retrieval: Retrieve code examples demonstrating API usage
  • Bug Description Retrieval: Find code fixes for bug descriptions
  • Pseudocode Retrieval: Match pseudocode to actual implementations
  • Tutorial Query Retrieval: Find code examples for tutorial queries
  • Algorithm Description Retrieval: Match algorithm descriptions to code

Code-to-Text Tasks

  • Code Summary Retrieval: Generate natural language summaries of code
  • Code Review Retrieval: Retrieve code reviews explaining functionality
  • Code Intent Retrieval: Extract developer intent from code
  • Code Optimization Retrieval: Generate optimization suggestions
  • Tutorial Retrieval: Find tutorials for similar code patterns
  • Code Issue Discussion Retrieval: Retrieve bug reports and discussions
  • API Reference Retrieval: Find API documentation for code
  • Code Walkthrough Retrieval: Generate step-by-step explanations
  • Code Error Explanation Retrieval: Explain potential errors
  • Code to Requirement Retrieval: Extract requirements from code

Code-to-Code Tasks

  • Code Context Retrieval: Find the continuation of code segments
  • Similar Code Retrieval: Find semantically equivalent code
  • Code Translation Retrieval: Translate code between languages
  • Code Refinement Retrieval: Generate improved versions of code
  • Secure Code Retrieval: Find security-enhanced versions
  • Code Version Update Retrieval: Update code to newer language versions
  • Code Example Retrieval: Find usage examples for libraries
  • Code Dependency Retrieval: Extract dependencies
  • Code Pattern Retrieval: Find code following design patterns
  • Code History Retrieval: Find previous versions
  • Code Integration Retrieval: Find integration examples
  • Optimized Code Retrieval: Find performance-optimized versions
  • Code Simplification Retrieval: Generate simplified versions
  • Code Modularization Retrieval: Create modular versions
  • Code Augmentation Retrieval: Add functionality while preserving behavior
  • Error Handling Code Retrieval: Add error handling
  • Code Documentation Retrieval: Add inline documentation
  • Library Adaptation Retrieval: Adapt code to different libraries

Hybrid Tasks

  • Code Modification Retrieval: Apply natural language modification instructions
  • Code Bug Fix Example Retrieval: Find fixes for specific bugs
  • Code Refactoring Pattern Retrieval: Apply refactoring patterns
  • Code Style Guideline Example Retrieval: Apply style guidelines
  • Code Migration Retrieval: Migrate code to new requirements
  • Code Optimization Hybrid Retrieval: Apply specific optimizations
  • Code Comparison Retrieval: Compare and explain differences
  • Code Best Practices Retrieval: Apply best practices
  • Security Vulnerability Fix Retrieval: Fix security vulnerabilities

Language Support

Natural Languages

Primary languages (full task coverage):

  • English (en)
  • Simplified Chinese (zh)

Additional languages (selected tasks):

  • Arabic (ar), Bengali (bn), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), German (de), Yoruba (yo), Italian (it), Portuguese (pt), Vietnamese (vi), Traditional Chinese (zh_tw)

Programming Languages

High priority (3000 samples each):

  • Java, Python, JavaScript, PHP, Ruby, Go, C#, C++

Medium priority (1500 samples each):

  • C, Rust, TypeScript, Perl, Shell, SQL

Low priority (750 samples each):

  • Batchfile, FORTRAN, Haskell, Lua, PowerShell, Visual Basic
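
As a rough illustration of the tiers above, the following sketch tallies the per-language sample targets. The dictionary names are hypothetical, and reading the figures as a flat per-language budget (rather than per task or per natural language) is an assumption.

```python
# Per-language sample targets, copied from the priority tiers listed above.
SAMPLES_PER_TIER = {"high": 3000, "medium": 1500, "low": 750}

LANGUAGES_BY_TIER = {
    "high": ["Java", "Python", "JavaScript", "PHP", "Ruby", "Go", "C#", "C++"],
    "medium": ["C", "Rust", "TypeScript", "Perl", "Shell", "SQL"],
    "low": ["Batchfile", "FORTRAN", "Haskell", "Lua", "PowerShell", "Visual Basic"],
}

# Hypothetical flattened view: language name -> sample target.
sample_target = {
    lang: SAMPLES_PER_TIER[tier]
    for tier, langs in LANGUAGES_BY_TIER.items()
    for lang in langs
}

total_samples = sum(sample_target.values())
print(len(sample_target), total_samples)  # 20 37500
```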

Special Features

Code Translation Pairs

The module defines 16 translation pairs for code translation tasks:

  • C family: C ↔ C++ ↔ C# ↔ Java
  • Scripting: Python ↔ Ruby ↔ Perl
  • Web: JavaScript ↔ TypeScript ↔ PHP
  • Systems: Rust ↔ Go ↔ C++
  • Cross-family: Python ↔ C++
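
A translation pair can be modelled as a (source, target) tuple. The sketch below uses an illustrative subset of three pairs read off the families above, not the module's actual list of 16. The lowercase identifiers other than "python", the helper name, and the bidirectional treatment are all assumptions.

```python
from typing import FrozenSet, Tuple

# Illustrative subset only: three adjacent pairs taken from the families above.
# The module's actual list of 16 pairs is not reproduced here.
EXAMPLE_PAIRS: FrozenSet[Tuple[str, str]] = frozenset({
    ("python", "ruby"),
    ("javascript", "typescript"),
    ("rust", "go"),
})


def is_example_pair(src: str, tgt: str, bidirectional: bool = True) -> bool:
    """Hypothetical check; treating pairs as bidirectional is an assumption."""
    if (src, tgt) in EXAMPLE_PAIRS:
        return True
    return bidirectional and (tgt, src) in EXAMPLE_PAIRS


print(is_example_pair("ruby", "python"))                       # True
print(is_example_pair("ruby", "python", bidirectional=False))  # False
print(is_example_pair("python", "go"))                         # False
```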

Multi-Step Task Support

Several tasks require multi-step generation (SPECIAL_TASK_STEPS):

  • Code Modification Retrieval (2 steps)
  • Code Issue Discussion Retrieval (2 steps)
  • Code Version Update Retrieval (2 steps)
  • Bug Description Retrieval (2 steps)
  • And 8 other hybrid tasks
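
The multi-step flow can be sketched as a loop in which each step's LLM output becomes the next step's input. Everything below is a stand-in: `EXAMPLE_TASK_STEPS` imitates the role of SPECIAL_TASK_STEPS with a single entry, and `run_steps`, `fake_build_prompt`, and `fake_llm` are hypothetical names so the sketch runs without the module or a model.

```python
from typing import Callable, Dict, List

# Hypothetical step table keyed by task name; the count mirrors the list above.
EXAMPLE_TASK_STEPS: Dict[str, int] = {
    "code_modification_retrieval": 2,
}


def run_steps(
    task_name: str,
    first_input: str,
    build_prompt: Callable[[str, int], str],
    call_llm: Callable[[str], str],
) -> List[str]:
    """Run each step in order, feeding one step's output into the next."""
    n_steps = EXAMPLE_TASK_STEPS.get(task_name, 1)  # single-step by default
    outputs: List[str] = []
    current = first_input
    for idx in range(n_steps):
        prompt = build_prompt(current, idx)
        current = call_llm(prompt)
        outputs.append(current)
    return outputs


# Stub prompt builder and LLM so the sketch runs without a model.
def fake_build_prompt(text: str, idx: int) -> str:
    return f"[step {idx}] {text}"


def fake_llm(prompt: str) -> str:
    return prompt.upper()


results = run_steps(
    "code_modification_retrieval", "diff a b", fake_build_prompt, fake_llm
)
print(results)
```

In a real pipeline, `build_prompt` would presumably wrap `get_generation_prompt` with the matching `idx`, and `call_llm` would call the model used for synthesis.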

Dynamic Batch Sizing

The module includes logic for computing batch sizes based on document length to optimize memory usage during training.
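
The source does not state the rule the module uses. As a minimal sketch of length-based batch sizing in general, the function below keeps the product of batch size and maximum sequence length within a fixed token budget. The function name, the budget, and the cap are all hypothetical values chosen for illustration.

```python
def batch_size_for_length(
    max_tokens: int, token_budget: int = 32768, cap: int = 256
) -> int:
    """Hypothetical rule: keep batch_size * max_tokens within a token budget."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Longer documents get smaller batches; never go below 1 or above the cap.
    return max(1, min(cap, token_budget // max_tokens))


for length in (64, 512, 4096, 100000):
    print(length, batch_size_for_length(length))
```

With these illustrative defaults, a 512-token document yields a batch size of 64 and a 4096-token document yields 8.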

Usage Examples

# Get task definition
main_type, task_type, instruction = get_task_def_by_task_type("web_code_retrieval")
# main_type: "text2code"
# instruction: "Given a web search query, retrieve relevant code..."

# Create a task object
task = get_task(
    task_type="web_code_retrieval",
    language="en",
    code_language="python"
)

# Generate a data generation prompt
code = "def hello():\n    print('Hello, world!')"
gen_prompt = get_generation_prompt(task=task, text=code)
# Returns prompt asking LLM to generate a web query for the code

# Generate quality control prompt
query = "How to print hello world in python"
pos = code
qc_prompt = get_quality_control_prompt(task=task, query=query, pos=pos)
# Returns prompt asking LLM to validate the query-code pair

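# Handle multi-step task
task = get_task(
    task_type="code_modification_retrieval",
    language="en",
    code_language="python"
)
# Two versions of the same function (illustrative inputs)
code_v1 = "def hello():\n    print('Hello')"
code_v2 = "def hello(name):\n    print(f'Hello, {name}!')"
# Step 1: Generate differences
prompt1 = get_generation_prompt(task=task, text=code_v1, text_b=code_v2, idx=0)
# `differences` stands for the LLM's response to prompt1; a placeholder is used here
differences = "The function now takes a `name` argument and greets it."
# Step 2: Generate modification instruction
prompt2 = get_generation_prompt(task=task, text=differences, idx=1)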

# Code translation task
translation_task = get_task(
    task_type="code_translation_retrieval",
    language="en",
    code_language="python",
    tgt_code_language="java"
)
