Implementation:FlagOpen FlagEmbedding BGE Coder Constants
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Code Retrieval, Information Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A constants module that defines the task types, natural languages, and programming languages used by the BGE Coder data generation framework, which targets multilingual code retrieval across many programming languages.
Description
This module is the central configuration for the BGE Coder project. It defines 47 code retrieval task types organized into four main categories: text-to-code (10 tasks), code-to-text (10 tasks), code-to-code (18 tasks), and hybrid (9 tasks), matching the lists under Task Categories. It covers 22 natural languages and 20 programming languages, and it also holds batch size settings, task-specific instruction templates, and quality control prompts for data generation.
The module also provides functions that build prompts for LLM-based data synthesis, build quality control prompts for validating generated pairs, and manage code translation pairs. Tasks that need sequential processing, such as bug description retrieval and code modification, are handled as multi-step tasks.
Usage
This module is used as the foundational configuration layer for BGE Coder's data generation pipeline, providing task definitions, prompt templates, and language specifications that drive the creation of training data for code embedding models.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/data_generation/constant.py
- Lines: 1-1302
Signature
class TaskType(Enum):
    # Task type definitions for code retrieval
    web_code_retrieval = "Web Query to Code Retrieval"
    # ... (47 total task types)

class Language(Enum):
    # Natural language definitions
    en = 'English'
    zh = 'Simplified Chinese'
    # ... (22 total languages)

class CodeLanguage(Enum):
    # Programming language definitions
    java = "Java"
    python = "Python"
    # ... (20 total code languages)

@dataclass
class Task:
    task_type: TaskType
    language: Language
    code_language: CodeLanguage = CodeLanguage.null
    task_instruction: str = None
    tgt_code_language: CodeLanguage = CodeLanguage.null
    main_task_type: str = None

def get_task_def_by_task_type(task_type: Union[str, TaskType]) -> Tuple[str, TaskType, str]
def get_task(task_type: str, language: str, code_language: str, tgt_code_language: Optional[str] = None) -> Task
def get_generation_prompt(task: Task, text: str, text_b: Optional[str] = None, examples: Optional[List[dict]] = None, idx: Optional[int] = None) -> str
def get_quality_control_prompt(task: Task, query: str, pos: str) -> str
def get_gen_hard_neg_prompt(task: Task, query: str, pos: str) -> str
Import
from constant import TaskType, Language, CodeLanguage, Task
from constant import get_task_def_by_task_type, get_task
from constant import get_generation_prompt, get_quality_control_prompt
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| task_type | Union[str, TaskType] | Yes | The task type identifier (e.g., "web_code_retrieval") |
| language | str | Yes | Name of a Language enum member, mostly two-letter codes (e.g., "en", "zh"; Traditional Chinese is "zh_tw") |
| code_language | str | Yes | Programming language identifier (e.g., "python", "java") |
| tgt_code_language | Optional[str] | No | Target code language for translation tasks |
| text | str | Yes | Input text/code for prompt generation |
| text_b | Optional[str] | No | Second input text for comparison/modification tasks |
| examples | Optional[List[dict]] | No | Few-shot examples for generation |
| idx | Optional[int] | No | Step index for multi-step generation tasks |
Outputs
| Name | Type | Description |
|---|---|---|
| task_instruction | str | Natural language instruction describing the retrieval task |
| main_task_type | str | Category of task (text2code, code2text, code2code, hybrid) |
| generation_prompt | str | LLM prompt for generating training data |
| quality_control_prompt | str | Prompt for validating generated data quality |
| Task object | Task | Dataclass containing all task configuration |
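The mapping from string inputs to a Task object can be sketched as follows. This is a minimal, self-contained illustration and not the module's code: the enums are cut down to the members shown in the Signature section, `build_task` is a hypothetical stand-in for `get_task`, and the empty-string value of `CodeLanguage.null` is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskType(Enum):
    web_code_retrieval = "Web Query to Code Retrieval"


class Language(Enum):
    en = "English"
    zh = "Simplified Chinese"


class CodeLanguage(Enum):
    null = ""  # assumed placeholder value; the real value is not shown on this page
    java = "Java"
    python = "Python"


@dataclass
class Task:
    task_type: TaskType
    language: Language
    code_language: CodeLanguage = CodeLanguage.null
    tgt_code_language: CodeLanguage = CodeLanguage.null


def build_task(task_type: str, language: str, code_language: str,
               tgt_code_language: Optional[str] = None) -> Task:
    """Hypothetical stand-in for get_task: resolve enum members by name."""
    return Task(
        task_type=TaskType[task_type],        # raises KeyError on unknown names
        language=Language[language],
        code_language=CodeLanguage[code_language],
        tgt_code_language=(CodeLanguage[tgt_code_language]
                           if tgt_code_language else CodeLanguage.null),
    )


task = build_task("web_code_retrieval", "en", "python")
print(task.task_type.value, "|", task.language.value, "|", task.code_language.value)
```

Looking members up by name means the string inputs in the table above must match enum member names exactly; whether the real `get_task` validates inputs this way is not stated in the source.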
Task Categories
Text-to-Code Tasks
- Web Code Retrieval: Given a web search query, retrieve relevant code
- Code Contest Retrieval: Match competitive programming problems to solutions
- Text2SQL Retrieval: Convert natural language questions to SQL queries
- Error Message Retrieval: Find code solutions for error messages
- Code Explanation Retrieval: Match textual descriptions to code implementations
- API Usage Retrieval: Retrieve code examples demonstrating API usage
- Bug Description Retrieval: Find code fixes for bug descriptions
- Pseudocode Retrieval: Match pseudocode to actual implementations
- Tutorial Query Retrieval: Find code examples for tutorial queries
- Algorithm Description Retrieval: Match algorithm descriptions to code
Code-to-Text Tasks
- Code Summary Retrieval: Generate natural language summaries of code
- Code Review Retrieval: Retrieve code reviews explaining functionality
- Code Intent Retrieval: Extract developer intent from code
- Code Optimization Retrieval: Generate optimization suggestions
- Tutorial Retrieval: Find tutorials for similar code patterns
- Code Issue Discussion Retrieval: Retrieve bug reports and discussions
- API Reference Retrieval: Find API documentation for code
- Code Walkthrough Retrieval: Generate step-by-step explanations
- Code Error Explanation Retrieval: Explain potential errors
- Code to Requirement Retrieval: Extract requirements from code
Code-to-Code Tasks
- Code Context Retrieval: Find the continuation of code segments
- Similar Code Retrieval: Find semantically equivalent code
- Code Translation Retrieval: Translate code between languages
- Code Refinement Retrieval: Generate improved versions of code
- Secure Code Retrieval: Find security-enhanced versions
- Code Version Update Retrieval: Update code to newer language versions
- Code Example Retrieval: Find usage examples for libraries
- Code Dependency Retrieval: Extract dependencies
- Code Pattern Retrieval: Find code following design patterns
- Code History Retrieval: Find previous versions
- Code Integration Retrieval: Find integration examples
- Optimized Code Retrieval: Find performance-optimized versions
- Code Simplification Retrieval: Generate simplified versions
- Code Modularization Retrieval: Create modular versions
- Code Augmentation Retrieval: Add functionality while preserving behavior
- Error Handling Code Retrieval: Add error handling
- Code Documentation Retrieval: Add inline documentation
- Library Adaptation Retrieval: Adapt code to different libraries
Hybrid Tasks
- Code Modification Retrieval: Apply natural language modification instructions
- Code Bug Fix Example Retrieval: Find fixes for specific bugs
- Code Refactoring Pattern Retrieval: Apply refactoring patterns
- Code Style Guideline Example Retrieval: Apply style guidelines
- Code Migration Retrieval: Migrate code to new requirements
- Code Optimization Hybrid Retrieval: Apply specific optimizations
- Code Comparison Retrieval: Compare and explain differences
- Code Best Practices Retrieval: Apply best practices
- Security Vulnerability Fix Retrieval: Fix security vulnerabilities
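One way to picture how a task identifier relates to its main category (the first element returned by `get_task_def_by_task_type`) is a reverse lookup table. The sketch below is illustrative only: it lists one task per category, `code_summary_retrieval` is a hypothetical identifier, and `main_category_of` is not a function of the module.

```python
from typing import Dict, Tuple

# Illustrative subset: one task identifier per main category.
CATEGORY_TASKS: Dict[str, Tuple[str, ...]] = {
    "text2code": ("web_code_retrieval",),
    "code2text": ("code_summary_retrieval",),  # hypothetical identifier
    "code2code": ("code_translation_retrieval",),
    "hybrid": ("code_modification_retrieval",),
}

# Invert the table once so each lookup is a single dict access.
TASK_TO_CATEGORY: Dict[str, str] = {
    task: category
    for category, tasks in CATEGORY_TASKS.items()
    for task in tasks
}


def main_category_of(task_name: str) -> str:
    """Return the main category for a task identifier (hypothetical helper)."""
    try:
        return TASK_TO_CATEGORY[task_name]
    except KeyError:
        raise ValueError(f"unknown task type: {task_name!r}") from None


print(main_category_of("code_translation_retrieval"))
```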
Language Support
Natural Languages
Primary languages (full task coverage):
- English (en)
- Simplified Chinese (zh)
Additional languages (selected tasks):
- Arabic (ar), Bengali (bn), Spanish (es), Persian (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), German (de), Yoruba (yo), Italian (it), Portuguese (pt), Vietnamese (vi), Traditional Chinese (zh_tw)
Programming Languages
High priority (3000 samples each):
- Java, Python, JavaScript, PHP, Ruby, Go, C#, C++
Medium priority (1500 samples each):
- C, Rust, TypeScript, Perl, Shell, SQL
Low priority (750 samples each):
- Batchfile, FORTRAN, Haskell, Lua, PowerShell, Visual Basic
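Taking the per-language figures above at face value, the tiers can be expanded into a per-language quota table. The dictionary names here are hypothetical, and the source does not say whether the figures apply per task or overall, so the total is simple arithmetic on the listed numbers.

```python
# Samples per language in each priority tier, as listed above.
SAMPLES_PER_TIER = {"high": 3000, "medium": 1500, "low": 750}

LANGUAGES_BY_TIER = {
    "high": ["Java", "Python", "JavaScript", "PHP", "Ruby", "Go", "C#", "C++"],
    "medium": ["C", "Rust", "TypeScript", "Perl", "Shell", "SQL"],
    "low": ["Batchfile", "FORTRAN", "Haskell", "Lua", "PowerShell", "Visual Basic"],
}

# Flatten the tiers into one quota per language.
quota = {
    lang: SAMPLES_PER_TIER[tier]
    for tier, langs in LANGUAGES_BY_TIER.items()
    for lang in langs
}

total = sum(quota.values())
print(len(quota), total)  # 20 languages, 8*3000 + 6*1500 + 6*750 = 37500
```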
Special Features
Code Translation Pairs
The module defines 16 translation pairs for code translation tasks:
- C family: C ↔ C++ ↔ C# ↔ Java
- Scripting: Python ↔ Ruby ↔ Perl
- Web: JavaScript ↔ TypeScript ↔ PHP
- Systems: Rust ↔ Go ↔ C++
- Cross-family: Python ↔ C++
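The count of 16 is reproduced if each group above is read as "every pairing of languages within the group". That reading is an assumption, not something the source states, and the names `GROUPS`, `pairs`, and `is_translation_pair` are hypothetical. A minimal sketch:

```python
from itertools import combinations

# Language groups as listed above.
GROUPS = {
    "c_family": ["C", "C++", "C#", "Java"],
    "scripting": ["Python", "Ruby", "Perl"],
    "web": ["JavaScript", "TypeScript", "PHP"],
    "systems": ["Rust", "Go", "C++"],
    "cross_family": ["Python", "C++"],
}

# Store each pair as a frozenset so that direction does not matter.
pairs = set()
for members in GROUPS.values():
    for a, b in combinations(members, 2):
        pairs.add(frozenset((a, b)))


def is_translation_pair(src: str, tgt: str) -> bool:
    """True if src and tgt belong to one of the unordered pairs."""
    return frozenset((src, tgt)) in pairs


print(len(pairs))  # 6 + 3 + 3 + 3 + 1 = 16
print(is_translation_pair("Java", "C"), is_translation_pair("Python", "Go"))
```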
Multi-Step Task Support
Several tasks require multi-step generation (SPECIAL_TASK_STEPS):
- Code Modification Retrieval (2 steps)
- Code Issue Discussion Retrieval (2 steps)
- Code Version Update Retrieval (2 steps)
- Bug Description Retrieval (2 steps)
- The 8 remaining hybrid tasks
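A multi-step task can be thought of as a short chain in which the LLM's answer to step `idx` becomes the input text of step `idx + 1`. The sketch below only illustrates that control flow: `STEPS`, `make_prompt`, `run_steps`, and `fake_llm` are hypothetical, and the actual layout of `SPECIAL_TASK_STEPS` is not shown in the source.

```python
from typing import Callable, Dict, List

# Hypothetical step counts keyed by task identifier.
STEPS: Dict[str, int] = {"code_modification_retrieval": 2}


def make_prompt(task_name: str, text: str, idx: int) -> str:
    """Hypothetical stand-in for get_generation_prompt(..., idx=idx)."""
    return f"[{task_name} step {idx}] {text}"


def run_steps(task_name: str, text: str, llm: Callable[[str], str]) -> List[str]:
    """Run every step of a task, feeding each answer into the next prompt."""
    outputs: List[str] = []
    current = text
    for idx in range(STEPS.get(task_name, 1)):  # single-step tasks default to 1
        current = llm(make_prompt(task_name, current, idx))
        outputs.append(current)
    return outputs


def fake_llm(prompt: str) -> str:
    """Placeholder model that echoes the prompt it was given."""
    return f"<reply to: {prompt}>"


for answer in run_steps("code_modification_retrieval", "def f(): pass", fake_llm):
    print(answer)
```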
Dynamic Batch Sizing
The module includes logic for computing batch sizes based on document length to optimize memory usage during training.
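The source does not give the sizing rule, so the following is only a sketch of one common approach: keep roughly constant the number of tokens per batch by shrinking the batch as documents get longer. The function name, the defaults (64 documents at 512 tokens), and the inverse scaling are all assumptions.

```python
def batch_size_for_length(doc_length: int,
                          base_batch_size: int = 64,
                          base_length: int = 512,
                          min_batch_size: int = 1) -> int:
    """Scale the batch size inversely with document length (assumed rule).

    Documents up to base_length use base_batch_size; longer documents get a
    proportionally smaller batch, never below min_batch_size.
    """
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    scaled = base_batch_size * base_length // max(doc_length, base_length)
    return max(min_batch_size, scaled)


for length in (256, 512, 1024, 4096, 100_000):
    print(length, batch_size_for_length(length))
```

With these assumed defaults, a 1024-token document gets half the batch size of a 512-token one, and very long documents fall back to a batch of one.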
Usage Examples
from constant import get_task_def_by_task_type, get_task
from constant import get_generation_prompt, get_quality_control_prompt

# Get task definition
main_type, task_type, instruction = get_task_def_by_task_type("web_code_retrieval")
# main_type: "text2code"
# instruction: "Given a web search query, retrieve relevant code..."

# Create a task object
task = get_task(
    task_type="web_code_retrieval",
    language="en",
    code_language="python"
)

# Generate a data generation prompt
code = "def hello():\n    print('Hello, world!')"
gen_prompt = get_generation_prompt(task=task, text=code)
# Returns prompt asking LLM to generate a web query for the code

# Generate quality control prompt
query = "How to print hello world in python"
pos = code
qc_prompt = get_quality_control_prompt(task=task, query=query, pos=pos)
# Returns prompt asking LLM to validate the query-code pair

# Handle multi-step task
task = get_task(
    task_type="code_modification_retrieval",
    language="en",
    code_language="python"
)
# Two versions of the same function (illustrative inputs)
code_v1 = "def add(a, b):\n    return a + b"
code_v2 = "def add(a, b, c=0):\n    return a + b + c"
# Step 1: Generate differences
prompt1 = get_generation_prompt(task=task, text=code_v1, text_b=code_v2, idx=0)
# Step 2: Generate modification instruction from the LLM's answer to prompt1
differences = "<LLM response to prompt1>"  # placeholder for the step 1 output
prompt2 = get_generation_prompt(task=task, text=differences, idx=1)

# Code translation task
translation_task = get_task(
    task_type="code_translation_retrieval",
    language="en",
    code_language="python",
    tgt_code_language="java"
)