Implementation:FlagOpen FlagEmbedding BGE Coder Utils

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Code Processing, Data Cleaning, Multi-language Support
Last Updated	2026-02-09 00:00 GMT

Overview

Utilities for cleaning source code (comment removal and structural validation) across 18 programming languages, plus a helper for cleaning LLM-generated text.

Description

This module cleans source code written in multiple programming languages. For each language it removes single-line and multi-line comments using language-specific patterns, checks for a function or class definition to confirm that the snippet contains real code, consolidates empty lines, normalizes Unicode, and discards snippets shorter than a minimum length. It supports 18 languages, including Python, Java, JavaScript, C/C++, Go, Rust, Ruby, PHP, TypeScript, Perl, Shell, and SQL. Each language has its own patterns for telling valid code structures apart from files that contain only imports or comments.
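The steps described above can be sketched for a single language. This is a minimal illustration, not the repository's implementation: the function name sketch_clean_python, the regular expressions, and the choice of NFKC normalization are all assumptions made for the example, and the real module keeps its own per-language pattern tables.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only; they are NOT copied from utils.py.
SINGLE_LINE_COMMENT = re.compile(r"#[^\n]*")
MULTI_LINE_COMMENT = re.compile(r'"""[\s\S]*?"""|\'\'\'[\s\S]*?\'\'\'')
DEFINITION = re.compile(r"^\s*(def|class)\s+\w+", re.MULTILINE)


def sketch_clean_python(code: str, length_threshold: int = 30) -> str:
    """Illustrative cleaner: returns cleaned code, or "" if it looks trivial."""
    # Normalize Unicode (NFKC is an assumption; the source does not name a form).
    code = unicodedata.normalize("NFKC", code)
    # Strip multi-line comments first, then single-line ones.
    code = MULTI_LINE_COMMENT.sub("", code)
    code = SINGLE_LINE_COMMENT.sub("", code)
    # Drop trailing whitespace and collapse runs of blank lines.
    lines = [line.rstrip() for line in code.splitlines()]
    code = re.sub(r"\n{2,}", "\n", "\n".join(lines)).strip()
    # Reject snippets with no function/class definition or too little text.
    if not DEFINITION.search(code) or len(code) < length_threshold:
        return ""
    return code
```

The sketch is deliberately naive: a `#` inside a string literal is treated as a comment, which a production cleaner would need to handle.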

Usage

Use this module to preprocess code data for training embedding models, to filter comment-only or import-only files out of a corpus, and to normalize code snippets from diverse sources and languages. In the BGE-Coder data generation pipeline, clean_code ensures that training data contains meaningful code content.
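As a rough sketch of the filtering use case, the loop below keeps only records whose cleaned code is non-empty. The record field names ("code", "lang"), the helper filter_corpus, and the toy stand_in_cleaner are hypothetical and exist only so the example runs on its own; in practice the cleaner argument would be clean_code from this module.

```python
from typing import Callable, Dict, Iterable, List


def filter_corpus(
    records: Iterable[Dict[str, str]],
    cleaner: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Keep only records whose cleaned code is non-empty."""
    kept = []
    for record in records:
        # "code" and "lang" are hypothetical field names for this sketch.
        cleaned = cleaner(record["code"], record["lang"])
        if cleaned:  # an empty string signals a rejected snippet
            kept.append({"lang": record["lang"], "code": cleaned})
    return kept


def stand_in_cleaner(code: str, lang: str) -> str:
    """Toy replacement for clean_code so the sketch runs on its own."""
    return code.strip() if "def " in code or "class " in code else ""
```

Passing the cleaner as a parameter keeps the loop independent of any particular cleaning rules, so the same code works with clean_code and a custom length_threshold (for example via functools.partial).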

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/BGE_Coder/data_generation/utils.py
Lines: 1-128

Signature

def clean_content(content: str):
    """Clean generated content by removing thinking tags and quotes"""

def clean_code(code: str, lang: str, length_threshold: int = 30) -> str:
    """Clean code by removing comments and validating structure"""

Import

from utils import clean_content, clean_code

I/O Contract

Inputs

Name	Type	Required	Description
content	str	Yes	Raw content string to clean
code	str	Yes	Raw code string to clean
lang	str	Yes	Programming language identifier (e.g., "python", "java")
length_threshold	int	No	Minimum cleaned code length (default: 30)

Outputs

Name	Type	Description
cleaned_content	str	Cleaned content with tags/quotes removed
cleaned_code	str	Cleaned code or empty string if invalid
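
The content-to-cleaned_content part of the contract might look roughly like the following. This is a sketch under stated assumptions, not the code in utils.py: the name sketch_clean_content, the regular expression, and the rule of removing exactly one pair of matching surrounding quotes are all illustrative choices. The `<think>` tag name is taken from the usage example on this page.

```python
import re

# Hypothetical pattern: matches a <think>...</think> block, including newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def sketch_clean_content(content: str) -> str:
    """Illustrative only: drop thinking blocks and one pair of outer quotes."""
    # Remove any reasoning block and the whitespace left around it.
    content = THINK_BLOCK.sub("", content).strip()
    # Remove one pair of matching surrounding quotes, if present.
    if len(content) >= 2 and content[0] == content[-1] and content[0] in "\"'":
        content = content[1:-1]
    return content
```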

Usage Examples

# Example 1: Clean Python code
from utils import clean_code

python_code = '''
# This is a comment
import numpy as np

def calculate_sum(a, b):
    """Calculate sum of two numbers."""
    # Add the numbers
    return a + b

class Calculator:
    def __init__(self):
        self.result = 0
'''

cleaned = clean_code(python_code, "python", length_threshold=50)
print(cleaned)
# Prints the code with comments stripped; the snippet is kept because it contains function and class definitions

# Example 2: Clean Java code
java_code = '''
/**
 * Multi-line comment
 * about the class
 */
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
'''

cleaned_java = clean_code(java_code, "java", length_threshold=30)
print(cleaned_java)

# Example 3: Filter out invalid code
invalid_code = "import sys\nimport os\n# Only imports"
result = clean_code(invalid_code, "python", length_threshold=30)
print(result)  # Returns empty string (no actual code logic)

# Example 4: Clean LLM-generated content
from utils import clean_content

llm_output = '<think>Let me think...</think>\n"Here is the code"'
cleaned = clean_content(llm_output)
print(cleaned)  # Expected: Here is the code (thinking block and surrounding quotes removed)
