Implementation:FlagOpen FlagEmbedding BGE Coder Utils
| Knowledge Sources | |
|---|---|
| Domains | Code Processing, Data Cleaning, Multi-language Support |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Code cleaning utilities for 18 programming languages, covering comment removal and structural validation, plus a helper for cleaning LLM-generated text.
Description
This module provides code cleaning functions for multiple programming languages. It applies language-specific comment patterns (single-line and multi-line), checks for function or class definitions to confirm that a snippet contains real code, consolidates empty lines, normalizes Unicode, and rejects snippets shorter than a minimum length. It supports 18 programming languages, including Python, Java, JavaScript, C/C++, Go, Rust, Ruby, PHP, TypeScript, Perl, Shell, and SQL. Each language has its own patterns for distinguishing valid code structures from files that contain only imports or comments. A companion function, clean_content, strips thinking tags and surrounding quotes from generated text.
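The cleaning steps described above can be illustrated with a minimal sketch. It is not the module's actual implementation: the function name sketch_clean, the two pattern tables, and the choice of NFC normalization are all assumptions made for illustration, and only Python and Java are covered.

```python
import re
import unicodedata

# Hypothetical comment patterns (single-line, multi-line) per language.
# The real module's patterns may differ. These regexes are naive: they also
# strip "#" or "//" that appear inside string literals.
COMMENT_PATTERNS = {
    "python": (r"#[^\n]*", r'"""[\s\S]*?"""' + "|" + r"'''[\s\S]*?'''"),
    "java": (r"//[^\n]*", r"/\*[\s\S]*?\*/"),
}

# Hypothetical patterns that signal "real" code rather than imports only.
DEFINITION_PATTERNS = {
    "python": r"^\s*(def|class)\s+\w+",
    "java": r"\b(class|interface|enum)\s+\w+",
}


def sketch_clean(code: str, lang: str, length_threshold: int = 30) -> str:
    """Return cleaned code, or an empty string if the snippet is rejected."""
    # Unicode normalization (NFC is an assumption) and BOM removal.
    text = unicodedata.normalize("NFC", code).strip("\ufeff").strip()
    single, multi = COMMENT_PATTERNS[lang]
    # Remove multi-line comments first, then single-line comments.
    text = re.sub(multi, "", text)
    text = re.sub(single, "", text)
    # Collapse runs of blank (or whitespace-only) lines into one newline.
    text = re.sub(r"\n\s*\n", "\n", text).strip()
    # Reject snippets with no function/class definition.
    if not re.search(DEFINITION_PATTERNS[lang], text, flags=re.MULTILINE):
        return ""
    # Reject snippets that are too short after cleaning.
    if len(text) < length_threshold:
        return ""
    return text
```

Returning an empty string for rejected input, rather than raising, lets a caller filter a corpus with a simple truthiness check.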
Usage
Use this module when preprocessing code data for training embedding models, ensuring corpus quality by filtering out comment-only or import-only files, and normalizing code snippets from diverse sources and languages. The clean_code function is essential in the BGE-Coder data generation pipeline to ensure training data contains meaningful code content.
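As a rough sketch of the corpus-filtering use case, the loop below keeps only records whose cleaned code is non-empty. The record keys ("code", "lang"), the filter_corpus helper, and the demo_cleaner stand-in are hypothetical; in the actual pipeline the cleaner argument would presumably be clean_code.

```python
from typing import Callable, Dict, Iterable, List

# Any callable shaped like clean_code(code, lang) -> cleaned text or "".
Cleaner = Callable[[str, str], str]


def filter_corpus(records: Iterable[Dict[str, str]], cleaner: Cleaner) -> List[Dict[str, str]]:
    """Keep records whose code survives cleaning; store the cleaned text."""
    kept = []
    for record in records:
        cleaned = cleaner(record["code"], record["lang"])
        if cleaned:  # an empty string means the snippet was rejected
            kept.append({**record, "code": cleaned})
    return kept


def demo_cleaner(code: str, lang: str) -> str:
    """Stand-in for clean_code so this sketch runs on its own."""
    stripped = code.strip()
    return stripped if "def " in stripped else ""


corpus = [
    {"lang": "python", "code": "def f():\n    return 1\n"},
    {"lang": "python", "code": "import os\nimport sys\n"},
]
kept = filter_corpus(corpus, demo_cleaner)
print(len(kept))
```

Passing the cleaner as a parameter keeps the filtering loop independent of any one cleaning implementation.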
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/data_generation/utils.py
- Lines: 1-128
Signature
def clean_content(content: str):
    """Clean generated content by removing thinking tags and quotes"""
def clean_code(code: str, lang: str, length_threshold: int = 30) -> str:
    """Clean code by removing comments and validating structure"""
Import
from utils import clean_content, clean_code
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| content | str | Yes | Raw content string to clean |
| code | str | Yes | Raw code string to clean |
| lang | str | Yes | Programming language identifier (e.g., "python", "java") |
| length_threshold | int | No | Minimum cleaned code length (default: 30) |
Outputs
| Name | Type | Description |
|---|---|---|
| cleaned_content | str | Cleaned content with tags/quotes removed |
| cleaned_code | str | Cleaned code or empty string if invalid |
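To make the content row of the contract concrete, here is a minimal sketch of stripping thinking tags and surrounding quotes. The function name sketch_clean_content and the exact rules (remove any <think>...</think> span, then drop one pair of matching outer quotes) are assumptions; the real clean_content may behave differently.

```python
import re


def sketch_clean_content(content: str) -> str:
    """Remove <think>...</think> spans and one pair of surrounding quotes."""
    # Drop any thinking block, including newlines inside it.
    text = re.sub(r"<think>[\s\S]*?</think>", "", content)
    text = text.strip()
    # Strip a single pair of matching quotes wrapping the whole string.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1]
    return text
```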
Usage Examples
# Example 1: Clean Python code
from utils import clean_code
python_code = '''
# This is a comment
import numpy as np
def calculate_sum(a, b):
    """Calculate sum of two numbers."""
    # Add the numbers
    return a + b
class Calculator:
    def __init__(self):
        self.result = 0
'''
cleaned = clean_code(python_code, "python", length_threshold=50)
print(cleaned)
# Expected: the cleaned code with comments removed, returned because it contains function/class definitions
# Example 2: Clean Java code
java_code = '''
/**
 * Multi-line comment
 * about the class
 */
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
'''
cleaned_java = clean_code(java_code, "java", length_threshold=30)
print(cleaned_java)
# Example 3: Filter out invalid code
invalid_code = "import sys\nimport os\n# Only imports"
result = clean_code(invalid_code, "python", length_threshold=30)
print(result) # Returns empty string (no actual code logic)
# Example 4: Clean LLM-generated content
from utils import clean_content
llm_output = '<think>Let me think...</think>\n"Here is the code"'
cleaned = clean_content(llm_output)
print(cleaned)  # Here is the code (thinking tags and surrounding quotes removed)