Implementation:FlagOpen FlagEmbedding BGE Coder Utils

From Leeroopedia


Knowledge Sources
Domains Code Processing, Data Cleaning, Multi-language Support
Last Updated 2026-02-09 00:00 GMT

Overview

Code cleaning utilities supporting 18+ programming languages with comment removal and validation.

Description

This module cleans source code across multiple programming languages. It removes language-specific comment patterns (single-line and multi-line), detects function and class definitions to validate that a snippet contains actual code, consolidates empty lines, normalizes Unicode, and filters out snippets below a minimum length. It supports 18 programming languages, including Python, Java, JavaScript, C/C++, Go, Rust, Ruby, PHP, TypeScript, Perl, Shell, and SQL. Each language has its own patterns for distinguishing valid code structures from files that contain only imports or comments.
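The steps described above can be illustrated with a minimal sketch. It is not the module's implementation: the name `sketch_clean_code`, the two pattern tables, and the choice of NFKC normalization are assumptions made for illustration, and only two languages are covered.

```python
import re
import unicodedata

# Hypothetical, simplified patterns for two languages only. The patterns used
# by the real module are not shown on this page.
COMMENT_PATTERNS = {
    "python": [r"#[^\n]*"],
    "java": [r"/\*.*?\*/", r"//[^\n]*"],
}
DEFINITION_PATTERNS = {
    "python": r"^\s*(def|class)\s+\w+",
    "java": r"\b(class|interface|enum)\s+\w+",
}


def sketch_clean_code(code: str, lang: str, length_threshold: int = 30) -> str:
    """Illustrative stand-in for clean_code; not the library implementation."""
    if lang not in COMMENT_PATTERNS:
        return ""
    # Normalize Unicode (NFKC is an assumption; the real form is not documented here).
    code = unicodedata.normalize("NFKC", code)
    # Strip comments. This naive version also strips comment markers that
    # happen to appear inside string literals.
    for pattern in COMMENT_PATTERNS[lang]:
        code = re.sub(pattern, "", code, flags=re.DOTALL)
    # Drop trailing whitespace, then collapse runs of blank lines into one newline.
    lines = [line.rstrip() for line in code.splitlines()]
    code = re.sub(r"\n{2,}", "\n", "\n".join(lines)).strip()
    # Reject snippets without a function/class definition or with too little content.
    if not re.search(DEFINITION_PATTERNS[lang], code, flags=re.MULTILINE):
        return ""
    if len(code) < length_threshold:
        return ""
    return code
```

Returning an empty string for rejected input mirrors the output contract documented on this page, so callers can filter with a simple truthiness check.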

Usage

Use this module when preprocessing code data for training embedding models, ensuring corpus quality by filtering out comment-only or import-only files, and normalizing code snippets from diverse sources and languages. The clean_code function is essential in the BGE-Coder data generation pipeline to ensure training data contains meaningful code content.
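As a rough sketch of the corpus-filtering use case, the loop below keeps only the samples a cleaner accepts. It assumes the cleaner follows the `clean_code` contract (an empty string means the snippet was rejected). The names `filter_corpus` and `stub_cleaner` and the `(code, lang)` pair layout are hypothetical; the stub only checks length so the example runs without the real module.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Signature shared with clean_code: (code, lang, length_threshold) -> str
Cleaner = Callable[[str, str, int], str]


def filter_corpus(
    samples: Iterable[Tuple[str, str]],
    cleaner: Cleaner,
    length_threshold: int = 30,
) -> Iterator[Tuple[str, str]]:
    """Yield (cleaned_code, lang) for every sample the cleaner accepts."""
    for code, lang in samples:
        cleaned = cleaner(code, lang, length_threshold)
        if cleaned:  # an empty string marks a rejected snippet
            yield cleaned, lang


def stub_cleaner(code: str, lang: str, length_threshold: int) -> str:
    """Stand-in for clean_code so the sketch runs on its own (length check only)."""
    code = code.strip()
    return code if len(code) >= length_threshold else ""


samples = [
    ("def add(a, b):\n    return a + b\n", "python"),
    ("import os", "python"),
]
kept = list(filter_corpus(samples, stub_cleaner))
print(len(kept))
```

In a real pipeline, `clean_code` would be passed in place of `stub_cleaner`. Taking the cleaner as a parameter keeps the loop easy to test in isolation.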

Code Reference

Source Location

Signature

def clean_content(content: str):
    """Clean generated content by removing thinking tags and quotes"""

def clean_code(code: str, lang: str, length_threshold: int = 30) -> str:
    """Clean code by removing comments and validating structure"""

Import

from utils import clean_content, clean_code

I/O Contract

Inputs

Name | Type | Required | Description
content | str | Yes | Raw content string to clean (clean_content)
code | str | Yes | Raw code string to clean (clean_code)
lang | str | Yes | Programming language identifier, e.g., "python", "java" (clean_code)
length_threshold | int | No | Minimum cleaned code length, default: 30 (clean_code)

Outputs

Name | Type | Description
cleaned_content | str | Content returned by clean_content, with thinking tags and quotes removed
cleaned_code | str | Code returned by clean_code, or an empty string if the input is invalid

Usage Examples

# Example 1: Clean Python code
from utils import clean_code

python_code = '''
# This is a comment
import numpy as np

def calculate_sum(a, b):
    """Calculate sum of two numbers."""
    # Add the numbers
    return a + b

class Calculator:
    def __init__(self):
        self.result = 0
'''

cleaned = clean_code(python_code, "python", length_threshold=50)
print(cleaned)
# Expected: the code with comments removed (the snippet defines a function and a class, so it should pass validation)

# Example 2: Clean Java code
java_code = '''
/**
 * Multi-line comment
 * about the class
 */
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
'''

cleaned_java = clean_code(java_code, "java", length_threshold=30)
print(cleaned_java)

# Example 3: Filter out invalid code
invalid_code = "import sys\nimport os\n# Only imports"
result = clean_code(invalid_code, "python", length_threshold=30)
print(result)  # Returns empty string (no actual code logic)

# Example 4: Clean LLM-generated content
from utils import clean_content

llm_output = '<think>Let me think...</think>\n"Here is the code"'
cleaned = clean_content(llm_output)
print(cleaned)  # Here is the code (thinking tags and surrounding quotes removed)
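The behavior Example 4 relies on can be approximated as follows. This is a sketch under assumptions, not the module's code: `sketch_clean_content` is a hypothetical name, and the exact tag and quote handling of the real `clean_content` is not documented on this page.

```python
import re


def sketch_clean_content(content: str) -> str:
    """Illustrative stand-in for clean_content; not the library implementation."""
    # Remove <think>...</think> blocks, including ones that span several lines.
    content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    content = content.strip()
    # Strip one pair of matching surrounding quotes, if present.
    if len(content) >= 2 and content[0] == content[-1] and content[0] in "\"'":
        content = content[1:-1].strip()
    return content


print(sketch_clean_content('<think>Let me think...</think>\n"Here is the code"'))
```

Only a matching pair of outer quotes is removed, so quotes inside the text are left alone.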
