Implementation:FlagOpen FlagEmbedding BGE Coder CorpusGenerator

From Leeroopedia
Revision as of 14:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FlagOpen_FlagEmbedding_BGE_Coder_CorpusGenerator.md)


Knowledge Sources
Domains: Code Retrieval, Data Generation, Corpus Processing
Last Updated: 2026-02-09 00:00 GMT

Overview

A corpus generator class for loading and processing code documents from the CoIR-Retrieval dataset for BGE-Coder training.

Description

The CorpusGenerator class loads code documents from directory structures, filters them by document length category, and cleans the code snippets. It can load corpus data from both local directories and external JSONL files, applies language-specific code cleaning to remove comments and invalid code, and can sample subsets of the corpus for training and evaluation. The class is designed for multi-language code datasets organized into one directory per programming language, and its document length filtering is configurable through category names such as len_0_500 and len_500_1000 (0-500 tokens and 500-1000 tokens).
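The cleaning step described above can be pictured with a minimal sketch for Python snippets only. This is not the class's own cleaning code: the function name clean_python_snippet is hypothetical, and the choice of the standard-library ast and tokenize modules is an assumption made for illustration. It rejects snippets that do not parse and strips comments without touching '#' characters inside string literals.

```python
import ast
import io
import tokenize
from typing import Optional


def clean_python_snippet(code: str) -> Optional[str]:
    """Return `code` without comments, or None if it is not valid Python.

    Hypothetical helper for illustration; not part of CorpusGenerator.
    """
    try:
        ast.parse(code)  # treat snippets that do not parse as invalid code
    except (SyntaxError, ValueError):
        return None

    # Tokenizing (rather than using a regex) keeps '#' inside strings intact.
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    stripped = tokenize.untokenize(kept)

    # Removing a comment leaves trailing spaces or an empty line behind.
    # Note: this simple pass also drops blank lines everywhere, including
    # inside multi-line string literals.
    lines = [line.rstrip() for line in stripped.splitlines()]
    return "\n".join(line for line in lines if line)
```

Other languages would need their own tokenizer or parser; the sketch only shows the shape of a per-language cleaner that returns either cleaned text or a marker for "discard this document".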

Usage

Use this class when preparing code corpus data for training or evaluating the BGE-Coder embedding model. It is particularly useful for loading code snippets from the CoIR-Retrieval benchmark, filtering by document length to handle variable-length code documents, and preparing positive/negative samples for contrastive learning in code retrieval tasks.

Code Reference

Source Location

Signature

from typing import List, Tuple


class CorpusGenerator:
    def __init__(
        self,
        cache_dir: str = None,
    ):
        pass

    def run(
        self,
        num_samples: int = -1,
        max_corpus: int = -1,
        corpus_dir: str = None,
        doc_length: List[str] = ["len_0_500"],
        external_path: List[str] = None,
        source_language: str = None
    ) -> Tuple[List[dict], List[dict]]:
        """Load and process corpus data, returning small and full corpus lists"""

Import

from corpus_generator import CorpusGenerator

I/O Contract

Inputs

Name             Type       Required  Description
cache_dir        str        No        Cache directory for loading datasets
num_samples      int        No        Number of samples to select for small corpus (-1 for all)
max_corpus       int        No        Maximum size of full corpus (-1 for all)
corpus_dir       str        No        Directory containing corpus JSONL files
doc_length       List[str]  No        Document length categories to load (default: ["len_0_500"])
external_path    List[str]  No        External JSONL file paths to include
source_language  str        No        Programming language of the source code
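
As a rough illustration of how corpus_dir, doc_length, and external_path could combine, the sketch below gathers documents from JSONL files. The on-disk layout is an assumption: it supposes one file per length category named <corpus_dir>/<category>.jsonl, which the source does not confirm. The helper names read_jsonl and collect_documents are hypothetical and are not part of CorpusGenerator.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def read_jsonl(path: Path) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def collect_documents(
    corpus_dir: Optional[str],
    doc_length: List[str],
    external_path: Optional[List[str]] = None,
) -> List[Dict]:
    """Gather documents for the requested length categories.

    ASSUMPTION: each category lives in `<corpus_dir>/<category>.jsonl`.
    """
    docs: List[Dict] = []
    if corpus_dir is not None:
        for category in doc_length:
            candidate = Path(corpus_dir) / f"{category}.jsonl"
            if candidate.is_file():  # silently skip missing categories
                docs.extend(read_jsonl(candidate))
    # External files are appended as-is, regardless of length category.
    for extra in external_path or []:
        docs.extend(read_jsonl(Path(extra)))
    return docs
```

Under this assumed layout, collect_documents("./data/python", ["len_0_500", "len_500_1000"]) would read two files and concatenate their records.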

Outputs

Name               Type        Description
small_corpus_list  List[dict]  Sampled subset of corpus for queries/small tasks
corpus_list        List[dict]  Full corpus list for retrieval pool
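
The relationship between the two outputs and the -1 convention for num_samples and max_corpus can be sketched as follows. This is an assumption-laden illustration, not the class's actual logic: the function name sample_corpora and the seed parameter are hypothetical, and whether the real class draws the small corpus from the capped full corpus is not stated in the source.

```python
import random
from typing import Dict, List, Tuple


def sample_corpora(
    docs: List[Dict],
    num_samples: int = -1,
    max_corpus: int = -1,
    seed: int = 0,  # hypothetical parameter, added for reproducibility
) -> Tuple[List[Dict], List[Dict]]:
    """Return (small_corpus_list, corpus_list); -1 means "keep everything"."""
    rng = random.Random(seed)

    corpus_list = list(docs)
    if 0 <= max_corpus < len(corpus_list):
        # Cap the retrieval pool by sampling without replacement.
        corpus_list = rng.sample(corpus_list, max_corpus)

    if 0 <= num_samples < len(corpus_list):
        # ASSUMPTION: the small corpus is a subset of the capped pool.
        small_corpus_list = rng.sample(corpus_list, num_samples)
    else:
        small_corpus_list = list(corpus_list)

    return small_corpus_list, corpus_list
```

Drawing the small corpus from the capped pool guarantees that every sampled document is also present in the retrieval pool, which is convenient when the small corpus seeds queries whose positives must be retrievable.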

Usage Examples

# Example: Load Python code corpus with short documents
from corpus_generator import CorpusGenerator

generator = CorpusGenerator(cache_dir=".cache")

# Load corpus with document length filtering
small_corpus, full_corpus = generator.run(
    num_samples=1000,
    max_corpus=10000,
    corpus_dir="./data/python",
    doc_length=["len_0_500", "len_500_1000"],
    external_path=["./external_data/python_extra.jsonl"],
    source_language="python"
)

print(f"Small corpus size: {len(small_corpus)}")
print(f"Full corpus size: {len(full_corpus)}")

# Example document structure
print(small_corpus[0].keys())  # dict_keys(['text', 'language', ...])
