Implementation:FlagOpen FlagEmbedding BGE Coder CorpusGenerator

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Code Retrieval, Data Generation, Corpus Processing
Last Updated	2026-02-09 00:00 GMT

Overview

A corpus generator class for loading and processing code documents from the CoIR-Retrieval dataset for BGE-Coder training.

Description

The CorpusGenerator class loads code documents from a dataset organized into one directory per programming language, keeps only the requested document length categories (e.g., 0-500 tokens, 500-1000 tokens), and cleans each snippet with language-specific rules that remove comments and invalid code. Corpus data can come from local directories as well as external JSONL files, and the class can sample a subset of the corpus for training and evaluation.
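The cleaning rules themselves are not shown on this page. As a rough illustration of what "remove comments and invalid code" could mean for one language, the following sketch handles Python only and uses the standard library. The function name clean_python_snippet is hypothetical and is not part of CorpusGenerator.

```python
import ast
import io
import tokenize
from typing import Optional


def clean_python_snippet(code: str) -> Optional[str]:
    """Hypothetical helper: drop '#' comments, or return None if the code is invalid.

    This is an assumption about what cleaning might involve, not the
    repository's actual logic.
    """
    try:
        # Treat anything that does not parse as invalid code.
        ast.parse(code)
    except SyntaxError:
        return None

    # Tokenizing (instead of using a regex) keeps '#' characters that
    # appear inside string literals.
    tokens = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(code).readline)
        if tok.type != tokenize.COMMENT
    ]
    # untokenize keeps the original token positions, so removed comments
    # may leave trailing spaces behind on their lines.
    return tokenize.untokenize(tokens)
```

Docstrings are left untouched here; whether the real cleaner removes them is not stated in this page.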

Usage

Use this class when preparing code corpus data for training or evaluating the BGE-Coder embedding model. It is particularly useful for loading code snippets from the CoIR-Retrieval benchmark, filtering by document length to handle variable-length code documents, and preparing positive/negative samples for contrastive learning in code retrieval tasks.
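To make the length filtering concrete, here is a minimal sketch of how a category name such as len_0_500 could be mapped to a numeric range and applied to documents. Everything in it is an assumption for illustration: the helper names, the half-open interval, the whitespace-based token count, and the "text" key. The actual class may measure length differently.

```python
from typing import Dict, List, Tuple


def parse_length_category(name: str) -> Tuple[int, int]:
    """Hypothetical helper: turn 'len_0_500' into (0, 500)."""
    prefix, low, high = name.split("_")
    if prefix != "len":
        raise ValueError(f"unexpected category name: {name}")
    return int(low), int(high)


def filter_by_length(
    docs: List[Dict], categories: List[str], text_key: str = "text"
) -> List[Dict]:
    """Keep documents whose length falls into any requested category.

    Assumption: length is the number of whitespace-separated tokens and
    each range is half-open, i.e. low <= length < high.
    """
    ranges = [parse_length_category(c) for c in categories]
    kept = []
    for doc in docs:
        n_tokens = len(doc[text_key].split())
        if any(low <= n_tokens < high for low, high in ranges):
            kept.append(doc)
    return kept
```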

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/BGE_Coder/data_generation/corpus_generator.py
Lines: 1-104

Signature

from typing import List, Tuple

class CorpusGenerator:
    def __init__(
        self,
        cache_dir: str = None,
    ):
        pass

    def run(
        self,
        num_samples: int = -1,
        max_corpus: int = -1,
        corpus_dir: str = None,
        doc_length: List[str] = ["len_0_500"],
        external_path: List[str] = None,
        source_language: str = None
    ) -> Tuple[List[dict], List[dict]]:
        """Load and process corpus data, returning small and full corpus lists"""

Import

from corpus_generator import CorpusGenerator

I/O Contract

Inputs

Name	Type	Required	Description
cache_dir	str	No	Cache directory for loading datasets
num_samples	int	No	Number of samples to select for small corpus (-1 for all)
max_corpus	int	No	Maximum size of full corpus (-1 for all)
corpus_dir	str	No	Directory containing corpus JSONL files
doc_length	List[str]	No	Document length categories to load (default: ["len_0_500"])
external_path	List[str]	No	External JSONL file paths to include
source_language	str	No	Programming language of the source code
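The inputs corpus_dir, doc_length, and external_path suggest that documents are gathered from several JSONL sources. The sketch below shows one way such loading could work. The directory layout (one subdirectory per length category containing .jsonl files) and the function names are assumptions, not the layout confirmed by the repository.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_jsonl(path: Path) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    docs = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                docs.append(json.loads(line))
    return docs


def load_corpus(
    corpus_dir: str,
    doc_length: List[str],
    external_path: Optional[List[str]] = None,
) -> List[Dict]:
    """Hypothetical loader: <corpus_dir>/<category>/*.jsonl plus external files."""
    docs: List[Dict] = []
    for category in doc_length:
        category_dir = Path(corpus_dir) / category
        if not category_dir.is_dir():
            # Skip categories that are not present on disk.
            continue
        for file in sorted(category_dir.glob("*.jsonl")):
            docs.extend(load_jsonl(file))
    # External files are appended after the local corpus.
    for extra in external_path or []:
        docs.extend(load_jsonl(Path(extra)))
    return docs
```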

Outputs

Name	Type	Description
small_corpus_list	List[dict]	Sampled subset of corpus for queries/small tasks
corpus_list	List[dict]	Full corpus list for retrieval pool
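The relationship between the two outputs and the -1 defaults can be sketched as follows. This is an assumption about the semantics, not the class's confirmed behavior: here the full corpus is capped at max_corpus first, and the small corpus is then sampled from the capped list. The function name split_corpus and the seed parameter are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_corpus(
    docs: List[Dict],
    num_samples: int = -1,
    max_corpus: int = -1,
    seed: int = 42,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (small_corpus_list, corpus_list); -1 means "keep everything"."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    corpus_list = list(docs)
    # Cap the retrieval pool only when a positive limit below its size is given.
    if 0 < max_corpus < len(corpus_list):
        corpus_list = rng.sample(corpus_list, max_corpus)
    # Draw the small corpus from the (possibly capped) pool.
    if 0 < num_samples < len(corpus_list):
        small_corpus_list = rng.sample(corpus_list, num_samples)
    else:
        small_corpus_list = list(corpus_list)
    return small_corpus_list, corpus_list
```

Sampling the small corpus from the capped pool guarantees that every sampled document is also present in the retrieval pool, which is a design choice of this sketch.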

Usage Examples

# Example: Load Python code corpus with short documents
from corpus_generator import CorpusGenerator

generator = CorpusGenerator(cache_dir=".cache")

# Load corpus with document length filtering
small_corpus, full_corpus = generator.run(
    num_samples=1000,
    max_corpus=10000,
    corpus_dir="./data/python",
    doc_length=["len_0_500", "len_500_1000"],
    external_path=["./external_data/python_extra.jsonl"],
    source_language="python"
)

print(f"Small corpus size: {len(small_corpus)}")
print(f"Full corpus size: {len(full_corpus)}")

# Example document structure
print(small_corpus[0].keys())  # dict_keys(['text', 'language', ...])
