Implementation:FlagOpen FlagEmbedding BGE Coder CorpusGenerator
| Knowledge Sources | |
|---|---|
| Domains | Code Retrieval, Data Generation, Corpus Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A corpus generator class for loading and processing code documents from the CoIR-Retrieval dataset for BGE-Coder training.
Description
The CorpusGenerator class loads code documents from directory structures, filters them by document length category, and cleans the code snippets. It can read corpus data from both local directories and external JSONL files, applies language-specific cleaning to remove comments and invalid code, and can sample a subset of the corpus for training and evaluation. The class is designed for multi-language code datasets organized into one directory per programming language, and it supports configurable document length filtering (e.g., 0-500 tokens, 500-1000 tokens).
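The loading step described above can be sketched roughly as follows. This is a minimal illustration rather than the class's actual code: the helper name `load_jsonl_corpus` is hypothetical, and so is the assumption that each length category is stored in a file named `<category>.jsonl` (for example `len_0_500.jsonl`) somewhere under the corpus directory.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def load_jsonl_corpus(corpus_dir: str, doc_length: Iterable[str]) -> List[Dict]:
    """Read every JSONL record whose file stem matches a requested length category.

    Hypothetical layout: <corpus_dir>/**/<category>.jsonl, e.g. len_0_500.jsonl.
    """
    wanted = set(doc_length)
    records: List[Dict] = []
    # sorted() keeps the load order deterministic across runs
    for path in sorted(Path(corpus_dir).rglob("*.jsonl")):
        if path.stem not in wanted:
            continue  # skip length categories that were not requested
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    records.append(json.loads(line))
    return records
```

Matching on the exact file stem, rather than on a substring, avoids accidentally loading a category such as `len_0_5000` when only `len_0_500` was requested.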
Usage
Use this class when preparing code corpus data for training or evaluating the BGE-Coder embedding model. It is particularly useful for loading code snippets from the CoIR-Retrieval benchmark, filtering by document length to handle variable-length code documents, and preparing positive/negative samples for contrastive learning in code retrieval tasks.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/data_generation/corpus_generator.py
- Lines: 1-104
Signature
class CorpusGenerator:
    def __init__(
        self,
        cache_dir: str = None,
    ):
        pass

    def run(
        self,
        num_samples: int = -1,
        max_corpus: int = -1,
        corpus_dir: str = None,
        doc_length: List[str] = ["len_0_500"],
        external_path: List[str] = None,
        source_language: str = None
    ) -> Tuple[List[dict], List[dict]]:
        """Load and process corpus data, returning small and full corpus lists"""
Import
from corpus_generator import CorpusGenerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cache_dir | str | No | Cache directory for loading datasets |
| num_samples | int | No | Number of samples to select for small corpus (-1 for all) |
| max_corpus | int | No | Maximum size of full corpus (-1 for all) |
| corpus_dir | str | No | Directory containing corpus JSONL files |
| doc_length | List[str] | No | Document length categories to load (default: ["len_0_500"]) |
| external_path | List[str] | No | External JSONL file paths to include |
| source_language | str | No | Programming language of the source code |
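The `doc_length` values appear to encode a lower and upper bound in their names. The sketch below shows one way such names could be parsed and used to bucket a document by its length. The helper names are hypothetical, the half-open interpretation `[low, high)` is an assumption, and the source does not say how the class itself measures length or treats boundaries.

```python
import re
from typing import List, Optional, Tuple

# Matches names of the form len_<low>_<high>, e.g. len_0_500
_CATEGORY = re.compile(r"len_(\d+)_(\d+)")


def parse_length_category(name: str) -> Tuple[int, int]:
    """Turn a name such as 'len_0_500' into its (low, high) bounds."""
    match = _CATEGORY.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized length category: {name!r}")
    low, high = int(match.group(1)), int(match.group(2))
    if low >= high:
        raise ValueError(f"empty range in length category: {name!r}")
    return low, high


def category_for(length: int, categories: List[str]) -> Optional[str]:
    """Return the first category whose assumed range [low, high) contains length."""
    for name in categories:
        low, high = parse_length_category(name)
        if low <= length < high:
            return name
    return None  # length falls outside every requested category
```

Under this assumed convention, a document of length 500 falls into `len_500_1000` rather than `len_0_500`, so adjacent categories never overlap.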
Outputs
| Name | Type | Description |
|---|---|---|
| small_corpus_list | List[dict] | Sampled subset of corpus for queries/small tasks |
| corpus_list | List[dict] | Full corpus list for retrieval pool |
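The relationship between the two outputs and the `-1` convention for `num_samples` and `max_corpus` can be illustrated with a small sketch. It assumes that the small corpus is drawn from the already capped full corpus and that sampling is uniform and seeded. The function name `split_corpus` and the `seed` parameter are hypothetical, and the real class may sample differently.

```python
import random
from typing import Dict, List, Tuple


def split_corpus(
    records: List[Dict],
    num_samples: int = -1,
    max_corpus: int = -1,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (small_corpus_list, corpus_list); -1 means keep everything."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    corpus_list = list(records)
    # Cap the retrieval pool only when a non-negative limit below its size is given
    if 0 <= max_corpus < len(corpus_list):
        corpus_list = rng.sample(corpus_list, max_corpus)
    # -1, or a request at least as large as the pool, keeps every document
    if 0 <= num_samples < len(corpus_list):
        small_corpus_list = rng.sample(corpus_list, num_samples)
    else:
        small_corpus_list = list(corpus_list)
    return small_corpus_list, corpus_list
```

With these assumptions, every document in the small corpus is guaranteed to be present in the full corpus as well.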
Usage Examples
# Example: Load Python code corpus with short documents
from corpus_generator import CorpusGenerator
generator = CorpusGenerator(cache_dir=".cache")
# Load corpus with document length filtering
small_corpus, full_corpus = generator.run(
    num_samples=1000,
    max_corpus=10000,
    corpus_dir="./data/python",
    doc_length=["len_0_500", "len_500_1000"],
    external_path=["./external_data/python_extra.jsonl"],
    source_language="python"
)
print(f"Small corpus size: {len(small_corpus)}")
print(f"Full corpus size: {len(full_corpus)}")
# Example document structure
print(small_corpus[0].keys()) # dict_keys(['text', 'language', ...])