Implementation:FlagOpen FlagEmbedding BGE M3 Split Data
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Length-based Splitting, Training Data Preparation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A data preprocessing tool that splits training data into length-based buckets for efficient batch processing.
Description
The SplitByLengthHandler class splits training data into multiple files based on the maximum token length of query-passage pairs. It tokenizes all texts in each training example (query + positives + negatives), determines the maximum length among all texts per example, assigns examples to length range buckets (e.g., 0-500, 500-1000 tokens), and saves split datasets to separate JSONL files. This enables length-aware batch construction during training to minimize padding and improve training efficiency. The tool uses parallel processing for fast tokenization and supports both standard and knowledge distillation data formats.
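The bucketing logic described above can be sketched as follows. This is a minimal illustration, not the code in split_data_by_length.py: the field names `query`, `pos`, and `neg`, the half-open ranges `[low, high)`, the open-ended final range, and the whitespace tokenizer (standing in for the real model tokenizer so the sketch runs without downloads) are all assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Range = Tuple[float, float]


def build_ranges(length_list: List[int]) -> List[Range]:
    """Turn boundaries into (low, high) ranges; the last range is open-ended (assumption)."""
    bounds = sorted(length_list)
    uppers = bounds[1:] + [math.inf]
    return list(zip(bounds, uppers))


def max_length(example: Dict, tokenize: Callable[[str], List[str]]) -> int:
    """Longest token count among the query, positives, and negatives of one example."""
    texts = [example["query"]] + example["pos"] + example["neg"]
    return max(len(tokenize(text)) for text in texts)


def assign_bucket(length: int, ranges: List[Range]) -> Range:
    """Return the half-open range [low, high) that contains `length` (assumption)."""
    for low, high in ranges:
        if low <= length < high:
            return (low, high)
    raise ValueError(f"length {length} is below the smallest boundary")


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer; the real tool uses the model's tokenizer."""
    return text.split()


examples = [
    {"query": "short query", "pos": ["a b c"], "neg": ["d e"]},
    {"query": "q", "pos": ["w " * 600], "neg": []},
]

ranges = build_ranges([0, 500, 1000])
buckets: Dict[Range, List[Dict]] = {}
for example in examples:
    bucket = assign_bucket(max_length(example, whitespace_tokenize), ranges)
    buckets.setdefault(bucket, []).append(example)

for (low, high), members in sorted(buckets.items()):
    print(f"{low}-{high}: {len(members)} example(s)")
```

The first example lands in the 0-500 bucket because its longest text has 3 tokens; the second lands in 500-1000 because its positive passage has 600 tokens.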
Usage
Use this tool when preparing training data for BGE-M3 or other embedding models. Grouping similar-length examples together lets each batch be padded only to the length of its own bucket, which reduces the computation wasted on padding in variable-length datasets. Run it as a preprocessing step before training so that the data is already organized by length range.
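To see why grouping by length helps, the arithmetic below compares the number of token slots processed when batches mix short and long examples against batches built from length-sorted data. The lengths and batch size are made-up illustrative values, and sorting stands in for bucketing; it is a sketch of the idea, not a measurement of the tool.

```python
from typing import List


def padded_tokens(lengths: List[int], batch_size: int) -> int:
    """Total token slots when every batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical per-example maximum lengths, alternating short and long.
lengths = [100, 3000, 120, 2900, 90, 3100, 110, 2800]

real = sum(lengths)                                      # tokens that carry content
mixed = padded_tokens(lengths, batch_size=2)             # short and long share a batch
grouped = padded_tokens(sorted(lengths), batch_size=2)   # similar lengths share a batch

print(real)     # 12220
print(mixed)    # 23600
print(grouped)  # 12440
```

In this toy case the mixed batches process almost twice as many token slots as the content requires, while the grouped batches add under 2% padding.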
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_M3/split_data_by_length.py
- Lines: 1-209
Signature
class SplitByLengthHandler:
    def __init__(
        self,
        model_name_or_path: str,
        cache_dir: str=None,
        num_proc: int=16,
        length_list: list=[0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000],
        overwrite: bool=False
    ):
        pass

    def run(self, input_path: str, output_dir: str, log_name: str=None):
        """Process input file(s) and split by length ranges"""
Import
from split_data_by_length import SplitByLengthHandler
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | Tokenizer model name or path |
| input_path | str | Yes | Input JSONL file or directory |
| output_dir | str | Yes | Output directory for split files |
| cache_dir | str | No | Cache directory for datasets library |
| num_proc | int | No | Number of parallel processes (default: 16) |
| length_list | list | No | Length boundaries for splitting (default: [0, 500, 1000, ...]) |
| overwrite | bool | No | Whether to overwrite existing files (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| split_files | List[str] | JSONL files named with length ranges (e.g., file_len-0-500.jsonl) |
| log_file | str | JSON log file with split statistics |
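The naming pattern in the table can be reproduced with a small helper. This is a sketch inferred from the example name `file_len-0-500.jsonl`; the function name `split_file_name` is hypothetical and is not part of the tool.

```python
from pathlib import PurePosixPath


def split_file_name(input_path: str, low: int, high: int) -> str:
    """Build '<input stem>_len-<low>-<high>.jsonl' for one length range."""
    stem = PurePosixPath(input_path).stem  # 'train' for './data/train.jsonl'
    return f"{stem}_len-{low}-{high}.jsonl"


print(split_file_name("./data/train.jsonl", 0, 512))     # train_len-0-512.jsonl
print(split_file_name("./data/train.jsonl", 512, 1024))  # train_len-512-1024.jsonl
```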
Usage Examples
# Example 1: Command-line usage
# python split_data_by_length.py \
#     --input_path train_data/data.jsonl \
#     --output_dir train_data_split \
#     --model_name_or_path BAAI/bge-m3 \
#     --cache_dir .cache \
#     --length_list 0 500 1000 2000 3000 4000 5000 \
#     --num_proc 16 \
#     --overwrite
# Example 2: Programmatic usage
from split_data_by_length import SplitByLengthHandler
handler = SplitByLengthHandler(
    model_name_or_path="BAAI/bge-m3",
    cache_dir=".cache",
    num_proc=16,
    length_list=[0, 500, 1000, 2000, 3000, 4000],
    overwrite=False
)
handler.run(
    input_path="./train_data",
    output_dir="./train_data_split",
    log_name=".split_log"
)
# Example 3: Process single file
handler = SplitByLengthHandler(
    model_name_or_path="BAAI/bge-m3",
    num_proc=8,
    length_list=[0, 512, 1024, 2048]
)
handler.run(
    input_path="./data/train.jsonl",
    output_dir="./data/split"
)
# Output files:
# - ./data/split/train_len-0-512.jsonl
# - ./data/split/train_len-512-1024.jsonl
# - ./data/split/train_len-1024-2048.jsonl
# - ./data/split/.split_log
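
Once the split files exist, a training script can recover each file's length range from its name and choose a batch size per bucket. The sketch below assumes file names follow the `_len-<low>-<high>.jsonl` pattern shown in the output list above; the `batch_size_for` rule and the 16384-token budget are hypothetical choices for illustration, not something the tool prescribes.

```python
import re
from typing import Dict, Optional, Tuple

# Matches the numeric range suffix shown in the example output names.
_RANGE_RE = re.compile(r"_len-(\d+)-(\d+)\.jsonl$")


def parse_range(file_name: str) -> Optional[Tuple[int, int]]:
    """Extract (low, high) from a split file name, or None if it does not match."""
    match = _RANGE_RE.search(file_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def batch_size_for(high: int, token_budget: int = 16384) -> int:
    """Hypothetical rule: fit as many worst-case-length texts as the budget allows."""
    return max(1, token_budget // high)


files = [
    "train_len-0-512.jsonl",
    "train_len-512-1024.jsonl",
    "train_len-1024-2048.jsonl",
]

plan: Dict[str, int] = {}
for name in files:
    parsed = parse_range(name)
    if parsed is not None:
        plan[name] = batch_size_for(parsed[1])

for name, size in plan.items():
    print(name, size)
```

With this budget the three buckets get batch sizes of 32, 16, and 8, so shorter buckets use larger batches while the padded token count per batch stays roughly constant.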