Implementation:FlagOpen FlagEmbedding BGE M3 Split Data

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Data Processing, Length-based Splitting, Training Data Preparation
Last Updated	2026-02-09 00:00 GMT

Overview

A data preprocessing tool that splits training data into length-based buckets for efficient batch processing.

Description

The SplitByLengthHandler class splits training data into multiple files according to the longest text in each training example. For every example it tokenizes all texts (query, positives, and negatives), takes the maximum token count among them, assigns the example to a length-range bucket (e.g., 0-500 or 500-1000 tokens), and writes each bucket to its own JSONL file. Training can then build batches from examples of similar length, which reduces padding and improves efficiency. Tokenization runs in parallel across num_proc processes, and both the standard and the knowledge distillation data formats are supported.
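The bucketing rule described above can be sketched as follows. This is a minimal illustration, not the repository's code: the field names `query`, `pos`, and `neg`, the helper names, the whitespace "tokenizer", and the boundary convention (lower bound inclusive, upper bound exclusive, last bucket open-ended) are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import Callable, Dict, List, Tuple


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated tokens."""
    return len(text.split())


def max_length(example: Dict, count_tokens: Callable[[str], int]) -> int:
    """Longest text in one example, measured over query, positives, and negatives."""
    texts = [example["query"], *example.get("pos", []), *example.get("neg", [])]
    return max(count_tokens(text) for text in texts)


def assign_bucket(length: int, boundaries: List[int]) -> Tuple[int, float]:
    """Return the (lower, upper) range that contains `length`.

    Assumed convention: lower bound inclusive, upper bound exclusive,
    and everything at or above the last boundary falls into an open-ended bucket.
    """
    bounds = sorted(boundaries)
    index = bisect_right(bounds, length) - 1
    if index < 0:
        raise ValueError(f"length {length} is below the first boundary {bounds[0]}")
    lower = bounds[index]
    upper = bounds[index + 1] if index + 1 < len(bounds) else float("inf")
    return lower, upper


example = {
    "query": "what is a length bucket",
    "pos": ["a bucket groups examples whose longest text has a similar token count"],
    "neg": ["unrelated passage"],
}
longest = max_length(example, whitespace_token_count)
bucket = assign_bucket(longest, [0, 4, 8, 16])
```

Swapping `whitespace_token_count` for a call into a real tokenizer changes the measured lengths but not the bucketing logic.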

Usage

Use this tool when preparing training data for BGE-M3 or other embedding models, when you want batches built from examples of similar length, or when a variable-length dataset wastes computation on padding. Run it as a preprocessing step before training so the data is already organized by length range.
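To see why grouping by length reduces padding, the following sketch compares the share of padded token slots for mixed and length-sorted batches. The lengths and the helper name `padding_fraction` are made up for illustration, and the model assumes each batch is padded to its longest item.

```python
from typing import List


def padding_fraction(lengths: List[int], batch_size: int) -> float:
    """Fraction of token slots that are padding when every batch is padded to its longest item."""
    total_slots = 0
    real_tokens = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total_slots += max(batch) * len(batch)
        real_tokens += sum(batch)
    return 1 - real_tokens / total_slots


# Illustrative token lengths: two short and two long examples, interleaved.
mixed = [100, 4000, 120, 3900]
# Sorting by length mimics drawing each batch from a single length bucket.
grouped = sorted(mixed)

mixed_waste = padding_fraction(mixed, batch_size=2)
grouped_waste = padding_fraction(grouped, batch_size=2)
print(f"mixed: {mixed_waste:.1%} padding, grouped: {grouped_waste:.1%} padding")
```

In this toy case nearly half of the token slots are padding when short and long examples share a batch, against under two percent once they are grouped.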

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/BGE_M3/split_data_by_length.py
Lines: 1-209

Signature

class SplitByLengthHandler:
    def __init__(
        self,
        model_name_or_path: str,
        cache_dir: str=None,
        num_proc: int=16,
        length_list: list=[0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000],
        overwrite: bool=False
    ):
        pass

    def run(self, input_path: str, output_dir: str, log_name: str=None):
        """Process input file(s) and split by length ranges"""

Import

from split_data_by_length import SplitByLengthHandler

I/O Contract

Inputs

Name	Type	Required	Description
model_name_or_path	str	Yes	Tokenizer model name or path
input_path	str	Yes	Input JSONL file or directory
output_dir	str	Yes	Output directory for split files
cache_dir	str	No	Cache directory for datasets library
num_proc	int	No	Number of parallel processes (default: 16)
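length_list	list	No	Length boundaries for splitting (default: [0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000])
overwrite	bool	No	Whether to overwrite existing files (default: False)
log_name	str	No	Name of the log file written by run() (default: None)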
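For orientation, an input JSONL file might look like the records written below. This is a sketch under assumptions: the field names `query`, `pos`, `neg`, and the knowledge distillation fields `pos_scores` and `neg_scores` follow the convention commonly used for FlagEmbedding training data, but this page does not spell them out, so check them against the repository before relying on them. The helper names and the sample texts are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List

# Standard format (assumed field names): one query with positive and negative passages.
standard_record = {
    "query": "how are training examples grouped?",
    "pos": ["Examples are grouped by the token length of their longest text."],
    "neg": ["The weather is mild today.", "Bananas are rich in potassium."],
}

# Knowledge distillation format (assumed): the same fields plus one teacher score per passage.
kd_record = {**standard_record, "pos_scores": [0.92], "neg_scores": [0.05, 0.01]}


def write_jsonl(records: List[Dict], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict]:
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample_path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
write_jsonl([standard_record, kd_record], sample_path)
```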

Outputs

Name	Type	Description
split_files	List[str]	JSONL files named with length ranges (e.g., file_len-0-500.jsonl)
log_file	str	JSON log file with split statistics

Usage Examples

# Example 1: Command-line usage
# python split_data_by_length.py \
#   --input_path train_data/data.jsonl \
#   --output_dir train_data_split \
#   --model_name_or_path BAAI/bge-m3 \
#   --cache_dir .cache \
#   --length_list 0 500 1000 2000 3000 4000 5000 \
#   --num_proc 16 \
#   --overwrite

# Example 2: Programmatic usage
from split_data_by_length import SplitByLengthHandler

handler = SplitByLengthHandler(
    model_name_or_path="BAAI/bge-m3",
    cache_dir=".cache",
    num_proc=16,
    length_list=[0, 500, 1000, 2000, 3000, 4000],
    overwrite=False
)

handler.run(
    input_path="./train_data",
    output_dir="./train_data_split",
    log_name=".split_log"
)

# Example 3: Process a single file
handler = SplitByLengthHandler(
    model_name_or_path="BAAI/bge-m3",
    num_proc=8,
    length_list=[0, 512, 1024, 2048]
)

handler.run(
    input_path="./data/train.jsonl",
    output_dir="./data/split"
)

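# Expected output files:
# - ./data/split/train_len-0-512.jsonl
# - ./data/split/train_len-512-1024.jsonl
# - ./data/split/train_len-1024-2048.jsonl
# - ./data/split/.split_log
# Note: examples longer than the last boundary (2048) are not covered by the
# files listed above; the upstream script appears to place them in an
# open-ended final bucket (a "2048-inf" style suffix). Verify against the source.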
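When consuming the split files later, for example to choose a batch size per length range, the range can be recovered from the file name. The sketch below assumes the `<name>_len-<lower>-<upper>.jsonl` pattern shown in the examples on this page and additionally accepts `inf` as an upper bound, which is an assumption. `parse_split_name` and `group_by_range` are hypothetical helpers, not part of the repository.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Pattern for names such as "train_len-0-512.jsonl"; accepting "inf" is an assumption.
SPLIT_NAME = re.compile(r"^(?P<stem>.+)_len-(?P<lo>\d+)-(?P<hi>\d+|inf)\.jsonl$")


def parse_split_name(file_name: str) -> Optional[Tuple[str, int, float]]:
    """Return (stem, lower, upper) for a split file name, or None if it does not match."""
    match = SPLIT_NAME.match(file_name)
    if match is None:
        return None
    upper = float("inf") if match["hi"] == "inf" else int(match["hi"])
    return match["stem"], int(match["lo"]), upper


def group_by_range(file_names: List[str]) -> Dict[Tuple[int, float], List[str]]:
    """Group split files by their length range, ignoring non-matching names such as the log."""
    groups: Dict[Tuple[int, float], List[str]] = defaultdict(list)
    for name in file_names:
        parsed = parse_split_name(name)
        if parsed is not None:
            _, lower, upper = parsed
            groups[(lower, upper)].append(name)
    return dict(groups)


names = [
    "train_len-0-512.jsonl",
    "dev_len-0-512.jsonl",
    "train_len-512-1024.jsonl",
    ".split_log",
]
ranges = group_by_range(names)
```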
