

Implementation:FlagOpen FlagEmbedding BGE M3 Split Data

From Leeroopedia


Knowledge Sources
Domains: Data Processing, Length-based Splitting, Training Data Preparation
Last Updated: 2026-02-09 00:00 GMT

Overview

A data preprocessing tool that splits training data into length-based buckets for efficient batch processing.

Description

The SplitByLengthHandler class splits training data into multiple files according to the longest text in each training example. For every example it tokenizes all texts (the query, the positives, and the negatives), takes the maximum token count among them, assigns the example to the matching length range (e.g., 0-500 or 500-1000 tokens), and writes each range to its own JSONL file. Grouping examples this way lets training build batches from similar-length examples, which reduces padding and improves training efficiency. Tokenization runs in parallel across multiple processes (num_proc), and both the standard and the knowledge distillation data formats are supported.
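The bucketing step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper names (max_text_length, assign_bucket, split_by_length) are hypothetical, the record fields "query", "pos", and "neg" are assumed, and a whitespace token counter stands in for the real tokenizer. Treating each range as half-open and placing lengths beyond the last boundary in an open-ended bucket are also assumptions of this sketch.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def max_text_length(example: dict, count_tokens: Callable[[str], int]) -> int:
    """Longest token count among the query, positives, and negatives."""
    texts = [example["query"], *example.get("pos", []), *example.get("neg", [])]
    return max(count_tokens(text) for text in texts)


def assign_bucket(length: int, boundaries: List[int]) -> Tuple[int, float]:
    """Return the half-open range [low, high) containing `length`.

    Assumption: lengths at or beyond the last boundary fall into an
    open-ended final bucket (low, inf).
    """
    idx = bisect_right(boundaries, length) - 1
    if idx < 0:
        raise ValueError(f"length {length} is below the first boundary")
    low = boundaries[idx]
    high = boundaries[idx + 1] if idx + 1 < len(boundaries) else float("inf")
    return low, high


def split_by_length(
    examples: List[dict],
    boundaries: List[int],
    count_tokens: Callable[[str], int],
) -> Dict[Tuple[int, float], List[dict]]:
    """Group examples by the bucket of their longest text."""
    buckets: Dict[Tuple[int, float], List[dict]] = defaultdict(list)
    for example in examples:
        length = max_text_length(example, count_tokens)
        buckets[assign_bucket(length, boundaries)].append(example)
    return dict(buckets)


# Stand-in tokenizer for illustration only: counts whitespace-separated words.
def whitespace_count(text: str) -> int:
    return len(text.split())


examples = [
    {"query": "a b", "pos": ["a b c d e"], "neg": ["a"]},
    {"query": "a", "pos": ["a b"], "neg": ["a b"]},
]
buckets = split_by_length(examples, [0, 3, 6], whitespace_count)
```

In this toy run the first example lands in the 3-6 range because its longest text has five words, even though its query has only two.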

Usage

Use this tool when preparing training data for BGE-M3 or other embedding models, particularly when the dataset contains texts of widely varying length. Grouping similar-length examples together makes batch construction more efficient and cuts the computation wasted on padding. Run it as a preprocessing step before training so that the data is already organized by length range.
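A toy calculation can illustrate why grouping by length reduces padding. The sketch below is not part of the tool: padding_fraction is a hypothetical helper, the lengths are made-up numbers, and sorting stands in for length-based bucketing. It assumes each batch is padded to the length of its longest member.

```python
from typing import List


def padding_fraction(lengths: List[int], batch_size: int) -> float:
    """Share of token slots that are padding when every batch is padded
    to the length of its longest member."""
    total_slots = 0
    real_tokens = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total_slots += max(batch) * len(batch)
        real_tokens += sum(batch)
    return 1 - real_tokens / total_slots


# Made-up token lengths: short and long examples interleaved.
mixed = [10, 900, 12, 880, 15, 910, 11, 905]
# Sorting is a stand-in for reading examples bucket by bucket.
grouped = sorted(mixed)

mixed_waste = padding_fraction(mixed, batch_size=4)
grouped_waste = padding_fraction(grouped, batch_size=4)
```

With the interleaved order, every batch is padded up to a long example, so a large share of the slots is padding; once similar lengths are batched together, the short examples no longer pay for the long ones.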

Code Reference

Source Location

Signature

class SplitByLengthHandler:
    def __init__(
        self,
        model_name_or_path: str,
        cache_dir: str=None,
        num_proc: int=16,
        length_list: list=[0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000],
        overwrite: bool=False
    ):
        pass

    def run(self, input_path: str, output_dir: str, log_name: str=None):
        """Process input file(s) and split by length ranges"""

Import

from split_data_by_length import SplitByLengthHandler

I/O Contract

Inputs

Name                Type  Required  Description
model_name_or_path  str   Yes       Tokenizer model name or path
input_path          str   Yes       Input JSONL file or directory
output_dir          str   Yes       Output directory for split files
cache_dir           str   No        Cache directory for datasets library
num_proc            int   No        Number of parallel processes (default: 16)
length_list         list  No        Length boundaries for splitting (default: [0, 500, 1000, ...])
overwrite           bool  No        Whether to overwrite existing files (default: False)
log_name            str   No        Name of the log file, passed to run() (default: None)

Outputs

Name         Type       Description
split_files  List[str]  JSONL files named with length ranges (e.g., file_len-0-500.jsonl)
log_file     str        JSON log file with split statistics
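The naming pattern in the table can be sketched as a pair of helpers that build and parse a split file name. Both functions are hypothetical and inferred only from the file_len-0-500.jsonl pattern shown here; the tool's own naming logic may differ, for example in how it labels an open-ended range.

```python
import os
import re
from typing import Tuple

# Pattern inferred from names such as "file_len-0-500.jsonl" (an assumption).
_SPLIT_NAME = re.compile(r"^(?P<stem>.+)_len-(?P<low>\d+)-(?P<high>\d+)\.jsonl$")


def split_file_name(input_path: str, low: int, high: int) -> str:
    """Build the split file name for one length range of an input file."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return f"{stem}_len-{low}-{high}.jsonl"


def parse_split_file_name(file_name: str) -> Tuple[str, int, int]:
    """Recover (stem, low, high) from a split file name."""
    match = _SPLIT_NAME.match(os.path.basename(file_name))
    if match is None:
        raise ValueError(f"not a split file name: {file_name}")
    return match["stem"], int(match["low"]), int(match["high"])


name = split_file_name("./data/train.jsonl", 0, 512)
parsed = parse_split_file_name(name)
```

A parser like this can be useful downstream, for instance to choose a batch size per length range when loading the split files.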

Usage Examples

# Example 1: Command-line usage
# python split_data_by_length.py \
#   --input_path train_data/data.jsonl \
#   --output_dir train_data_split \
#   --model_name_or_path BAAI/bge-m3 \
#   --cache_dir .cache \
#   --length_list 0 500 1000 2000 3000 4000 5000 \
#   --num_proc 16 \
#   --overwrite

# Example 2: Programmatic usage
from split_data_by_length import SplitByLengthHandler

handler = SplitByLengthHandler(
    model_name_or_path="BAAI/bge-m3",
    cache_dir=".cache",
    num_proc=16,
    length_list=[0, 500, 1000, 2000, 3000, 4000],
    overwrite=False
)

handler.run(
    input_path="./train_data",
    output_dir="./train_data_split",
    log_name=".split_log"
)

# Example 3: Process single file
handler = SplitByLengthHandler(
    model_name_or_path="BAAI/bge-m3",
    num_proc=8,
    length_list=[0, 512, 1024, 2048]
)

handler.run(
    input_path="./data/train.jsonl",
    output_dir="./data/split"
)

# Output files:
# - ./data/split/train_len-0-512.jsonl
# - ./data/split/train_len-512-1024.jsonl
# - ./data/split/train_len-1024-2048.jsonl
# - ./data/split/.split_log
