Implementation:FlagOpen FlagEmbedding Training Data JSONL Format

Interface Specification

Required fields per line:

{"query": str, "pos": List[str], "neg": List[str]}

Optional fields for knowledge distillation:

{"query": str, "pos": List[str], "neg": List[str], "pos_scores": List[float], "neg_scores": List[float]}

Optional field for in-context learning (ICL) embedders:

{"query": str, "pos": List[str], "neg": List[str], "prompt": str}
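The field layouts above can be checked mechanically before training. The following is a minimal sketch, not part of FlagEmbedding: `validate_record` is a hypothetical helper, and the rule that each score list must match the length of its passage list is an assumption inferred from the examples on this page.

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one training record (empty if none)."""
    problems: List[str] = []

    # Required: "query" is a string, "pos" and "neg" are lists of strings.
    if not isinstance(record.get("query"), str):
        problems.append("'query' must be a string")
    for key in ("pos", "neg"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"'{key}' must be a list of strings")

    # Optional distillation scores.
    # Assumption: one score per passage, in the same order as the passage list.
    for text_key, score_key in (("pos", "pos_scores"), ("neg", "neg_scores")):
        if score_key not in record:
            continue
        scores = record[score_key]
        if not isinstance(scores, list) or not all(
            isinstance(s, (int, float)) for s in scores
        ):
            problems.append(f"'{score_key}' must be a list of numbers")
        elif isinstance(record.get(text_key), list) and len(scores) != len(record[text_key]):
            problems.append(f"'{score_key}' and '{text_key}' differ in length")

    # Optional prompt for ICL embedders.
    if "prompt" in record and not isinstance(record["prompt"], str):
        problems.append("'prompt' must be a string")

    return problems
```

Running this over every parsed line and stopping at the first non-empty result is usually enough to catch malformed records before a training job starts.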

I/O

Input: Raw text data (queries, positive passages, negative passages).

Output: JSONL file with one JSON object per line.

Examples

Creating basic training data:

import json

data = [
    {
        "query": "What is machine learning?",
        "pos": ["Machine learning is a branch of artificial intelligence."],
        "neg": ["The stock market closed higher today."]
    },
    {
        "query": "How does photosynthesis work?",
        "pos": ["Photosynthesis converts sunlight into chemical energy."],
        "neg": ["The car engine uses combustion to generate power."]
    }
]

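# Write one JSON object per line. UTF-8 with ensure_ascii=False keeps
# non-ASCII text readable in the output file.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")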

With distillation scores:

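import json

# A single record with one score per positive and negative passage.
data_with_scores = {
    "query": "What is deep learning?",
    "pos": ["Deep learning uses neural networks with many layers."],
    "neg": ["The weather forecast predicts rain tomorrow."],
    "pos_scores": [0.95],
    "neg_scores": [0.12]
}

# Mode "a" appends the record to the file instead of overwriting it.
with open("train_data_distill.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(data_with_scores, ensure_ascii=False) + "\n")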

Multiple files:

Training data can be split across multiple JSONL files and passed as a list to the training configuration:

train_data_files = [
    "train_data_part1.jsonl",
    "train_data_part2.jsonl",
    "train_data_part3.jsonl"
]
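To inspect or sanity-check data split this way, the files can be read back as one stream of records. This is a minimal sketch under stated assumptions: `iter_records` is a hypothetical helper, not a FlagEmbedding function, and skipping blank lines is a choice made here rather than documented behavior of the training code.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_records(paths: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield every JSON object from a list of JSONL files, in file order."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line_number, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    # Assumption: blank lines are ignored rather than treated as errors.
                    continue
                try:
                    yield json.loads(line)
                except json.JSONDecodeError as exc:
                    # Report the file and line so the bad record is easy to find.
                    raise ValueError(f"{path}:{line_number}: invalid JSON ({exc})") from exc
```

For example, `sum(1 for _ in iter_records(train_data_files))` counts the records across all parts without loading them into memory at once.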
