Implementation:FlagOpen FlagEmbedding Training Data JSONL Format
Interface Specification
Required fields per line:
{"query": str, "pos": List[str], "neg": List[str]}
Optional fields for knowledge distillation:
{"query": str, "pos": List[str], "neg": List[str], "pos_scores": List[float], "neg_scores": List[float]}
Optional field for ICL embedders:
{"query": str, "pos": List[str], "neg": List[str], "prompt": str}
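A record can be checked against the field types listed above before it is written out. The following is a minimal sketch, not part of FlagEmbedding: `validate_record` and `REQUIRED_FIELDS` are hypothetical names introduced here for illustration, and the checks cover only the types stated in this specification.

```python
from typing import Any, List

# Field names taken from the specification above.
REQUIRED_FIELDS = ("query", "pos", "neg")


def _is_str_list(value: Any) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none).

    Hypothetical helper: it mirrors the documented field types only.
    """
    errors = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            errors.append(f"missing required field: {name}")
    if "query" in record and not isinstance(record["query"], str):
        errors.append("query must be a string")
    for name in ("pos", "neg"):
        if name in record and not _is_str_list(record[name]):
            errors.append(f"{name} must be a list of strings")
    # "prompt" is optional; check its type only when it is present.
    if "prompt" in record and not isinstance(record["prompt"], str):
        errors.append("prompt must be a string")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a large file in one pass.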
I/O
Input: Raw text data (queries, positive passages, negative passages).
Output: JSONL file with one JSON object per line.
Examples
Creating basic training data:
import json

# Each dict becomes one line of the JSONL file.
data = [
    {
        "query": "What is machine learning?",
        "pos": ["Machine learning is a branch of artificial intelligence."],
        "neg": ["The stock market closed higher today."]
    },
    {
        "query": "How does photosynthesis work?",
        "pos": ["Photosynthesis converts sunlight into chemical energy."],
        "neg": ["The car engine uses combustion to generate power."]
    }
]

# Write one JSON object per line; UTF-8 keeps non-ASCII text intact.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
With distillation scores:
import json

data_with_scores = {
    "query": "What is deep learning?",
    "pos": ["Deep learning uses neural networks with many layers."],
    "neg": ["The weather forecast predicts rain tomorrow."],
    "pos_scores": [0.95],
    "neg_scores": [0.12]
}

# Append the record as a new line to the distillation file.
with open("train_data_distill.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(data_with_scores, ensure_ascii=False) + "\n")
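In the example above each score list has the same length as its passage list. Assuming that scores are meant to pair with passages one-to-one and in the same order (an assumption, since this page does not state it explicitly), a quick consistency check might look like the sketch below. `scores_aligned` is a hypothetical helper name, not a FlagEmbedding function.

```python
def scores_aligned(record: dict) -> bool:
    """Check that each score list, if present, matches its passage list.

    Assumption (not confirmed by this page): one numeric score per
    passage, in the same order as "pos" / "neg".
    """
    pairs = (("pos", "pos_scores"), ("neg", "neg_scores"))
    for passages_key, scores_key in pairs:
        if scores_key not in record:
            continue  # score fields are optional
        scores = record[scores_key]
        if len(scores) != len(record.get(passages_key, [])):
            return False
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(
            isinstance(s, (int, float)) and not isinstance(s, bool)
            for s in scores
        ):
            return False
    return True
```

Records without score fields pass the check, so the same function can be run over both plain and distillation data.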
Multiple files:
Training data can be split across multiple JSONL files and passed as a list to the training configuration:
train_data_files = [
    "train_data_part1.jsonl",
    "train_data_part2.jsonl",
    "train_data_part3.jsonl"
]
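To inspect or sanity-check data that is split this way, the files can be read back as a single stream of records. This is a minimal sketch for offline inspection, not the loader used by the training code; `iter_records` is a hypothetical name, and skipping blank lines is a choice made here rather than documented behavior.

```python
import json
from typing import Iterable, Iterator


def iter_records(paths: Iterable[str]) -> Iterator[dict]:
    """Yield every JSON object from a list of JSONL files, in file order."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_records(train_data_files))` counts the training examples across all parts without loading them into memory at once.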