Implementation:OpenBMB UltraFeedback Completion Storage
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Logic for appending generated completions to dataset examples and persisting the results as JSON files.
Description
The completion storage logic differs between the two backends:
HuggingFace backend (main.py:L215-222, L253-257): Each completion is appended inline within instruction_completion as a dictionary with keys: model, principle, custom_system_prompt, response. The dataset is serialized using json.dump with indent=4 to the same path it was loaded from.
vLLM backend (main_vllm.py:L185-190, L230-232): Responses are first added as a column using dataset.add_column("response", responses), then merged into the completions array using dataset.map with a lambda that updates the last completion entry. Temporary columns (prompt, response) are removed before saving.
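The vLLM merge step can be illustrated on a single row without the datasets library. This is a minimal sketch under stated assumptions: merge_response is a hypothetical helper name that mirrors the lambda passed to dataset.map, and the sample row (model names, prompts, texts) is invented for illustration.

```python
def merge_response(row):
    """Return an update dict whose last completion carries row['response'].

    Mirrors the lambda passed to dataset.map in the vLLM backend:
    all earlier completions are kept, and only the last entry is rebuilt.
    """
    last = dict(row["completions"][-1], **{"response": row["response"]})
    return {"completions": row["completions"][:-1] + [last]}


# Invented sample row; the merge works whether or not the last entry
# already has a "response" key.
row = {
    "instruction": "Say hi.",
    "completions": [
        {"model": "model-a", "principle": "helpfulness",
         "custom_system_prompt": "Be helpful.", "response": "Hi!"},
        {"model": "model-b", "principle": "honesty",
         "custom_system_prompt": "Be honest."},
    ],
    "prompt": "Be honest. Say hi.",   # temporary column
    "response": "Hello there.",       # temporary column
}

# dataset.map merges the returned dict into the row; emulate that here.
row.update(merge_response(row))

# Emulate dataset.remove_columns(["prompt", "response"]).
for temp_column in ("prompt", "response"):
    row.pop(temp_column)
```

After these steps the row holds only its original fields plus the completed completions list, which is the shape written to disk.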
Usage
This logic runs automatically as part of the generation pipeline. No separate invocation is needed.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/main.py (Lines 215-222 for append, Lines 253-257 for save)
- File: src/comparison_data_generation/main_vllm.py (Lines 185-190 for merge, Lines 230-232 for save)
Signature
# HuggingFace backend: inline append (main.py:L215-222)
example["completions"].append({
    "model": model_type,
    "principle": principle,
    "custom_system_prompt": principle_prompt,
    "response": response
})
# HuggingFace backend: save (main.py:L253-257)
result_path = load_path  # overwrite the file the dataset was loaded from
with open(result_path, "w") as f:
    json.dump([{k: v for k, v in data.items()} for data in dataset], f, indent=4)
# vLLM backend: merge + save (main_vllm.py:L185-190, L230-232)
dataset = dataset.add_column("response", responses)
# Copy each generated response into the last completion entry of its row
dataset = dataset.map(lambda x: {
    "completions": x["completions"][:-1] + [
        dict(x["completions"][-1], **{"response": x["response"]})
    ]
})
# Drop the temporary columns before saving
dataset = dataset.remove_columns(["prompt", "response"])
with open(result_path, "w") as f:
    # dataset_dict is not defined in this excerpt; presumably it holds the rows of dataset
    json.dump([{k: v for k, v in data.items()} for data in dataset_dict], f, indent=4)
Import
import json
import os
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| example["completions"] | List[Dict] | Yes | Existing completions list to append to |
| model_type | str | Yes | Model identifier |
| principle | str | Yes | Principle category name |
| principle_prompt | str | Yes | Full system prompt text |
| response | str | Yes | Generated completion text |
Outputs
| Name | Type | Description |
|---|---|---|
| JSON file | File | Updated JSON at ./completion_data/{subset}.json with all completions |
| Completion dict | Dict | Keys: model (str), principle (str), custom_system_prompt (str), response (str) |
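The completion dict contract above can be checked mechanically after loading a saved file. The following is a small sketch, not part of the UltraFeedback code: find_malformed and EXPECTED_KEYS are hypothetical names, and the sample examples are invented.

```python
# Keys every completion dict is expected to carry, per the Outputs table.
EXPECTED_KEYS = {"model", "principle", "custom_system_prompt", "response"}


def find_malformed(examples):
    """Return (example_index, completion_index) pairs whose keys differ from EXPECTED_KEYS."""
    bad = []
    for i, example in enumerate(examples):
        for j, completion in enumerate(example.get("completions", [])):
            if set(completion) != EXPECTED_KEYS:
                bad.append((i, j))
    return bad


# Invented sample data: the second completion is missing two keys.
examples = [
    {
        "completions": [
            {"model": "model-a", "principle": "helpfulness",
             "custom_system_prompt": "Be helpful.", "response": "Sure."},
            {"model": "model-b", "principle": "honesty"},
        ]
    },
]

malformed = find_malformed(examples)
```

In practice, examples would come from json.load on the saved file; an empty result means every completion matches the documented shape.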
Usage Examples
HuggingFace Backend Save Pattern
import json
# After generation completes; `subset` and `dataset` come from the generation pipeline
result_path = f"./completion_data/{subset}.json"
with open(result_path, "w") as f:
    json.dump(
        [{k: v for k, v in data.items()} for data in dataset],
        f,
        indent=4
    )
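To try the save pattern in isolation, the sketch below writes a stand-in dataset to a temporary directory and reads it back. The assumptions are stated in the comments: a plain list of dicts stands in for the dataset object, the subset name "demo" is hypothetical, and the sample row is invented.

```python
import json
import os
import tempfile

# Stand-in for the dataset: a list of dict rows (invented sample data).
dataset = [
    {
        "instruction": "Name a prime number.",
        "completions": [
            {
                "model": "model-a",
                "principle": "helpfulness",
                "custom_system_prompt": "Be helpful.",
                "response": "7",
            }
        ],
    }
]

subset = "demo"  # hypothetical subset name

with tempfile.TemporaryDirectory() as out_dir:
    # A temporary directory is used here instead of ./completion_data
    result_path = os.path.join(out_dir, f"{subset}.json")

    # Same serialization call as the save pattern above
    with open(result_path, "w") as f:
        json.dump([{k: v for k, v in data.items()} for data in dataset], f, indent=4)

    # Read the file back to confirm the rows survive the round trip
    with open(result_path) as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == dataset
```

Because every value in a row is a string, list, or dict, the reloaded data compares equal to the original rows.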