Implementation:OpenBMB UltraFeedback Completion Storage

Knowledge Sources	UltraFeedback
Domains	NLP, Data_Construction
Last Updated	2023-10-02 00:00 GMT

Overview

Concrete logic for appending each generated completion to its dataset example and persisting the updated dataset as a JSON file.

Description

The completion storage logic differs between the two backends:

HuggingFace backend (main.py:L215-222, L253-257): Within instruction_completion, each completion is appended to example["completions"] as a dictionary with the keys model, principle, custom_system_prompt, and response. After generation, the dataset is serialized with json.dump (indent=4) to the same path it was loaded from, so the input file is overwritten in place.

vLLM backend (main_vllm.py:L185-190, L230-232): Responses are first added as a column with dataset.add_column("response", responses), then merged into the completions list with dataset.map, using a lambda that sets the response key on the last completion entry and leaves earlier entries unchanged. The temporary columns (prompt, response) are removed before saving.
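
The vLLM merge step can be illustrated without the datasets library. The following is a minimal sketch that applies the same per-row update to plain dictionaries. The helper name merge_response and the sample row are hypothetical, and the sketch assumes each row already ends with a completion entry that is still missing its response.

```python
from typing import Any, Dict


def merge_response(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` with row["response"] folded into the last completion.

    Hypothetical helper: mirrors the lambda passed to dataset.map, then drops
    the temporary "prompt" and "response" columns.
    """
    *earlier, last = row["completions"]
    # dict(last, response=...) builds a new dict, so the input row is not mutated.
    merged_last = dict(last, response=row["response"])
    out = {k: v for k, v in row.items() if k not in ("prompt", "response")}
    out["completions"] = earlier + [merged_last]
    return out


# Illustrative row (made-up values), shaped like one dataset example mid-pipeline.
rows = [
    {
        "instruction": "Say hi.",
        "prompt": "Be honest. USER: Say hi.",
        "response": "Hello!",
        "completions": [
            {"model": "model-a", "principle": "helpfulness",
             "custom_system_prompt": "Be helpful.", "response": "Hi."},
            {"model": "model-b", "principle": "honesty",
             "custom_system_prompt": "Be honest."},
        ],
    }
]

merged_rows = [merge_response(r) for r in rows]
```

Only the last entry changes; earlier completions from other models are carried over as they were.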

Usage

This logic runs automatically as part of the generation pipeline. No separate invocation is needed.

Code Reference

Source Location

Repository: UltraFeedback
File: src/comparison_data_generation/main.py (Lines 215-222 for append, Lines 253-257 for save)
File: src/comparison_data_generation/main_vllm.py (Lines 185-190 for merge, Lines 230-232 for save)

Signature

# HuggingFace backend: inline append (main.py:L215-222)
example["completions"].append({
    "model": model_type,
    "principle": principle,
    "custom_system_prompt": principle_prompt,
    "response": response
})

# HuggingFace backend: save (main.py:L253-257)
result_path = load_path
with open(result_path, "w") as f:
    json.dump([{k: v for k, v in data.items()} for data in dataset], f, indent=4)

# vLLM backend: merge + save (main_vllm.py:L185-190, L230-232)
dataset = dataset.add_column("response", responses)
dataset = dataset.map(lambda x: {
    "completions": x["completions"][:-1] + [
        dict(x["completions"][-1], **{"response": x["response"]})
    ]
})
dataset = dataset.remove_columns(["prompt", "response"])

with open(result_path, "w") as f:
    json.dump([{k: v for k, v in data.items()} for data in dataset], f, indent=4)

Import

import json
import os

I/O Contract

Inputs

Name	Type	Required	Description
example["completions"]	List[Dict]	Yes	Existing completions list to append to
model_type	str	Yes	Model identifier
principle	str	Yes	Principle category name
principle_prompt	str	Yes	Full system prompt text
response	str	Yes	Generated completion text

Outputs

Name	Type	Description
JSON file	File	Updated JSON at ./completion_data/{subset}.json with all completions
Completion dict	Dict	Keys: model (str), principle (str), custom_system_prompt (str), response (str)
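
To make the contract above concrete, here is a small sketch that builds one completion dict from the listed inputs and checks its shape. The names make_completion, is_valid_completion, and COMPLETION_KEYS are hypothetical and do not appear in the UltraFeedback source; the sample values are made up.

```python
from typing import Any, Dict

# Keys listed in the Outputs table for a single completion entry.
COMPLETION_KEYS = ("model", "principle", "custom_system_prompt", "response")


def make_completion(model_type: str, principle: str,
                    principle_prompt: str, response: str) -> Dict[str, str]:
    """Hypothetical helper: assemble one completion entry from the four inputs."""
    return {
        "model": model_type,
        "principle": principle,
        "custom_system_prompt": principle_prompt,
        "response": response,
    }


def is_valid_completion(entry: Dict[str, Any]) -> bool:
    """Check that an entry has exactly the expected keys, all with str values."""
    return (set(entry) == set(COMPLETION_KEYS)
            and all(isinstance(entry[k], str) for k in COMPLETION_KEYS))


# Illustrative example record with an empty completions list.
example = {"instruction": "Name a prime number.", "completions": []}
example["completions"].append(
    make_completion("model-a", "helpfulness", "Be helpful.", "7 is prime.")
)
```

A check like is_valid_completion could be run over every entry before saving, to catch entries that were appended without a response.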

Usage Examples

HuggingFace Backend Save Pattern

import json

# After generation completes
result_path = f"./completion_data/{subset}.json"
with open(result_path, "w") as f:
    json.dump(
        [{k: v for k, v in data.items()} for data in dataset],
        f,
        indent=4
    )
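
Save and Reload Round Trip

Because the save overwrites the file it was loaded from, it can be useful to confirm that the written JSON reloads to the same records. The sketch below does this in a temporary directory, with a plain list of dicts standing in for the dataset; the subset name and record contents are made up for illustration.

```python
import json
import os
import tempfile

# Plain list of dicts standing in for the dataset (illustrative values).
dataset = [
    {
        "instruction": "Name a prime number.",
        "completions": [
            {"model": "model-a", "principle": "helpfulness",
             "custom_system_prompt": "Be helpful.", "response": "7 is prime."},
        ],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    # Mirrors the ./completion_data/{subset}.json layout under a temp root.
    result_path = os.path.join(tmp, "completion_data", "demo_subset.json")
    os.makedirs(os.path.dirname(result_path), exist_ok=True)

    # Same serialization pattern as the save step: a list of row dicts, indent=4.
    with open(result_path, "w") as f:
        json.dump([dict(data) for data in dataset], f, indent=4)

    # Reload and compare against the in-memory records.
    with open(result_path) as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == dataset
```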

Related Pages

Implements Principle

Principle:OpenBMB_UltraFeedback_Result_Collection
