Implementation:OpenBMB UltraFeedback Model Pool Sampling

Knowledge Sources	UltraFeedback
Domains	NLP, Data_Construction
Last Updated	2023-10-02 00:00 GMT

Overview

Concrete tool for randomly assigning one model from a 17-model pool to each instruction in the UltraFeedback dataset.

Description

The sampling.py module defines a global model_pool list containing 17 model identifiers and uses random.sample within a HuggingFace Dataset .map() call to assign one randomly selected model to each instruction. It also initializes an empty completions list on each example to be populated by the downstream generation step.
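The per-example transformation can be sketched without the datasets dependency. This is a minimal sketch, assuming plain Python dicts stand in for Dataset rows; assign_model, the seeded random.Random instance, and the two sample instructions are illustrative and not part of sampling.py.

```python
import random

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b",
]


def assign_model(example, rng):
    """Return a copy of the example with one sampled model and empty completions."""
    # rng.sample(..., 1) returns a one-element list, matching the "models" field shape.
    return {**example, "models": rng.sample(model_pool, 1), "completions": []}


# Assumption: a fixed seed, used here only to make the sketch repeatable.
rng = random.Random(0)

# Hypothetical instructions, for illustration only.
examples = [
    {"instruction": "Summarize this text."},
    {"instruction": "Translate to French."},
]

assigned = [assign_model(ex, rng) for ex in examples]
for row in assigned:
    print(row["instruction"], "->", row["models"])
```

Each resulting row keeps its original fields and gains a single-element models list plus an empty completions list.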

Usage

Run this module as a standalone script to assign models to an existing instruction JSON file. It reads ./completion_data/{subset}.json and overwrites the same file in place, adding models and completions fields to each example.

Code Reference

Source Location

Repository: UltraFeedback
File: src/comparison_data_generation/sampling.py (Lines 7-25)

Signature

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b"
]

# Sampling logic (in __main__):
dataset = dataset.map(
    lambda x: {"models": random.sample(model_pool, 1), "completions": []},
    desc=subset
)

Import

from sampling import model_pool
# Also used internally: random, json, pandas, datasets

I/O Contract

Inputs

Name	Type	Required	Description
model_pool	List[str]	Yes	List of 17 model identifier strings
dataset	datasets.Dataset	Yes	HuggingFace Dataset with instruction field
subset	str	Yes	Dataset subset name (used for progress bar description)

Outputs

Name	Type	Description
dataset	datasets.Dataset	Same dataset with added fields: models (List[str] of length 1) and completions (empty List)
JSON file	File	Updated JSON file written back to ./completion_data/{subset}.json
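The output contract above can be checked mechanically. The following is a hedged sketch: check_record is a hypothetical helper that does not exist in the repository, the pool is abbreviated, and the sample records are made up for illustration.

```python
from typing import Any, Dict, List


def check_record(record: Dict[str, Any], pool: List[str]) -> bool:
    """Return True if the record has the documented models and completions fields."""
    models = record.get("models")
    completions = record.get("completions")
    # "models" must be a list holding exactly one identifier from the pool.
    has_one_model = (
        isinstance(models, list) and len(models) == 1 and models[0] in pool
    )
    # "completions" must be an empty list at this stage.
    has_empty_completions = isinstance(completions, list) and len(completions) == 0
    return has_one_model and has_empty_completions


# Abbreviated pool, for illustration only.
pool = ["gpt-4", "gpt-3.5-turbo", "bard"]

good = {"instruction": "Explain recursion.", "models": ["bard"], "completions": []}
bad = {"instruction": "Explain recursion.", "models": ["bard", "gpt-4"], "completions": []}

print(check_record(good, pool), check_record(bad, pool))
```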

Usage Examples

Standalone Sampling Script

from datasets import Dataset
import pandas as pd
import random
import json

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b"
]

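subset = "sharegpt"
# Load the instructions. lines=True expects JSON Lines (one object per line);
# drop it if the file is a single JSON array, which is what the save step below writes.
df = pd.read_json(f"./completion_data/{subset}.json", lines=True)
dataset = Dataset.from_pandas(df)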

# Assign one random model and initialize empty completions
dataset = dataset.map(
    lambda x: {"models": random.sample(model_pool, 1), "completions": []},
    desc=subset
)

# Save back to the same path as one indented JSON array (overwrites the input file)
with open(f"./completion_data/{subset}.json", "w") as f:
    json.dump([{k: v for k, v in data.items()} for data in dataset], f, indent=4)
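Because json.dump writes the records as a single JSON array, the saved file can be read back with json.load. The sketch below assumes a temporary directory and two made-up records in place of the real ./completion_data/{subset}.json; it uses only the standard library.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records shaped like the script's output.
records = [
    {"instruction": "Name a prime number.", "models": ["gpt-4"], "completions": []},
    {"instruction": "Define entropy.", "models": ["starchat"], "completions": []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sharegpt.json"  # stand-in for the real output path

    # Write the way the script does: one indented JSON array.
    with open(path, "w") as f:
        json.dump(records, f, indent=4)

    # Read it back as a JSON array (not JSON Lines).
    with open(path) as f:
        loaded = json.load(f)

print(len(loaded), loaded[0]["models"])
```

A downstream step that expects JSON Lines would need to convert the array first, since the indented array spans multiple lines per record.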

Related Pages

Implements Principle

Principle:OpenBMB_UltraFeedback_Model_Sampling
