Implementation:OpenBMB UltraFeedback Model Pool Sampling
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A concrete tool that randomly assigns one model from a 17-model pool to each instruction in the UltraFeedback dataset.
Description
The sampling.py module defines a module-level model_pool list of 17 model identifiers. Inside a HuggingFace Dataset .map() call, it uses random.sample(model_pool, 1) to assign one randomly selected model to each instruction, stored as a one-element list in the models field. It also initializes an empty completions list on each example, to be populated by the downstream generation step.
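The per-example assignment described above can be sketched without the datasets library. This is a minimal illustration, not the module's code: assign_models and example_pool are hypothetical names, and plain dictionaries stand in for Dataset rows.

```python
import random

# Hypothetical three-model pool for illustration; the real module lists 17 identifiers.
example_pool = ["gpt-4", "vicuna-33b", "alpaca-7b"]

def assign_models(examples, pool, rng=random):
    """Return copies of the examples with `models` and `completions` added."""
    assigned = []
    for example in examples:
        record = dict(example)                  # shallow copy; the input is left untouched
        record["models"] = rng.sample(pool, 1)  # one-element list, as in the .map() lambda
        record["completions"] = []              # filled in later by the generation step
        assigned.append(record)
    return assigned

examples = [{"instruction": "Summarize this text."}, {"instruction": "Translate to French."}]
assigned = assign_models(examples, example_pool)
print(assigned[0]["models"], assigned[0]["completions"])
```

Each example draws independently, so the same model can be assigned to many instructions.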
Usage
Run this module as a standalone script to assign models to an existing instruction JSON file. It reads from and writes back to the same file, adding models and completions fields to each example.
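The read-modify-write cycle on a single file might look like the sketch below. It assumes the file holds a JSON array of records and uses only the standard library. The function name assign_models_in_file and the temporary demo file are hypothetical, not part of the repository.

```python
import json
import os
import random
import tempfile

def assign_models_in_file(path, pool, rng=random):
    """Load a JSON array of examples, add the two fields, and overwrite the file."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for record in records:
        record["models"] = rng.sample(pool, 1)  # one randomly chosen model, as a list
        record["completions"] = []              # placeholder for later generations
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)
    return len(records)

# Demonstration on a throwaway file rather than a real ./completion_data file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_subset.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump([{"instruction": "Explain recursion."}], f)

count = assign_models_in_file(demo_path, ["gpt-4", "bard"])
with open(demo_path, encoding="utf-8") as f:
    updated = json.load(f)
print(count, sorted(updated[0].keys()))
```

Because the file is overwritten in place, running the script again replaces any earlier assignment and resets completions to empty. Keeping a backup copy is a reasonable precaution.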
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/sampling.py (Lines 7-25)
Signature
model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b"
]

# Sampling logic (in __main__):
dataset = dataset.map(
    lambda x: {"models": random.sample(model_pool, 1), "completions": []},
    desc=subset
)
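Two properties of the sampling call are worth illustrating. random.sample(pool, 1) returns a one-element list rather than a bare string, and the call shown above is not seeded, so assignments differ between runs. The sketch below uses a shortened pool and a locally seeded generator. The seeding is an assumption added for reproducibility, not something the source module does.

```python
import random

pool = ["gpt-4", "gpt-3.5-turbo", "bard"]  # shortened pool, for illustration only

rng = random.Random(0)         # local, seeded generator; the seed value is arbitrary
as_list = rng.sample(pool, 1)  # a list containing one identifier
as_str = rng.choice(pool)      # a bare string, not wrapped in a list

# Re-creating the generator with the same seed reproduces the same first draw.
repeat = random.Random(0).sample(pool, 1)
print(type(as_list).__name__, type(as_str).__name__, as_list == repeat)
```

Keeping the list form means the models field has the same type whether one model or several are sampled per instruction.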
Import
from sampling import model_pool
# Also used internally: random, json, pandas, datasets
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_pool | List[str] | Yes | List of 17 model identifier strings |
| dataset | datasets.Dataset | Yes | HuggingFace Dataset with instruction field |
| subset | str | Yes | Dataset subset name (used for progress bar description) |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | Same dataset with added fields: models (List[str] of length 1) and completions (empty List) |
| JSON file | File | Updated JSON file written back to ./completion_data/{subset}.json |
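The output contract in the table above can be checked with a small validator before the generation step runs. This is a hypothetical helper, check_record, which is not part of the repository. It only encodes the two stated guarantees: models is a list of length 1 drawn from the pool, and completions is an empty list.

```python
def check_record(record, pool):
    """Return a list of contract violations for one output record (empty if none)."""
    problems = []
    models = record.get("models")
    if not isinstance(models, list) or len(models) != 1:
        problems.append("models must be a list of length 1")
    elif models[0] not in pool:
        problems.append(f"unknown model: {models[0]!r}")
    if record.get("completions") != []:
        problems.append("completions must be an empty list")
    return problems

pool = ["gpt-4", "bard"]  # shortened pool, for illustration only
good = {"instruction": "Hi", "models": ["bard"], "completions": []}
bad = {"instruction": "Hi", "models": "bard"}  # wrong type, and completions is missing
print(check_record(good, pool))
print(check_record(bad, pool))
```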
Usage Examples
Standalone Sampling Script
from datasets import Dataset
import pandas as pd
import random
import json

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b"
]

subset = "sharegpt"
path = f"./completion_data/{subset}.json"

# Load the instruction file. This assumes a JSON array of records, which is the
# format written back below; a JSON Lines input would instead need
# pd.read_json(path, lines=True).
with open(path) as f:
    dataset = Dataset.from_pandas(pd.DataFrame(json.load(f)))

# Assign one random model and initialize empty completions
dataset = dataset.map(
    lambda x: {"models": random.sample(model_pool, 1), "completions": []},
    desc=subset
)

# Save back to the same JSON file
with open(path, "w") as f:
    json.dump([dict(data) for data in dataset], f, indent=4)
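After sampling, it can be useful to check how evenly the pool was used. The sketch below counts assignments per model over synthetic records. The record list, the shortened pool, and the seed are all illustrative assumptions. In practice the records would be loaded from the saved JSON file.

```python
import random
from collections import Counter

pool = ["gpt-4", "bard", "alpaca-7b"]  # shortened pool, for illustration only
rng = random.Random(42)                # seeded only so the demo is repeatable

# Stand-in for the saved records: 3000 examples, each with a one-element `models` list.
records = [{"models": rng.sample(pool, 1), "completions": []} for _ in range(3000)]

# Tally the single assigned model of every record.
counts = Counter(record["models"][0] for record in records)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

With uniform sampling, each model should receive roughly len(records) / len(pool) instructions. Large deviations on a big subset would suggest a problem with the pool or the sampling call.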