Implementation:OpenBMB UltraFeedback Model Pool Sampling

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Construction
Last Updated 2023-10-02 00:00 GMT

Overview

Concrete tool that assigns one randomly chosen model from a 17-model pool to each instruction in the UltraFeedback dataset.

Description

The sampling.py module defines a global model_pool list of 17 model identifiers and calls random.sample(model_pool, 1) inside a HuggingFace Dataset .map() call, so each instruction receives a one-element list holding a single randomly selected model. It also initializes an empty completions list on each example, which the downstream generation step populates.
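The per-example behavior described above can be sketched without the datasets dependency. This is a minimal illustration, not the module's own code: assign_model is a hypothetical helper that mimics how .map() merges the returned fields into each example, and the rng parameter is an assumption added so the sketch can be run with a fixed seed.

```python
import random

# Pool copied from the Signature section of this page.
model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b",
]


def assign_model(example, rng=random):
    """Hypothetical helper mirroring the lambda passed to Dataset.map().

    random.sample(pool, 1) returns a list with one element, so "models"
    is always a list, never a bare string.
    """
    new_fields = {"models": rng.sample(model_pool, 1), "completions": []}
    # Dataset.map() merges returned fields into the example; do the same here.
    return {**example, **new_fields}


example = {"instruction": "Explain gradient descent in one sentence."}
assigned = assign_model(example, random.Random(0))
print(assigned["models"], assigned["completions"])
```

Because models is a list, downstream code that expects a single name has to index it (for example assigned["models"][0]).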

Usage

Run this module as a standalone script to assign models to an existing instruction JSON file. It reads ./completion_data/{subset}.json and overwrites that same file, adding models and completions fields to each example.

Code Reference

Source Location

  • Repository: UltraFeedback
  • File: src/comparison_data_generation/sampling.py (Lines 7-25)

Signature

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b"
]

# Sampling logic (in __main__):
dataset = dataset.map(
    lambda x: {"models": random.sample(model_pool, 1), "completions": []},
    desc=subset
)

Import

from sampling import model_pool
# Also used internally: random, json, pandas, datasets

I/O Contract

Inputs

Name | Type | Required | Description
model_pool | List[str] | Yes | List of 17 model identifier strings
dataset | datasets.Dataset | Yes | HuggingFace Dataset with instruction field
subset | str | Yes | Dataset subset name (used for progress bar description)

Outputs

Name | Type | Description
dataset | datasets.Dataset | Same dataset with added fields: models (List[str] of length 1) and completions (empty List)
JSON file | File | Updated JSON file written back to ./completion_data/{subset}.json
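The output contract above can be checked with a small validator. This is a sketch under stated assumptions: check_output_record is a hypothetical helper, not part of sampling.py, and it operates on plain dicts such as those loaded from the written JSON file.

```python
def check_output_record(record, pool):
    """Hypothetical check that one record satisfies the output contract:
    "models" is a list of exactly one name drawn from the pool, and
    "completions" is an empty list awaiting the generation step.
    """
    models = record.get("models")
    completions = record.get("completions")
    return (
        isinstance(models, list)
        and len(models) == 1
        and models[0] in pool
        and completions == []
    )


# A shortened pool keeps the example small; the real pool has 17 entries.
pool = ["gpt-4", "vicuna-33b", "pythia-12b"]

good = {"instruction": "Hi", "models": ["gpt-4"], "completions": []}
bad = {"instruction": "Hi", "models": "gpt-4", "completions": []}

print(check_output_record(good, pool))
print(check_output_record(bad, pool))
```

The second record fails because models is a bare string rather than a one-element list, which is the mistake the contract is most likely to catch.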

Usage Examples

Standalone Sampling Script

from datasets import Dataset
import pandas as pd
import random
import json

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b"
]

subset = "sharegpt"
path = f"./completion_data/{subset}.json"

# The input is read as JSON Lines (one JSON object per line)
df = pd.read_json(path, lines=True)
dataset = Dataset.from_pandas(df)

# Assign one random model and initialize empty completions
dataset = dataset.map(
    lambda x: {"models": random.sample(model_pool, 1), "completions": []},
    desc=subset
)

# Save back to the same path. This writes a single indented JSON array, not
# JSON Lines, so re-reading the result needs pd.read_json(path) without lines=True.
with open(path, "w") as f:
    json.dump([dict(row) for row in dataset], f, indent=4)
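The script above draws from the global random state without setting a seed, so each run produces a different assignment. If reproducible assignments are wanted, one option is a dedicated seeded generator. The sketch below is an assumption, not something the original script does: sample_assignments and the seed value 42 are hypothetical, and Counter is used only to inspect how evenly the pool is covered.

```python
import random
from collections import Counter

model_pool = [
    "gpt-4", "gpt-3.5-turbo", "bard",
    "ultralm-65b", "wizardlm-30b", "vicuna-33b", "llama-2-70b-chat",
    "ultralm-13b", "wizardlm-13b", "llama-2-13b-chat",
    "wizardlm-7b", "alpaca-7b", "llama-2-7b-chat",
    "falcon-40b-instruct", "starchat", "mpt-30b-chat", "pythia-12b",
]


def sample_assignments(n_examples, seed):
    """Hypothetical helper: draw one model per example from a seeded RNG."""
    rng = random.Random(seed)  # isolated generator; global state is untouched
    return [rng.sample(model_pool, 1)[0] for _ in range(n_examples)]


first = sample_assignments(1700, seed=42)
second = sample_assignments(1700, seed=42)
counts = Counter(first)

# Same seed, same sequence of draws.
print(first == second)
# Uniform sampling should give each model roughly n_examples / 17 draws.
for name, count in counts.most_common(3):
    print(name, count)
```

A seeded random.Random instance could be closed over by the function passed to .map() in the same way, as long as the map runs in a single process.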

Related Pages

Implements Principle
