Implementation:Turboderp org Exllamav2 Get Dataset

From Leeroopedia
Knowledge Sources
Domains Data_Loading, NLP, Utilities
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tools from exllamav2's example utilities for loading HuggingFace datasets with a local JSONL cache and for formatting prompts with model-specific chat templates.

Description

The get_dataset() function wraps HuggingFace's datasets.load_dataset() with a local caching layer. It first checks for a cached JSONL file at data/{ds_name}_{category}_{split}.jsonl. If the file exists, the rows are read from it. Otherwise, the function downloads the dataset from HuggingFace, converts it to a list of dicts, and writes the JSONL cache file for future calls.
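The check-cache-then-download flow described above can be sketched as follows. This is a minimal illustration, not the code in examples/util.py: the helper name load_with_jsonl_cache and the fetch_rows callback are hypothetical, introduced so the caching logic can be shown without depending on a network download.

```python
import json
import os
from typing import Callable, Iterable


def load_with_jsonl_cache(cache_path: str, fetch_rows: Callable[[], Iterable[dict]]) -> list:
    """Return rows from cache_path if it exists; otherwise fetch, cache, and return them.

    Hypothetical helper illustrating the pattern; not the exllamav2 implementation.
    """
    if os.path.exists(cache_path):
        # Cache hit: the file holds one JSON object per line
        with open(cache_path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    # Cache miss: materialize the fetched rows as plain dicts
    rows = [dict(row) for row in fetch_rows()]

    # Create the cache directory if needed, then write one JSON object per line
    cache_dir = os.path.dirname(cache_path)
    if cache_dir:
        os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows
```

With the HuggingFace datasets package installed, fetch_rows could plausibly be lambda: datasets.load_dataset(ds_name, category, split=split), since iterating a loaded split yields one dict per row.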

The format_prompt() function formats a system prompt and user prompt into a model-specific chat template string. It supports several common formats:

  • "llama" - LLaMA/LLaMA 2 instruction format with [INST] tags
  • "llama3" - LLaMA 3 format with role-based headers
  • "granite" - IBM Granite format with <|start_of_role|> tags
  • "chatml" - ChatML format with <|im_start|>/<|im_end|> tags
  • "gemma" - Google Gemma format with <start_of_turn> tags

These functions are part of the example utilities and are not part of the installable exllamav2 package. They are designed to be copied or adapted into user code.
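For readers adapting the formatting pattern, the sketch below shows one way to express two of the formats listed above as lookup-table templates. The template strings follow the publicly documented ChatML and LLaMA 3 single-turn layouts; the exact whitespace and token handling in examples/util.py may differ, and the names TEMPLATES and format_prompt_sketch are hypothetical.

```python
# Single-turn templates based on the publicly documented formats.
# Assumption: the strings in examples/util.py may differ in whitespace or BOS handling.
TEMPLATES = {
    "chatml": (
        "<|im_start|>system\n{sp}<|im_end|>\n"
        "<|im_start|>user\n{p}<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    "llama3": (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n{sp}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n{p}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
}


def format_prompt_sketch(prompt_format: str, sp: str, p: str) -> str:
    """Insert a system prompt (sp) and user prompt (p) into the chosen template."""
    try:
        template = TEMPLATES[prompt_format]
    except KeyError:
        raise ValueError(f"Unknown prompt format: {prompt_format!r}") from None
    # str.format only parses braces in the template, so braces inside sp or p are kept as-is
    return template.format(sp=sp, p=p)
```

A table of templates keeps each format in one place, so supporting another model family means adding an entry rather than another branch.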

Usage

Use these utilities when running bulk inference benchmarks or evaluation scripts from the exllamav2 examples directory. For production use, adapt the caching and formatting patterns into your own codebase.

Code Reference

Source Location

  • Repository: exllamav2
  • File: examples/util.py
  • Lines: L51-72 (get_dataset), L4-37 (format_prompt)

Signature

def get_dataset(
    ds_name: str,
    category: str,
    split: str
) -> list:
    ...

def format_prompt(
    prompt_format: str,
    sp: str,
    p: str
) -> str:
    ...

Import

# From examples/util.py (not installable; copy pattern into your code)
from util import get_dataset, format_prompt

I/O Contract

Inputs (get_dataset)

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| ds_name | str | Yes | HuggingFace dataset name (e.g., "cais/mmlu", "gsm8k") |
| category | str | Yes | Dataset configuration/subset name, or None for datasets without subsets |
| split | str | Yes | Dataset split to load (e.g., "test", "train", "validation") |

Outputs (get_dataset)

| Name | Type | Description |
| --- | --- | --- |
| dataset | list | List of dictionaries, each representing one row from the dataset. Cached locally as a JSONL file at data/{ds_name}_{category}_{split}.jsonl |

Inputs (format_prompt)

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| prompt_format | str | Yes | One of "llama", "llama3", "granite", "chatml", "gemma", specifying the chat template to use |
| sp | str | Yes | System prompt text |
| p | str | Yes | User prompt text |

Outputs (format_prompt)

| Name | Type | Description |
| --- | --- | --- |
| formatted_prompt | str | Fully formatted prompt string ready for tokenization, with system and user content inserted into the appropriate chat template |

Dependencies

  • datasets (HuggingFace) - For downloading datasets from the HuggingFace hub
  • json - For JSONL serialization and deserialization
  • os - For file path operations and cache file existence checks

Usage Examples

Basic Dataset Loading

from util import get_dataset, format_prompt

# Load MMLU test set (anatomy subset)
dataset = get_dataset("cais/mmlu", "anatomy", "test")
print(f"Loaded {len(dataset)} examples")
print(dataset[0])  # First example as a dict

Formatting Prompts for Bulk Inference

# Format each dataset entry for a ChatML-compatible model
system_prompt = "You are a helpful assistant. Answer the question concisely."

formatted_prompts = []
for row in dataset:
    question = row["question"]
    prompt = format_prompt("chatml", system_prompt, question)
    formatted_prompts.append(prompt)
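For multiple-choice benchmarks, the question text alone usually omits the answer options. The sketch below builds a single user prompt from a row, assuming each row carries a "question" string and a "choices" list, as in the cais/mmlu schema. The helper name build_mc_question and the instruction wording are hypothetical.

```python
def build_mc_question(row: dict) -> str:
    """Render a question and its lettered choices as one user prompt.

    Assumes row["question"] is a string and row["choices"] is a list of strings.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [row["question"].strip()]
    # Label each choice A, B, C, ... in the order given by the dataset
    for letter, choice in zip(letters, row["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)
```

The returned string can be passed as the user prompt argument (p) of format_prompt() in place of row["question"].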

Complete Bulk Inference Pipeline

from util import get_dataset, format_prompt
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2Cache
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob, ExLlamaV2Sampler

# Load model (model_dir is a placeholder; point it at your local model directory)
model_dir = "/path/to/model"
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Load dataset
dataset = get_dataset("cais/mmlu", "anatomy", "test")

# Enqueue all prompts as jobs
gen_settings = ExLlamaV2Sampler.Settings(temperature=0.1)
for i, row in enumerate(dataset):
    prompt = format_prompt("chatml", "Answer concisely.", row["question"])
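    # Encode template tags such as <|im_start|> as special tokens rather than literal text
    input_ids = tokenizer.encode(prompt, encode_special_tokens=True)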
    job = ExLlamaV2DynamicJob(
        input_ids=input_ids,
        max_new_tokens=200,
        gen_settings=gen_settings,
        stop_conditions=[tokenizer.eos_token_id],
        identifier=i
    )
    generator.enqueue(job)

# Collect results
results = {}
while generator.num_remaining_jobs() > 0:
    for result in generator.iterate():
        if result["eos"]:
            results[result["identifier"]] = result["full_completion"]
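Because the dynamic generator may finish jobs out of order, the results dict above is keyed by job identifier rather than filled in sequence. A small sketch of putting completions back into dataset order follows. The helper name collect_outputs and the output field names are hypothetical.

```python
def collect_outputs(dataset: list, results: dict) -> list:
    """Pair each dataset row with its completion, in dataset order.

    Assumes results maps the integer identifier used at enqueue time to the completion text.
    """
    outputs = []
    for i, row in enumerate(dataset):
        outputs.append({
            "index": i,
            "question": row.get("question"),
            "completion": results.get(i),  # None if no completion was recorded for this job
        })
    return outputs
```

Each entry is a plain dict, so the list can be written out one JSON object per line in the same way as the dataset cache.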
