Implementation:Turboderp org Exllamav2 Get Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, NLP, Utilities |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Utilities from exllamav2's example code for loading HuggingFace datasets with local JSONL caching and for formatting prompts with model-specific chat templates.
Description
The get_dataset() function wraps HuggingFace's datasets.load_dataset() with a local caching layer. It first checks for a cached JSONL file at a conventional path; if found, it reads from cache. Otherwise, it downloads the dataset from HuggingFace, converts it to a list of dicts, and writes a JSONL cache file for future use.
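The cache-or-download flow described above can be sketched as follows. This is a minimal illustration, not the code from examples/util.py: the helper name `load_with_jsonl_cache` and its `fetch_rows` callback are hypothetical, introduced so the pattern can run without network access.

```python
import json
import os
from typing import Callable, Iterable


def load_with_jsonl_cache(cache_path: str, fetch_rows: Callable[[], Iterable[dict]]) -> list:
    """Return rows from cache_path if it exists; otherwise fetch, cache, and return them."""
    if os.path.exists(cache_path):
        # Cache hit: the file holds one JSON object per line
        with open(cache_path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    # Cache miss: materialize the fetched rows as a list of plain dicts
    rows = [dict(row) for row in fetch_rows()]

    # Write the cache so the next call can skip the download
    cache_dir = os.path.dirname(cache_path)
    if cache_dir:
        os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows
```

For a HuggingFace dataset, the callback would plausibly be something like `lambda: datasets.load_dataset(ds_name, category, split=split)`, since iterating the returned dataset yields one dict per row.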
The format_prompt() function formats a system prompt and user prompt into a model-specific chat template string. It supports several common formats:
- "llama" - LLaMA/LLaMA 2 instruction format with [INST] tags
- "llama3" - LLaMA 3 format with role-based headers
- "granite" - IBM Granite format with <|start_of_role|> tags
- "chatml" - ChatML format with <|im_start|>/<|im_end|> tags
- "gemma" - Google Gemma format with <start_of_turn> tags
These functions are part of the example utilities and are not part of the installable exllamav2 package. They are designed to be copied or adapted into user code.
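When adapting the formatter, a table-driven version is one way to keep the templates easy to extend. The sketch below covers only two of the formats, and its template strings are illustrative: the exact strings in examples/util.py may differ, and BOS handling is assumed to be left to the tokenizer. The names `TEMPLATES` and `format_prompt_sketch` are hypothetical.

```python
# Illustrative templates only; the strings used by examples/util.py may differ.
TEMPLATES = {
    "llama": "[INST] <<SYS>>\n{sp}\n<</SYS>>\n\n{p} [/INST]",
    "chatml": (
        "<|im_start|>system\n{sp}<|im_end|>\n"
        "<|im_start|>user\n{p}<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
}


def format_prompt_sketch(prompt_format: str, sp: str, p: str) -> str:
    """Insert the system prompt (sp) and user prompt (p) into the named template."""
    try:
        template = TEMPLATES[prompt_format]
    except KeyError:
        # Fail loudly instead of silently returning an unformatted prompt
        raise ValueError(f"Unknown prompt format: {prompt_format!r}") from None
    return template.format(sp=sp, p=p)
```

Raising on an unknown format name is a design choice of this sketch; it surfaces typos such as "chat-ml" before any tokens are generated.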
Usage
Use these utilities when running bulk inference benchmarks or evaluation scripts from the exllamav2 examples directory. For production use, adapt the caching and formatting patterns into your own codebase.
Code Reference
Source Location
- Repository: exllamav2
- File: examples/util.py
- Lines: L51-72 (get_dataset), L4-37 (format_prompt)
Signature
def get_dataset(
    ds_name: str,
    category: str,
    split: str
) -> list:
    ...

def format_prompt(
    prompt_format: str,
    sp: str,
    p: str
) -> str:
    ...
Import
# From examples/util.py (not installable; copy pattern into your code)
from util import get_dataset, format_prompt
I/O Contract
Inputs (get_dataset)
| Name | Type | Required | Description |
|---|---|---|---|
| ds_name | str | Yes | HuggingFace dataset name (e.g., "cais/mmlu", "gsm8k") |
| category | str or None | Yes | Dataset configuration/subset name, or None for datasets without subsets |
| split | str | Yes | Dataset split to load (e.g., "test", "train", "validation") |
Outputs (get_dataset)
| Name | Type | Description |
|---|---|---|
| dataset | list | List of dictionaries, each representing one row from the dataset. Cached locally as a JSONL file at data/{ds_name}_{category}_{split}.jsonl |
Inputs (format_prompt)
| Name | Type | Required | Description |
|---|---|---|---|
| prompt_format | str | Yes | One of "llama", "llama3", "granite", "chatml", "gemma" specifying the chat template to use |
| sp | str | Yes | System prompt text |
| p | str | Yes | User prompt text |
Outputs (format_prompt)
| Name | Type | Description |
|---|---|---|
| formatted_prompt | str | Fully formatted prompt string ready for tokenization, with system and user content inserted into the appropriate chat template |
Dependencies
- datasets (HuggingFace) - For downloading datasets from the HuggingFace hub
- json - For JSONL serialization and deserialization
- os - For file path operations and cache file existence checks
Usage Examples
Basic Dataset Loading
from util import get_dataset, format_prompt
# Load MMLU test set (anatomy subset)
dataset = get_dataset("cais/mmlu", "anatomy", "test")
print(f"Loaded {len(dataset)} examples")
print(dataset[0]) # First example as a dict
Formatting Prompts for Bulk Inference
# Format each dataset entry for a ChatML-compatible model
system_prompt = "You are a helpful assistant. Answer the question concisely."
formatted_prompts = []
for row in dataset:
    question = row["question"]
    prompt = format_prompt("chatml", system_prompt, question)
    formatted_prompts.append(prompt)
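For multiple-choice sets, the question text alone omits the answer options. The sketch below assumes the row layout of cais/mmlu, where each row carries a `question` string and a `choices` list of four options; the helper name `mmlu_question_text` and the closing instruction line are hypothetical and should be adapted to the evaluation at hand.

```python
def mmlu_question_text(row: dict) -> str:
    """Render a question and its lettered options as a single user prompt."""
    letters = "ABCD"
    lines = [row["question"].strip()]
    # Pair each option with a letter: A, B, C, D
    for letter, choice in zip(letters, row["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```

The returned string can be passed as the user prompt `p` to format_prompt() in place of `row["question"]`.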
Complete Bulk Inference Pipeline
from util import get_dataset, format_prompt
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2Cache
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob, ExLlamaV2Sampler

# Placeholder path: point this at your local model directory
model_dir = "/path/to/model"

# Load model
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Load dataset
dataset = get_dataset("cais/mmlu", "anatomy", "test")

# Enqueue all prompts as jobs
gen_settings = ExLlamaV2Sampler.Settings(temperature=0.1)
for i, row in enumerate(dataset):
    prompt = format_prompt("chatml", "Answer concisely.", row["question"])
    # Encode template markers such as <|im_start|> as special tokens, not plain text
    input_ids = tokenizer.encode(prompt, encode_special_tokens=True)
    job = ExLlamaV2DynamicJob(
        input_ids=input_ids,
        max_new_tokens=200,
        gen_settings=gen_settings,
        stop_conditions=[tokenizer.eos_token_id],
        identifier=i
    )
    generator.enqueue(job)

# Collect results; only results flagged "eos" carry the finished completion
results = {}
while generator.num_remaining_jobs() > 0:
    for result in generator.iterate():
        if result["eos"]:
            results[result["identifier"]] = result["full_completion"]
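Jobs may finish in any order, so `results` is keyed by the identifier rather than ordered. One way to persist the completions in dataset order is sketched below; the helper name `save_completions` and the output record fields are hypothetical, and the output reuses the JSONL convention of the dataset cache.

```python
import json


def save_completions(path: str, dataset: list, results: dict) -> int:
    """Write one JSON line per dataset row that has a completion; return the count written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for i, row in enumerate(dataset):
            if i not in results:
                continue  # no completion was collected for this row
            record = {
                "index": i,
                "question": row.get("question"),
                "completion": results[i],
            }
            f.write(json.dumps(record) + "\n")
            written += 1
    return written
```

Comparing the returned count against `len(dataset)` gives a quick check that every enqueued job produced a completion.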