Principle:Turboderp org Exllamav2 Dataset Loading
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, NLP, Utilities |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Dataset loading for bulk inference involves fetching structured data from external sources, caching it locally, and formatting prompts according to model-specific chat templates.
Description
Bulk inference workflows require processing large numbers of prompts through a language model. The dataset loading pattern addresses two concerns:
Data Acquisition and Caching: Datasets are fetched from the Hugging Face Hub using the datasets library. To avoid repeated downloads, each dataset is cached locally as a JSONL (JSON Lines) file. On every call, the loading function first checks for the cached file and downloads from the Hub only if the cache is missing.
This caching strategy provides:
- Reproducibility - Every run reads the same cached rows, as long as the cache file is left untouched
- Offline capability - Once cached, no network access is needed
- Speed - Reading a local file is much faster than downloading the dataset again
Prompt Formatting: Raw dataset entries (typically containing a question or instruction) must be formatted into the chat template expected by the target model. Different model families use different prompt formats:
- LLaMA format: Wraps the user turn in [INST] and [/INST] delimiters
- LLaMA 3 format: Starts with <|begin_of_text|> and marks each role with <|start_header_id|> and <|end_header_id|> headers
- Granite format: Marks each role with <|start_of_role|> and <|end_of_role|> delimiters
- ChatML format: Uses <|im_start|> and <|im_end|> delimiters
- Gemma format: Uses <start_of_turn> and <end_of_turn> delimiters
The separation of data loading from prompt formatting is a key design decision. It allows the same dataset to be reused across different model configurations simply by changing the format parameter.
Usage
Use dataset loading when performing bulk inference evaluations, benchmarks, or batch processing tasks. The pattern is especially useful for running the same set of prompts through different models or model configurations for comparison.
Theoretical Basis
Dataset Loading Pipeline:
1. CHECK CACHE:
- Look for local JSONL file at: data/{ds_name}_{category}_{split}.jsonl
- If it exists: load it and return (skip step 2)
2. DOWNLOAD (cache miss only):
- Call datasets.load_dataset(ds_name, category, split=split)
- Convert to list of dicts
- Write to JSONL cache file
3. RETURN:
- List of dicts, each representing one dataset row
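The loading pipeline above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code: the function name load_cached_dataset, the cache_dir parameter, and the injectable fetch callable are hypothetical names introduced here so the cache logic can be exercised without network access. Only the cache path pattern and the datasets.load_dataset call come from the steps above.

```python
import json
import os


def load_cached_dataset(ds_name, category, split, cache_dir="data", fetch=None):
    """Return dataset rows as a list of dicts, backed by a local JSONL cache.

    `fetch` is a hypothetical hook taking (ds_name, category, split) and
    returning an iterable of rows; it defaults to datasets.load_dataset.
    """
    path = os.path.join(cache_dir, f"{ds_name}_{category}_{split}.jsonl")

    # 1. CHECK CACHE: a hit returns immediately, with no network access.
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    # 2. DOWNLOAD: only reached on a cache miss.
    if fetch is None:
        # Imported lazily so cached runs do not require the datasets package.
        from datasets import load_dataset

        def fetch(name, cat, spl):
            return load_dataset(name, cat, split=spl)

    rows = [dict(row) for row in fetch(ds_name, category, split)]

    # ds_name may contain "/" (e.g. "org/name"), so create parent directories.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # 3. RETURN: list of dicts, one per dataset row.
    return rows
```

This sketch assumes every row value is JSON-serializable. Deleting the cache file forces a fresh download on the next call.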
Prompt Formatting Pipeline:
1. SELECT FORMAT:
- Match prompt_format string to template function
2. APPLY TEMPLATE:
- Insert system prompt (sp) and user prompt (p) into format
- Return formatted string ready for tokenization
Example (ChatML format):
Input: sp="You are helpful.", p="What is 2+2?"
Output (one string, shown here split across three lines):
"<|im_start|>system\nYou are helpful.<|im_end|>\n"
"<|im_start|>user\nWhat is 2+2?<|im_end|>\n"
"<|im_start|>assistant\n"
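The select-then-apply steps can be sketched as a dictionary dispatch. The names format_prompt, TEMPLATES, and the two helper functions are hypothetical and chosen for illustration. The ChatML template follows the example above; the LLaMA template uses the common Llama 2 convention with a <<SYS>> block, which is an assumption and may differ from what a given codebase emits.

```python
def _chatml(sp, p):
    # Matches the ChatML example: system turn, user turn, open assistant turn.
    return (
        f"<|im_start|>system\n{sp}<|im_end|>\n"
        f"<|im_start|>user\n{p}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def _llama(sp, p):
    # Assumption: Llama 2 style, with the system prompt inside <<SYS>> tags.
    return f"[INST] <<SYS>>\n{sp}\n<</SYS>>\n\n{p} [/INST]"


# 1. SELECT FORMAT: map each format name to its template function.
TEMPLATES = {
    "chatml": _chatml,
    "llama": _llama,
}


def format_prompt(prompt_format, sp, p):
    """Return a prompt string ready for tokenization."""
    try:
        template = TEMPLATES[prompt_format]
    except KeyError:
        raise ValueError(f"Unknown prompt format: {prompt_format!r}") from None
    # 2. APPLY TEMPLATE: insert system prompt (sp) and user prompt (p).
    return template(sp, p)
```

Supporting another model family then only requires adding one entry to the table; the dataset rows themselves are untouched.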
The JSONL caching format stores one JSON object per line, making it efficient for both sequential reading and appending. Each line represents a complete dataset row that can be parsed independently.
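As a small illustration of those two properties, the sketch below appends rows to a JSONL file and then reads them back one line at a time. The helper names append_row and iter_rows are hypothetical. It relies on json.dumps producing single-line output by default, with newlines inside strings escaped, so each record stays on its own line.

```python
import json


def append_row(path, row):
    # Appending never rewrites existing rows: one JSON object, one new line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def iter_rows(path):
    # Each line is parsed on its own, so the file can be streamed row by row
    # without loading the whole dataset into memory.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```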