Principle:Turboderp org Exllamav2 Dataset Loading

Knowledge Sources	HuggingFace Datasets: A Community Library for Natural Language Processing
Domains	Data_Loading, NLP, Utilities
Last Updated	2026-02-15 00:00 GMT

Overview

Dataset loading for bulk inference involves fetching structured data from external sources, caching it locally, and formatting prompts according to model-specific chat templates.

Description

Bulk inference workflows require processing large numbers of prompts through a language model. The dataset loading pattern addresses two concerns:

Data Acquisition and Caching: Datasets are fetched from the Hugging Face Hub using the datasets library. To avoid repeated downloads, the rows are cached locally as a JSONL (JSON Lines) file. On each call, the loading function checks for the cached file first and downloads from Hugging Face only if the cache is missing.

This caching strategy provides:

Reproducibility - The same cached data is used across runs
Offline capability - Once cached, no network access is needed
Speed - Reading a local file is much faster than downloading the dataset again
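
The cache-first behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name get_dataset, the cache_dir parameter, and the replacement of slashes in hub names are assumptions made for the example. The datasets import is deferred so that a cached run needs neither the library nor network access.

```python
import json
import os


def get_dataset(ds_name, category, split, cache_dir="data"):
    """Return dataset rows as a list of dicts, downloading only on a cache miss."""
    # Assumption: slashes in hub names such as "org/name" are replaced so the
    # cache stays a single flat file inside cache_dir.
    safe_name = ds_name.replace("/", "_")
    cache_path = os.path.join(cache_dir, f"{safe_name}_{category}_{split}.jsonl")

    # Cache hit: parse one JSON object per line and return immediately.
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    # Cache miss: download with the datasets library (imported lazily).
    from datasets import load_dataset

    dataset = load_dataset(ds_name, category, split=split)
    rows = [dict(row) for row in dataset]

    # Write the cache for later runs, then return the rows.
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows
```

The first call for a given dataset, category, and split pays the download cost; every later call with the same arguments reads only the local file.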

Prompt Formatting: Raw dataset entries (typically containing a question or instruction) must be formatted into the chat template expected by the target model. Different model families use different prompt formats:

LLaMA format: Uses [INST] and [/INST] delimiters, as in Llama 2 chat models
LLaMA 3 format: Uses <|begin_of_text|> followed by <|start_header_id|>role<|end_header_id|> headers
Granite format: Uses <|start_of_role|> and <|end_of_role|> delimiters
ChatML format: Uses <|im_start|> and <|im_end|> delimiters
Gemma format: Uses <start_of_turn> and <end_of_turn> delimiters

The separation of data loading from prompt formatting is a key design decision. It allows the same dataset to be reused across different model configurations simply by changing the format parameter.

Usage

Use dataset loading when performing bulk inference evaluations, benchmarks, or batch processing tasks. The pattern is especially useful for running the same set of prompts through different models or model configurations for comparison.

Theoretical Basis

Dataset Loading Pipeline:

1. CHECK CACHE:
   - Look for local JSONL file at: data/{ds_name}_{category}_{split}.jsonl
   - If exists: load and return

2. DOWNLOAD:
   - Call datasets.load_dataset(ds_name, category, split=split)
   - Convert to list of dicts
   - Write to JSONL cache file

3. RETURN:
   - List of dicts, each representing one dataset row

Prompt Formatting Pipeline:

1. SELECT FORMAT:
   - Match prompt_format string to template function

2. APPLY TEMPLATE:
   - Insert system prompt (sp) and user prompt (p) into format
   - Return formatted string ready for tokenization

Example (ChatML format):
   Input:  sp="You are helpful.", p="What is 2+2?"
   Output: "<|im_start|>system\nYou are helpful.<|im_end|>\n
            <|im_start|>user\nWhat is 2+2?<|im_end|>\n
            <|im_start|>assistant\n"
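
The two formatting steps can be sketched as a lookup table of template functions. This is a hedged illustration: the names PROMPT_FORMATS, format_chatml, format_llama, and format_prompt are hypothetical, and the Llama template is a simplified Llama 2 style layout that leaves the BOS token to the tokenizer. The ChatML template is written to reproduce the example above as a single string.

```python
def format_chatml(sp, p):
    """ChatML: each turn is wrapped in <|im_start|>role ... <|im_end|>."""
    return (
        f"<|im_start|>system\n{sp}<|im_end|>\n"
        f"<|im_start|>user\n{p}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def format_llama(sp, p):
    """Simplified Llama 2 style: system prompt inside <<SYS>>, turn inside [INST]."""
    return f"[INST] <<SYS>>\n{sp}\n<</SYS>>\n\n{p} [/INST]"


# Hypothetical registry: maps a prompt_format string to its template function.
PROMPT_FORMATS = {
    "chatml": format_chatml,
    "llama": format_llama,
}


def format_prompt(prompt_format, sp, p):
    """Select a template by name and apply it to the system and user prompts."""
    try:
        template = PROMPT_FORMATS[prompt_format]
    except KeyError:
        raise ValueError(f"Unknown prompt format: {prompt_format!r}") from None
    return template(sp, p)


if __name__ == "__main__":
    print(format_prompt("chatml", "You are helpful.", "What is 2+2?"))
```

Because the dataset rows never pass through the registry, switching models only requires a different prompt_format string; the cached data stays untouched.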

The JSONL caching format stores one JSON object per line, making it efficient for both sequential reading and appending. Each line represents a complete dataset row that can be parsed independently.
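
A short sketch of why the line-per-row layout helps, using hypothetical helper names append_row and iter_rows: a row is appended without rewriting the file, and rows are read back one at a time without loading the whole file into memory.

```python
import json


def append_row(path, row):
    """Append one dataset row as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def iter_rows(path):
    """Yield rows lazily; each line is parsed independently of the others."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```

json.dumps escapes newlines inside string values, so a row always occupies exactly one line in the file.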

Related Pages

Implemented By

Implementation:Turboderp_org_Exllamav2_Get_Dataset
