
Implementation:Huggingface Optimum Get Dataset

From Leeroopedia

Overview

Functions for loading standard calibration datasets and preparing them for GPTQ quantization, including tokenization, batching, and padding.

Source

File: optimum/gptq/data.py

APIs

get_dataset

Lines: 206-245

def get_dataset(
    dataset_name: str,
    tokenizer: Any,
    nsamples: int = 128,
    seqlen: int = 2048,
    seed: int = 0,
    split: str = "train",
) -> List[Dict[str, torch.LongTensor]]:

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | (required) | Dataset name: "wikitext2", "c4", or "c4-new". |
| tokenizer | Any | (required) | Tokenizer for the model being quantized. |
| nsamples | int | 128 | Number of calibration samples to extract. |
| seqlen | int | 2048 | Sequence length for each sample. |
| seed | int | 0 | Random seed for reproducibility. |
| split | str | "train" | Dataset split: "train" or "validation". |

Behavior:

  1. Sets random seeds for random, numpy, and torch to ensure reproducibility.
  2. Looks up the dataset name in an internal dispatch map:
    • "wikitext2"get_wikitext2()
    • "c4"get_c4()
    • "c4-new"get_c4_new()
  3. Validates that split is either "train" or "validation".
  4. Raises ValueError for deprecated datasets ("ptb", "ptb-new") or unknown dataset names.
  5. Calls the appropriate dataset loader, which:
    • Loads the dataset via datasets.load_dataset().
    • Tokenizes the text.
    • Extracts nsamples random windows of seqlen tokens.
    • Returns a list of dictionaries with input_ids and attention_mask tensors.
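The behavior above can be sketched in pure torch. This is a minimal illustration, not the library's implementation: `get_dataset_sketch` and the `tokenized_ids` parameter are hypothetical stand-ins (the real loaders fetch and tokenize the corpus via datasets.load_dataset()), but the seeding, dispatch validation, and random-window extraction mirror the documented steps.

```python
import random
from typing import Dict, List

import numpy as np
import torch


def get_dataset_sketch(
    dataset_name: str,
    tokenized_ids: torch.LongTensor,  # hypothetical: a pre-tokenized corpus, shape (1, total_len)
    nsamples: int = 128,
    seqlen: int = 2048,
    seed: int = 0,
    split: str = "train",
) -> List[Dict[str, torch.LongTensor]]:
    # Step 1: seed every RNG the loaders touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    # Steps 2-4: validate split and dataset name; reject deprecated datasets.
    if split not in ("train", "validation"):
        raise ValueError(f"split must be 'train' or 'validation', got {split!r}")
    if dataset_name in ("ptb", "ptb-new"):
        raise ValueError(f"{dataset_name} is deprecated")
    if dataset_name not in ("wikitext2", "c4", "c4-new"):
        raise ValueError(f"unknown dataset {dataset_name!r}")

    # Step 5: extract nsamples random windows of seqlen tokens each.
    samples = []
    for _ in range(nsamples):
        i = random.randint(0, tokenized_ids.shape[1] - seqlen - 1)
        input_ids = tokenized_ids[:, i : i + seqlen]
        samples.append(
            {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}
        )
    return samples
```

Each returned sample is a dict of (1, seqlen) tensors, matching the shape get_dataset produces for downstream GPTQ calibration.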

prepare_dataset

Lines: 34-64

def prepare_dataset(
    examples: List[Dict[str, torch.LongTensor]],
    batch_size: int = 1,
    pad_token_id: Optional[int] = None,
) -> List[Dict[str, torch.LongTensor]]:

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| examples | List[Dict[str, torch.LongTensor]] | (required) | List of tokenized examples with input_ids and attention_mask. |
| batch_size | int | 1 | Number of examples per batch. |
| pad_token_id | Optional[int] | None | Pad token id. Required when batch_size > 1. |

Behavior:

  1. Converts all input_ids and attention_mask values to torch.LongTensor.
  2. Validates that pad_token_id is provided when batch_size > 1.
  3. Groups examples into batches of size batch_size using the collate_data() helper.
  4. Pads shorter sequences in each batch to the length of the longest sequence:
    • input_ids are padded with pad_token_id.
    • attention_mask values are padded with 0.
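The batching and padding steps can be sketched as follows. This is a simplified stand-in for prepare_dataset and its collate_data() helper (the function name `pad_and_batch` is hypothetical), but it follows the documented rules: pad_token_id is required for batch_size > 1, input_ids are padded with pad_token_id, and attention_mask is padded with 0.

```python
from typing import Dict, List, Optional

import torch
import torch.nn.functional as F


def pad_and_batch(
    examples: List[Dict[str, torch.LongTensor]],
    batch_size: int = 1,
    pad_token_id: Optional[int] = None,
) -> List[Dict[str, torch.LongTensor]]:
    # Step 2: a pad token is mandatory once batches can mix lengths.
    if batch_size > 1 and pad_token_id is None:
        raise ValueError("pad_token_id is required when batch_size > 1")

    batches = []
    # Step 3: group examples into chunks of batch_size.
    for start in range(0, len(examples), batch_size):
        chunk = examples[start : start + batch_size]
        max_len = max(e["input_ids"].shape[-1] for e in chunk)
        ids, masks = [], []
        for e in chunk:
            pad = max_len - e["input_ids"].shape[-1]
            # Step 4: pad input_ids with pad_token_id, attention_mask with 0.
            ids.append(F.pad(e["input_ids"], (0, pad), value=pad_token_id or 0))
            masks.append(F.pad(e["attention_mask"], (0, pad), value=0))
        batches.append(
            {
                "input_ids": torch.cat(ids, dim=0),
                "attention_mask": torch.cat(masks, dim=0),
            }
        )
    return batches
```

Note this sketch pads on the right; the padded mask positions are zero so the model ignores them during calibration.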

Import

from optimum.gptq.data import get_dataset, prepare_dataset

External Dependencies

| Dependency | Usage |
|---|---|
| datasets | Loading standard datasets via load_dataset(). |
| numpy | Random seed management. |
| torch | Tensor operations and random seed management. |
