
Implementation:Huggingface Optimum Get Dataset

From Leeroopedia

Overview

Functions for loading standard calibration datasets and preparing them for GPTQ quantization, including tokenization, batching, and padding.

Source

File: optimum/gptq/data.py

APIs

get_dataset

Lines: 206-245

def get_dataset(
    dataset_name: str,
    tokenizer: Any,
    nsamples: int = 128,
    seqlen: int = 2048,
    seed: int = 0,
    split: str = "train",
) -> List[Dict[str, torch.LongTensor]]:

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | (required) | Dataset name: "wikitext2", "c4", or "c4-new". |
| tokenizer | Any | (required) | Tokenizer for the model being quantized. |
| nsamples | int | 128 | Number of calibration samples to extract. |
| seqlen | int | 2048 | Sequence length for each sample. |
| seed | int | 0 | Random seed for reproducibility. |
| split | str | "train" | Dataset split: "train" or "validation". |

Behavior:

  1. Sets random seeds for random, numpy, and torch to ensure reproducibility.
  2. Looks up the dataset name in an internal dispatch map:
    • "wikitext2"get_wikitext2()
    • "c4"get_c4()
    • "c4-new"get_c4_new()
  3. Validates that split is either "train" or "validation".
  4. Raises ValueError for deprecated datasets ("ptb", "ptb-new") or unknown dataset names.
  5. Calls the appropriate dataset loader, which:
    • Loads the dataset via datasets.load_dataset().
    • Tokenizes the text.
    • Extracts nsamples random windows of seqlen tokens.
    • Returns a list of dictionaries with input_ids and attention_mask tensors.
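The behavior above can be sketched in pure torch. This is a minimal illustration, not the library's implementation: `get_dataset_sketch` and the `tokenized_ids` parameter are hypothetical stand-ins (the real loaders fetch and tokenize the corpus via datasets.load_dataset()), but the seeding, dispatch validation, and random-window extraction mirror the documented steps.

```python
import random
from typing import Dict, List

import numpy as np
import torch


def get_dataset_sketch(
    dataset_name: str,
    tokenized_ids: torch.LongTensor,  # hypothetical: a pre-tokenized corpus, shape (1, total_len)
    nsamples: int = 128,
    seqlen: int = 2048,
    seed: int = 0,
    split: str = "train",
) -> List[Dict[str, torch.LongTensor]]:
    # Step 1: seed every RNG the loaders touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    # Steps 2-4: validate split and dataset name; reject deprecated datasets.
    if split not in ("train", "validation"):
        raise ValueError(f"split must be 'train' or 'validation', got {split!r}")
    if dataset_name in ("ptb", "ptb-new"):
        raise ValueError(f"{dataset_name} is deprecated")
    if dataset_name not in ("wikitext2", "c4", "c4-new"):
        raise ValueError(f"unknown dataset {dataset_name!r}")

    # Step 5: extract nsamples random windows of seqlen tokens each.
    samples = []
    for _ in range(nsamples):
        i = random.randint(0, tokenized_ids.shape[1] - seqlen - 1)
        input_ids = tokenized_ids[:, i : i + seqlen]
        samples.append(
            {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}
        )
    return samples
```

Each returned sample is a dict of (1, seqlen) tensors, matching the shape get_dataset produces for downstream GPTQ calibration.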

prepare_dataset

Lines: 34-64

def prepare_dataset(
    examples: List[Dict[str, torch.LongTensor]],
    batch_size: int = 1,
    pad_token_id: Optional[int] = None,
) -> List[Dict[str, torch.LongTensor]]:

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| examples | List[Dict[str, torch.LongTensor]] | (required) | List of tokenized examples with input_ids and attention_mask. |
| batch_size | int | 1 | Number of examples per batch. |
| pad_token_id | Optional[int] | None | Pad token id. Required when batch_size > 1. |

Behavior:

  1. Converts all input_ids and attention_mask values to torch.LongTensor.
  2. Validates that pad_token_id is provided when batch_size > 1.
  3. Groups examples into batches of size batch_size using the collate_data() helper.
  4. Pads shorter sequences in each batch to the length of the longest sequence:
    • input_ids are padded with pad_token_id.
    • attention_mask values are padded with 0.
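The batching and padding steps can be sketched as follows. This is a simplified stand-in for prepare_dataset and its collate_data() helper (the function name `pad_and_batch` is hypothetical), but it follows the documented rules: pad_token_id is required for batch_size > 1, input_ids are padded with pad_token_id, and attention_mask is padded with 0.

```python
from typing import Dict, List, Optional

import torch
import torch.nn.functional as F


def pad_and_batch(
    examples: List[Dict[str, torch.LongTensor]],
    batch_size: int = 1,
    pad_token_id: Optional[int] = None,
) -> List[Dict[str, torch.LongTensor]]:
    # Step 2: a pad token is mandatory once batches can mix lengths.
    if batch_size > 1 and pad_token_id is None:
        raise ValueError("pad_token_id is required when batch_size > 1")

    batches = []
    # Step 3: group examples into chunks of batch_size.
    for start in range(0, len(examples), batch_size):
        chunk = examples[start : start + batch_size]
        max_len = max(e["input_ids"].shape[-1] for e in chunk)
        ids, masks = [], []
        for e in chunk:
            pad = max_len - e["input_ids"].shape[-1]
            # Step 4: pad input_ids with pad_token_id, attention_mask with 0.
            ids.append(F.pad(e["input_ids"], (0, pad), value=pad_token_id or 0))
            masks.append(F.pad(e["attention_mask"], (0, pad), value=0))
        batches.append(
            {
                "input_ids": torch.cat(ids, dim=0),
                "attention_mask": torch.cat(masks, dim=0),
            }
        )
    return batches
```

Note this sketch pads on the right; the padded mask positions are zero so the model ignores them during calibration.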

Import

from optimum.gptq.data import get_dataset, prepare_dataset

External Dependencies

| Dependency | Usage |
|---|---|
| datasets | Loading standard datasets via load_dataset(). |
| numpy | Random seed management. |
| torch | Tensor operations and random seed management. |
