Implementation: Huggingface Optimum Get Dataset
Overview
Functions for loading standard calibration datasets and preparing them for GPTQ quantization, including tokenization, batching, and padding.
Source
File: optimum/gptq/data.py
APIs
get_dataset
Lines: 206-245
def get_dataset(
dataset_name: str,
tokenizer: Any,
nsamples: int = 128,
seqlen: int = 2048,
seed: int = 0,
split: str = "train",
) -> List[Dict[str, torch.LongTensor]]:
Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | (required) | Dataset name: "wikitext2", "c4", or "c4-new". |
| tokenizer | Any | (required) | Tokenizer for the model being quantized. |
| nsamples | int | 128 | Number of calibration samples to extract. |
| seqlen | int | 2048 | Sequence length for each sample. |
| seed | int | 0 | Random seed for reproducibility. |
| split | str | "train" | Dataset split: "train" or "validation". |
Behavior:
- Sets random seeds for random, numpy, and torch to ensure reproducibility.
- Looks up the dataset name in an internal dispatch map:
  - "wikitext2" → get_wikitext2()
  - "c4" → get_c4()
  - "c4-new" → get_c4_new()
- Validates that split is either "train" or "validation".
- Raises ValueError for deprecated datasets ("ptb", "ptb-new") or unknown dataset names.
- Calls the appropriate dataset loader, which:
  - Loads the dataset via datasets.load_dataset().
  - Tokenizes the text.
  - Extracts nsamples random windows of seqlen tokens.
  - Returns a list of dictionaries with input_ids and attention_mask tensors.
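The window-extraction step above can be sketched in isolation. This is an illustrative approximation of the behavior described, not Optimum's actual internals; the function name `extract_windows` and the synthetic corpus are assumptions for demonstration.

```python
import random

import torch

def extract_windows(token_ids: torch.LongTensor, nsamples: int, seqlen: int, seed: int = 0):
    # token_ids: a (1, total_len) tensor of the tokenized corpus.
    # Draw `nsamples` random windows of `seqlen` tokens, as get_dataset's
    # loaders do after tokenization (illustrative sketch only).
    random.seed(seed)
    samples = []
    for _ in range(nsamples):
        start = random.randint(0, token_ids.shape[1] - seqlen - 1)
        input_ids = token_ids[:, start : start + seqlen]
        samples.append(
            {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}
        )
    return samples

# Synthetic "tokenized corpus" of 10,000 token ids in place of a real dataset.
corpus = torch.arange(10_000).unsqueeze(0)  # shape (1, 10000)
samples = extract_windows(corpus, nsamples=4, seqlen=16)
# Each sample is a dict with (1, seqlen) input_ids and an all-ones attention_mask.
```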
prepare_dataset
Lines: 34-64
def prepare_dataset(
examples: List[Dict[str, torch.LongTensor]],
batch_size: int = 1,
pad_token_id: Optional[int] = None,
) -> List[Dict[str, torch.LongTensor]]:
Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| examples | List[Dict[str, torch.LongTensor]] | (required) | List of tokenized examples with input_ids and attention_mask. |
| batch_size | int | 1 | Number of examples per batch. |
| pad_token_id | Optional[int] | None | Pad token id. Required when batch_size > 1. |
Behavior:
- Converts all input_ids and attention_mask values to torch.LongTensor.
- Validates that pad_token_id is provided when batch_size > 1.
- Groups examples into batches of size batch_size using the collate_data() helper.
- Pads shorter sequences in each batch to the length of the longest sequence:
  - input_ids are padded with pad_token_id.
  - attention_mask values are padded with 0.
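The padding behavior can be sketched as follows. This is a minimal approximation of what collate_data() does, not Optimum's implementation; the function name `collate` here is illustrative.

```python
import torch

def collate(batch, pad_token_id):
    # Pad every example in the batch on the right to the longest sequence:
    # input_ids with pad_token_id, attention_mask with 0 (sketch only).
    max_len = max(ex["input_ids"].shape[1] for ex in batch)
    input_ids, masks = [], []
    for ex in batch:
        pad = max_len - ex["input_ids"].shape[1]
        input_ids.append(
            torch.nn.functional.pad(ex["input_ids"], (0, pad), value=pad_token_id)
        )
        masks.append(torch.nn.functional.pad(ex["attention_mask"], (0, pad), value=0))
    return {
        "input_ids": torch.cat(input_ids, dim=0),
        "attention_mask": torch.cat(masks, dim=0),
    }

short = {"input_ids": torch.tensor([[5, 6]]), "attention_mask": torch.tensor([[1, 1]])}
longer = {"input_ids": torch.tensor([[7, 8, 9]]), "attention_mask": torch.tensor([[1, 1, 1]])}
batch = collate([short, longer], pad_token_id=0)
# input_ids -> [[5, 6, 0], [7, 8, 9]]; attention_mask -> [[1, 1, 0], [1, 1, 1]]
```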
Import
from optimum.gptq.data import get_dataset, prepare_dataset
External Dependencies

| Dependency | Usage |
|---|---|
| datasets | Loading standard datasets via load_dataset(). |
| numpy | Random seed management. |
| torch | Tensor operations and random seed management. |