Overview
Spm_Dataset provides dataset utilities for text classification with SentencePiece tokenization. It tokenizes CSV data with a SentencePiece model and wraps the results in TextClassificationDataset objects from torchtext. The module handles both SentencePiece model generation and dataset preparation in a single pipeline.
Description
This module provides two functions that form the backbone of the SentencePiece-based text classification example. The first, _create_data_with_sp_transform, tokenizes raw CSV data using a SentencePiece generator. The second, setup_datasets, orchestrates the full dataset setup pipeline: downloading the dataset, generating or loading the SentencePiece (SP) model, and constructing the datasets.
Key Responsibilities
- Tokenization: Reads CSV files and tokenizes text content using a SentencePiece model
- Dataset Construction: Wraps tokenized data and labels into TextClassificationDataset objects
- SP Model Management: Generates or loads a SentencePiece model from training data
- Dataset Download: Fetches standard text classification datasets via torchtext utilities
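The "generates or loads" behaviour listed under SP Model Management is a common caching pattern: train only when the model file is missing, otherwise reuse it. The sketch below illustrates that pattern with the standard library only. The names `ensure_model` and `fake_train` are hypothetical and do not come from the module; a real trainer would fit a SentencePiece model instead of writing a placeholder file.

```python
import os
import tempfile


def ensure_model(model_path, train_fn):
    """Return model_path, calling train_fn(model_path) only if the file is missing."""
    if not os.path.exists(model_path):
        train_fn(model_path)
    return model_path


calls = []


def fake_train(path):
    # Hypothetical trainer: a real one would fit a SentencePiece model here.
    calls.append(path)
    with open(path, "w", encoding="utf8") as f:
        f.write("placeholder model")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.model")
    ensure_model(path, fake_train)  # file missing: the trainer runs
    ensure_model(path, fake_train)  # file present: the trainer is skipped
    train_count = len(calls)
```

With this structure, repeated calls do not pay the training cost again as long as the model file stays in place.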
Code Reference
Source Location
| File | Lines | Repository |
| examples/text_classification/spm_dataset.py | L1-51 | pytorch/serve |
Signature
def _create_data_with_sp_transform(sp_generator, data_path):
    """
    Tokenize CSV data using a SentencePiece model.

    Reads a CSV file, encodes each text entry as integer IDs
    using the SentencePiece generator, and collects labels.

    Args:
        sp_generator: A trained SentencePiece processor instance.
        data_path (str): Path to the CSV data file.

    Returns:
        tuple: (data_list, label_set) where data_list contains
            (label, token_ids) tuples and label_set is the
            set of unique labels found.
    """
    ...

def setup_datasets(dataset_name="AG_NEWS", root=".data", vocab_size=20000, include_unk=False):
    """
    Download dataset, generate/load SentencePiece model, and create
    train/test TextClassificationDataset instances.

    1. Downloads the specified dataset to root directory
    2. Generates or loads a SentencePiece unigram model
    3. Tokenizes train and test CSV files
    4. Returns TextClassificationDataset objects

    Args:
        dataset_name (str): Name of the dataset (default: "AG_NEWS").
        root (str): Root directory for data storage (default: ".data").
        vocab_size (int): SentencePiece vocabulary size (default: 20000).
        include_unk (bool): Whether to include unknown token (default: False).

    Returns:
        tuple: (train_dataset, test_dataset) as TextClassificationDataset objects.
    """
    ...
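The tokenization step described in the first docstring can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: `toy_encode` is a hypothetical stand-in for a SentencePiece encoder, `create_data` is an illustrative name, and the CSV layout (label in the first column, text in the remaining columns) is an assumption.

```python
import csv
import io


def toy_encode(text, vocab):
    """Hypothetical stand-in for a SentencePiece encoder.

    Splits on whitespace and assigns each new token the next free integer ID.
    """
    return [vocab.setdefault(tok, len(vocab)) for tok in text.lower().split()]


def create_data(encode, csv_file):
    """Return (data_list, label_set) from an open CSV file object."""
    data_list, label_set = [], set()
    for row in csv.reader(csv_file):
        label = int(row[0])        # assumption: label is the first column
        text = " ".join(row[1:])   # assumption: remaining columns are text
        data_list.append((label, encode(text)))
        label_set.add(label)
    return data_list, label_set


sample = io.StringIO(
    '3,"Wall St. Bears","Short-sellers are seeing green"\n'
    '1,"Talks resume","Leaders meet again"\n'
)
vocab = {}
data_list, label_set = create_data(lambda t: toy_encode(t, vocab), sample)
```

Passing the encoder in as a callable mirrors the documented signature, where the SentencePiece generator is supplied by the caller rather than created inside the function.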
Import
from examples.text_classification.spm_dataset import setup_datasets
# External dependencies:
import sentencepiece as spm
from torchtext.datasets import TextClassificationDataset
I/O Contract
| Function | Input | Output | Notes |
| _create_data_with_sp_transform(sp_generator, data_path) | sp_generator: SentencePiece processor; data_path: str path to CSV | tuple: (data_list, label_set) where data_list is list of (label, token_ids) tuples | Lines 11-23; reads CSV row-by-row |
| setup_datasets(dataset_name, root, vocab_size, include_unk) | dataset_name: str; root: str; vocab_size: int; include_unk: bool | tuple: (train_dataset, test_dataset) as TextClassificationDataset instances | Lines 26-51; handles SP model generation and loading |
Usage Examples
Example 1: Basic Dataset Setup
from examples.text_classification.spm_dataset import setup_datasets

# Create AG_NEWS datasets with SentencePiece tokenization
train_dataset, test_dataset = setup_datasets(
    dataset_name="AG_NEWS",
    root=".data",
    vocab_size=20000,
    include_unk=False,
)
Example 2: Custom Vocabulary Size
from examples.text_classification.spm_dataset import setup_datasets

# Use a smaller vocabulary for faster training
train_dataset, test_dataset = setup_datasets(
    dataset_name="AG_NEWS",
    vocab_size=10000,
)

# Access dataset elements
label, tokens = train_dataset[0]
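To show why indexing a dataset unpacks into a label and a token sequence, the sketch below wraps a list of (label, token_ids) tuples in a small container. `ToyClassificationDataset` is a hypothetical stand-in written for this illustration; it is not torchtext's TextClassificationDataset, and the sample IDs are made up.

```python
class ToyClassificationDataset:
    """Hypothetical stand-in that stores (label, token_ids) pairs and a label set."""

    def __init__(self, data, labels):
        self._data = data
        self._labels = labels

    def __getitem__(self, index):
        # Each element is the (label, token_ids) tuple built during tokenization.
        return self._data[index]

    def __len__(self):
        return len(self._data)

    def get_labels(self):
        return self._labels


# Made-up sample data in the (label, token_ids) shape described in the I/O contract.
data_list = [(2, [5, 17, 9]), (0, [3, 8])]
dataset = ToyClassificationDataset(data_list, {0, 2})

label, tokens = dataset[0]
```

Because each stored element is already a two-item tuple, tuple unpacking on `dataset[0]` yields the label and its token IDs directly.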