Implementation:Pytorch Serve Spm Dataset

Overview

Spm_Dataset builds text classification datasets tokenized with SentencePiece. It encodes the text in CSV files with a SentencePiece model and wraps the results in `TextClassificationDataset` objects from torchtext. The module covers both SentencePiece model generation and dataset preparation in a single pipeline.

Field	Value
Page Type	Implementation
Implementation Type	API Doc
Domains	Text_Classification, NLP
Knowledge Sources	Pytorch_Serve
Workflow	Text_Classification_Pipeline
Last Updated	2026-02-13 18:52 GMT

Description

This module provides the two functions behind the SentencePiece-based text classification example. `_create_data_with_sp_transform` tokenizes raw CSV data with a SentencePiece generator. `setup_datasets` orchestrates the full setup pipeline: downloading the dataset, generating or loading the SentencePiece model, and constructing the train and test datasets.

Key Responsibilities

Tokenization: Reads CSV files and tokenizes text content using a SentencePiece model
Dataset Construction: Wraps tokenized data and labels into TextClassificationDataset objects
SP Model Management: Generates or loads a SentencePiece model from training data
Dataset Download: Fetches standard text classification datasets via torchtext utilities
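
The tokenization and label-collection responsibilities can be sketched with the standard library alone. This is a minimal illustration, not the module's code: `ToyEncoder` and `tokenize_csv` are hypothetical names, the encoder is a whitespace stand-in for a trained SentencePiece processor, and the CSV layout (label in the first column, text in the remaining columns) is an assumption.

```python
import csv
import io


class ToyEncoder:
    """Hypothetical stand-in for a trained SentencePiece processor."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Give each new lowercase whitespace token the next free integer ID.
        return [
            self.vocab.setdefault(tok, len(self.vocab))
            for tok in text.lower().split()
        ]


def tokenize_csv(encoder, csv_file):
    """Return ([(label, token_ids), ...], set_of_labels).

    Assumes each row is laid out as: label, text column(s)...
    """
    data, labels = [], set()
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        label = int(row[0])
        token_ids = encoder.encode(" ".join(row[1:]))
        data.append((label, token_ids))
        labels.add(label)
    return data, labels


sample = io.StringIO(
    '3,"Stocks rally","Markets rise"\n'
    '1,"Team wins","Markets cheer"\n'
)
data_list, label_set = tokenize_csv(ToyEncoder(), sample)
```

The returned pair has the same shape as the `(data_list, label_set)` tuple described in the signature of `_create_data_with_sp_transform`; a real SentencePiece model would produce subword IDs rather than whole-word IDs.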

Code Reference

Source Location

File	Lines	Repository
`examples/text_classification/spm_dataset.py`	L1-51	pytorch/serve

Signature

def _create_data_with_sp_transform(sp_generator, data_path):
    """
    Tokenize CSV data using a SentencePiece model.

    Reads a CSV file, encodes each text entry as integer IDs
    using the SentencePiece generator, and collects labels.

    Args:
        sp_generator: A trained SentencePiece processor instance.
        data_path (str): Path to the CSV data file.

    Returns:
        tuple: (data_list, label_set) where data_list contains
               (label, token_ids) tuples and label_set is the
               set of unique labels found.
    """
    ...

def setup_datasets(dataset_name="AG_NEWS", root=".data", vocab_size=20000, include_unk=False):
    """
    Download dataset, generate/load SentencePiece model, and create
    train/test TextClassificationDataset instances.

    1. Downloads the specified dataset to the root directory
    2. Generates or loads a SentencePiece unigram model
    3. Tokenizes train and test CSV files
    4. Returns TextClassificationDataset objects

    Args:
        dataset_name (str): Name of the dataset (default: "AG_NEWS").
        root (str): Root directory for data storage (default: ".data").
        vocab_size (int): SentencePiece vocabulary size (default: 20000).
        include_unk (bool): Whether to include unknown token (default: False).

    Returns:
        tuple: (train_dataset, test_dataset) as TextClassificationDataset objects.
    """
    ...
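
Step 2 of the docstring above says the model is generated or loaded, but this page does not say when each happens. One plausible reading, shown here purely as an assumption, is a cache check: train only when no model file exists, then load it. `generate_or_load`, `toy_train`, and `toy_load` are hypothetical helpers; the toy trainer and loader stand in for the real SentencePiece calls.

```python
import os
import tempfile


def generate_or_load(model_path, train_fn, load_fn):
    """Train a model only if none is cached at model_path, then load it."""
    if not os.path.exists(model_path):
        train_fn(model_path)
    return load_fn(model_path)


train_calls = []


def toy_train(model_path):
    # Hypothetical trainer: records the call and writes a tiny vocabulary file.
    train_calls.append(model_path)
    with open(model_path, "w", encoding="utf8") as f:
        f.write("markets\nrally\nstocks\n")


def toy_load(model_path):
    # Hypothetical loader: maps each vocabulary line to its line index.
    with open(model_path, encoding="utf8") as f:
        return {piece: idx for idx, piece in enumerate(f.read().split())}


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "toy_spm.model")
    first = generate_or_load(path, toy_train, toy_load)   # trains, then loads
    second = generate_or_load(path, toy_train, toy_load)  # reuses the cached file
```

Under this assumption, repeated setup calls with the same `root` would pay the training cost only once.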

Import

from examples.text_classification.spm_dataset import setup_datasets

# External dependencies:
import sentencepiece as spm
from torchtext.datasets import TextClassificationDataset

I/O Contract

Function	Input	Output	Notes
`_create_data_with_sp_transform(sp_generator, data_path)`	`sp_generator`: SentencePiece processor; `data_path`: str path to CSV	`tuple`: (data_list, label_set) where data_list is list of (label, token_ids) tuples	Lines 11-23; reads CSV row-by-row
`setup_datasets(dataset_name, root, vocab_size, include_unk)`	`dataset_name`: str; `root`: str; `vocab_size`: int; `include_unk`: bool	`tuple`: (train_dataset, test_dataset) as `TextClassificationDataset` instances	Lines 26-51; handles SP model generation and loading
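
To make the contract concrete, the sketch below shows how a `(data_list, label_set)` pair could be wrapped so that indexing yields `(label, token_ids)` tuples. `ToyTextClassificationDataset` is a hypothetical stand-in written for this illustration; it is not torchtext's class, and the sample IDs are made up.

```python
class ToyTextClassificationDataset:
    """Hypothetical stand-in for a (label, token_ids) dataset wrapper."""

    def __init__(self, data, labels):
        self._data = data      # list of (label, token_ids) tuples
        self._labels = labels  # set of unique labels

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self):
        return len(self._data)

    def get_labels(self):
        return self._labels


# Made-up values shaped like the output of _create_data_with_sp_transform.
data_list = [(2, [5, 9, 1]), (0, [7, 3])]
label_set = {label for label, _ in data_list}

train_dataset = ToyTextClassificationDataset(data_list, label_set)
label, tokens = train_dataset[0]
```

Keeping the label set alongside the samples lets downstream code read the number of classes without rescanning the data.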

Usage Examples

Example 1: Basic Dataset Setup

from examples.text_classification.spm_dataset import setup_datasets

# Create AG_NEWS datasets with SentencePiece tokenization
train_dataset, test_dataset = setup_datasets(
    dataset_name="AG_NEWS",
    root=".data",
    vocab_size=20000,
    include_unk=False
)

Example 2: Custom Vocabulary Size

from examples.text_classification.spm_dataset import setup_datasets

# Use a smaller vocabulary for faster training
train_dataset, test_dataset = setup_datasets(
    dataset_name="AG_NEWS",
    vocab_size=10000
)

# Each dataset element is a (label, token_ids) pair
label, tokens = train_dataset[0]

Related Pages

Principle:Pytorch_Serve_Text_Classification - The text classification principle this implementation supports
Implementation:Pytorch_Serve_Scriptable_Tokenizer_Handler - Alternative text classification handler using scriptable tokenizer
Implementation:Pytorch_Serve_BaseHandler - Base handler class for inference pipeline
