Implementation:Pytorch Serve Spm Dataset

From Leeroopedia

Overview

Spm_Dataset provides dataset utilities for text classification built on SentencePiece tokenization. It tokenizes CSV data with a SentencePiece model and wraps the results in TextClassificationDataset objects from torchtext, covering both SentencePiece model generation and dataset preparation in a single pipeline.

Field | Value
Page Type | Implementation
Implementation Type | API Doc
Domains | Text_Classification, NLP
Knowledge Sources | Pytorch_Serve
Workflow | Text_Classification_Pipeline
Last Updated | 2026-02-13 18:52 GMT

Description

This module provides the two functions behind the SentencePiece-based text classification example. The first, _create_data_with_sp_transform, tokenizes raw CSV data using a SentencePiece generator. The second, setup_datasets, orchestrates the full dataset setup pipeline: downloading the dataset, generating or loading the SentencePiece model, and constructing the train and test datasets.

Key Responsibilities

  • Tokenization: Reads CSV files and tokenizes text content using a SentencePiece model
  • Dataset Construction: Wraps tokenized data and labels into TextClassificationDataset objects
  • SP Model Management: Generates or loads a SentencePiece model from training data
  • Dataset Download: Fetches standard text classification datasets via torchtext utilities
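
The tokenization and dataset-construction responsibilities above can be sketched as follows. This is a minimal, stdlib-only illustration and not the module's actual code: toy_encode is a hypothetical stand-in for the SentencePiece generator, and the CSV layout (integer label in the first column, text in the remaining columns) is an assumption.

```python
import csv
import os
import tempfile


def toy_encode(text):
    """Hypothetical stand-in for a SentencePiece encoder: one ID per word."""
    return [len(word) for word in text.split()]


def create_data_with_transform(encode, data_path):
    """Read a CSV file and return (data_list, label_set).

    Assumes column 0 holds an integer label and the remaining
    columns hold text; this layout is an illustration only.
    """
    data_list, labels = [], []
    with open(data_path, newline="", encoding="utf8") as f:
        for row in csv.reader(f):
            label = int(row[0])
            text = " ".join(row[1:])
            data_list.append((label, encode(text)))
            labels.append(label)
    return data_list, set(labels)


# Write a tiny two-row CSV so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf8") as tmp:
    csv.writer(tmp).writerows([[1, "Stocks rally", "Markets close higher"],
                               [2, "Cup final", "Home team wins"]])
    sample_path = tmp.name

sample_data, sample_labels = create_data_with_transform(toy_encode, sample_path)
os.remove(sample_path)
```

Passing the encoder in as an argument mirrors the sp_generator parameter described on this page, so a real SentencePiece encoder could be swapped in without changing the reading loop.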

Code Reference

Source Location

File | Lines | Repository
examples/text_classification/spm_dataset.py | L1-51 | pytorch/serve

Signature

def _create_data_with_sp_transform(sp_generator, data_path):
    """
    Tokenize CSV data using a SentencePiece model.

    Reads a CSV file, encodes each text entry as integer IDs
    using the SentencePiece generator, and collects labels.

    Args:
        sp_generator: A trained SentencePiece processor instance.
        data_path (str): Path to the CSV data file.

    Returns:
        tuple: (data_list, label_set) where data_list contains
               (label, token_ids) tuples and label_set is the
               set of unique labels found.
    """
    ...

def setup_datasets(dataset_name="AG_NEWS", root=".data", vocab_size=20000, include_unk=False):
    """
    Download dataset, generate/load SentencePiece model, and create
    train/test TextClassificationDataset instances.

    1. Downloads the specified dataset to the root directory
    2. Generates or loads a SentencePiece unigram model
    3. Tokenizes train and test CSV files
    4. Returns TextClassificationDataset objects

    Args:
        dataset_name (str): Name of the dataset (default: "AG_NEWS").
        root (str): Root directory for data storage (default: ".data").
        vocab_size (int): SentencePiece vocabulary size (default: 20000).
        include_unk (bool): Whether to include the unknown token (default: False).

    Returns:
        tuple: (train_dataset, test_dataset) as TextClassificationDataset objects.
    """
    ...
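
Step 2 of setup_datasets, generating a model once and loading it on later runs, follows a common cache-on-disk pattern. The sketch below shows only that pattern, assuming a JSON word-to-ID vocabulary as a hypothetical stand-in for the SentencePiece model file; build_vocab, load_or_build_vocab, and the file name toy_vocab.json are illustrative names, not part of the module.

```python
import json
import os
import tempfile


def build_vocab(lines, vocab_size):
    """Hypothetical 'training' step: keep the most frequent words."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ranked[:vocab_size])}


def load_or_build_vocab(model_path, lines, vocab_size):
    """Load the cached artifact if present; otherwise build and save it."""
    if not os.path.exists(model_path):
        with open(model_path, "w", encoding="utf8") as f:
            json.dump(build_vocab(lines, vocab_size), f)
    with open(model_path, encoding="utf8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "toy_vocab.json")
    corpus = ["the cat sat", "the dog sat", "the end"]
    first = load_or_build_vocab(path, corpus, vocab_size=3)   # builds and saves
    second = load_or_build_vocab(path, [], vocab_size=3)      # loads from disk
```

The second call receives an empty corpus yet returns the same vocabulary, which shows that the cached file, not the training data, is the source of truth once it exists.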

Import

from examples.text_classification.spm_dataset import setup_datasets

# External dependencies:
import sentencepiece as spm
from torchtext.datasets import TextClassificationDataset

I/O Contract

Function | Input | Output | Notes
_create_data_with_sp_transform(sp_generator, data_path) | sp_generator: SentencePiece processor; data_path: str path to CSV | tuple: (data_list, label_set), where data_list is a list of (label, token_ids) tuples | Lines 11-23; reads CSV row by row
setup_datasets(dataset_name, root, vocab_size, include_unk) | dataset_name: str; root: str; vocab_size: int; include_unk: bool | tuple: (train_dataset, test_dataset) as TextClassificationDataset instances | Lines 26-51; handles SP model generation and loading

Usage Examples

Example 1: Basic Dataset Setup

from examples.text_classification.spm_dataset import setup_datasets

# Create AG_NEWS datasets with SentencePiece tokenization
train_dataset, test_dataset = setup_datasets(
    dataset_name="AG_NEWS",
    root=".data",
    vocab_size=20000,
    include_unk=False
)

Example 2: Custom Vocabulary Size

from examples.text_classification.spm_dataset import setup_datasets

# Use a smaller vocabulary for faster training
train_dataset, test_dataset = setup_datasets(
    dataset_name="AG_NEWS",
    vocab_size=10000
)

# Access a dataset element: each item is a (label, token_ids) pair
label, tokens = train_dataset[0]
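
Because each element pairs a label with a variable-length sequence of token IDs, elements usually need to be padded or otherwise aligned before they can be batched. The following sketch shows one way to do this in plain Python; the sample data, the pad_batch helper, and the padding ID of 0 are assumptions for illustration and are not part of spm_dataset.py.

```python
def pad_batch(batch, pad_id=0):
    """Split (label, token_ids) pairs into labels and equal-length rows."""
    labels = [label for label, _ in batch]
    max_len = max(len(tokens) for _, tokens in batch)
    padded = [list(tokens) + [pad_id] * (max_len - len(tokens))
              for _, tokens in batch]
    return labels, padded


# Hypothetical elements shaped like the (label, tokens) pairs above.
toy_dataset = [(0, [12, 7, 95]), (2, [4]), (1, [33, 8])]
batch_labels, batch_tokens = pad_batch(toy_dataset)
```

A helper of this shape could serve as the basis for a collate function when the datasets are fed to a batching loader, provided the chosen padding ID does not collide with a real token ID.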
