Implementation:Sail sg LongSpec MultiMappingDataset

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Training
Last Updated 2026-02-14 05:00 GMT

Overview

Concrete tool for loading and formatting training data through composable read functions, alignment transformations, and template-based field mapping.

Description

MultiMappingDataset is a PyTorch Dataset class that implements the composable data loading pipeline. It reads raw data files using pluggable reader functions, applies alignment transformations, and formats output using string templates. The class supports:

  • Pluggable read_fn for different file formats (JSONL, JSON, HuggingFace datasets)
  • Pluggable aligner for data transformation (ID assignment, field restructuring)
  • Template-based formatting with key-value mapping for flexible field composition
  • Distributed splitting via split_size/split_id for multi-GPU training
  • Data deduplication via flush_file to skip already-processed records
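These pieces compose into a simple read → align → format pipeline. The sketch below is illustrative only, assuming `jsonl_read_fn` and `add_id_aligner` are zero-argument factories returning callables (consistent with the usage examples later on this page, but not verified against the LongSpec source):

```python
import json

# Hypothetical stand-ins for data.input_utils.jsonl_read_fn and
# data.input_aligner.add_id_aligner: factories returning callables.
def jsonl_read_fn():
    def read(path):
        # One JSON record per line.
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    return read

def add_id_aligner(index_field="id"):
    def align(records):
        # Assign a sequential integer ID to each record.
        for i, rec in enumerate(records):
            rec[index_field] = i
        return records
    return align

# read -> align, as MultiMappingDataset plausibly composes them:
# records = add_id_aligner()(jsonl_read_fn()(file_path))
```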

Usage

Import this class when setting up data loading for GLIDE draft-model training. It is typically instantiated via a Hydra config rather than constructed directly:

# In Hydra YAML config:
dataset:
  _target_: data.combine_dataset.MultiMappingDataset
  file_path: /data/SlimPajama-6B/train_data.jsonl
  read_fn:
    _target_: data.input_utils.jsonl_read_fn
  aligner:
    _target_: data.input_aligner.add_id_aligner
  template:
    input: "{text}"
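Hydra resolves each `_target_` string to a class or factory and calls it with the remaining keys as keyword arguments. A minimal stand-in for `hydra.utils.instantiate` (a sketch of the mechanism, not Hydra's actual implementation) makes this concrete:

```python
import importlib

def instantiate(cfg):
    """Simplified stand-in for hydra.utils.instantiate: resolve the
    `_target_` dotted path to a callable and call it with the remaining
    keys, recursively instantiating nested `_target_` dicts."""
    if isinstance(cfg, dict) and "_target_" in cfg:
        module_name, _, attr = cfg["_target_"].rpartition(".")
        target = getattr(importlib.import_module(module_name), attr)
        kwargs = {k: instantiate(v) for k, v in cfg.items() if k != "_target_"}
        return target(**kwargs)
    return cfg

# Illustrative only, with a stdlib target instead of MultiMappingDataset:
d = instantiate({"_target_": "datetime.date", "year": 2026, "month": 2, "day": 14})
```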

Code Reference

Source Location

  • Repository: LongSpec
  • File: longspec/train/data/combine_dataset.py
  • Lines: L202-290

Signature

class MultiMappingDataset(Dataset):
    def __init__(
        self,
        file_path: str,
        tokenizer: PreTrainedTokenizer,
        template: Dict[str, str] = None,
        aligner: Callable = empty_aligner,
        instruction: str = "",
        few_shot_prompt: str = "",
        api_based: bool = False,
        service_based: bool = False,
        service_processor: Callable = None,
        flush_file: str = None,
        split_size: int = -1,
        split_id: int = 0,
        index_field: str = "id",
        max_data_num: int = -1,
        read_fn: Callable = json_read_fn,
        kv_mapping: Dict[str, str] = None,
    ) -> None:
        """
        Args:
            file_path: Path to training data file (JSONL, JSON, etc.)
            tokenizer: HuggingFace tokenizer for text processing
            template: Dict mapping output keys to format strings with {field} placeholders
            aligner: Callable transforming raw data records (default: identity)
            instruction: Optional instruction prefix for all samples
            few_shot_prompt: Optional few-shot examples prefix
            api_based: If True, delegates to api_getitem (not used in GLIDE training)
            service_based: If True, delegates to service_getitem (not used)
            service_processor: Callable for service-based processing
            flush_file: Path to file listing already-processed record IDs
            split_size: Number of distributed splits (-1 = no splitting)
            split_id: Which split to use (0-indexed)
            index_field: Field name for record identification (default: "id")
            max_data_num: Maximum records to load (-1 = all)
            read_fn: Callable factory returning a data loader function
            kv_mapping: Optional key remapping dict for output fields
        """

Import

from longspec.train.data.combine_dataset import MultiMappingDataset

I/O Contract

Inputs

Name        Type                  Required  Description
file_path   str                   Yes       Path to raw training data file
tokenizer   PreTrainedTokenizer   Yes       Tokenizer instance (stored but not used in __getitem__)
template    Dict[str, str]        No        Format strings mapping output keys to data fields
aligner     Callable              No        Data transformation function (default: identity)
read_fn     Callable              No        Factory returning a file reader (default: json_read_fn)
kv_mapping  Dict[str, str]        No        Output key remapping dictionary

Outputs

Name                 Type            Description
__getitem__ returns  Dict[str, str]  Dictionary with template-formatted fields and meta_data containing the original record
__len__ returns      int             Number of samples (respecting the max_data_num limit)
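The output shape can be sketched as a template-format step plus optional kv_mapping renaming. This is a hypothetical reconstruction of the per-item step, with field names following the I/O contract above:

```python
def format_item(record, template, kv_mapping=None):
    """Hypothetical per-item formatting: fill each template string from
    the record's fields, optionally rename output keys via kv_mapping,
    and attach the original record as meta_data."""
    out = {key: fmt.format(**record) for key, fmt in template.items()}
    if kv_mapping:
        out = {kv_mapping.get(k, k): v for k, v in out.items()}
    out["meta_data"] = record
    return out
```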

Usage Examples

Basic JSONL Loading

from longspec.train.data.combine_dataset import MultiMappingDataset
from longspec.train.data.input_utils import jsonl_read_fn
from longspec.train.data.input_aligner import add_id_aligner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

dataset = MultiMappingDataset(
    file_path="/data/SlimPajama-6B/train_data.jsonl",
    tokenizer=tokenizer,
    template={"input": "{text}"},
    aligner=add_id_aligner(),
    read_fn=jsonl_read_fn(),
)

# Access a sample
sample = dataset[0]
# Returns: {"input": "<formatted text>", "meta_data": {...}}

With Distributed Splitting

# Split data across 8 GPUs
dataset = MultiMappingDataset(
    file_path="/data/train.jsonl",
    tokenizer=tokenizer,
    template={"input": "{text}"},
    read_fn=jsonl_read_fn(),
    split_size=8,       # Total number of splits
    split_id=rank,      # Current GPU rank (0-7)
    max_data_num=10000, # Limit for debugging
)
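Resuming an interrupted run relies on flush_file: records whose index_field value already appears there are skipped. A sketch, assuming one JSON record per line in the flush file (the actual on-disk format is not verified against the LongSpec source):

```python
import json

def load_flushed_ids(flush_file, index_field="id"):
    """Collect IDs already written to the flush file, if it exists."""
    try:
        with open(flush_file) as f:
            return {json.loads(line)[index_field] for line in f if line.strip()}
    except FileNotFoundError:
        return set()  # first run: nothing processed yet

def drop_flushed(records, flushed, index_field="id"):
    """Keep only records whose ID has not been processed yet."""
    return [r for r in records if r.get(index_field) not in flushed]
```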

Related Pages

Implements Principle

Requires Environment
