Implementation: sail-sg LongSpec MultiMappingDataset
Knowledge Sources
| Field | Value |
|---|---|
| Domains | Data_Engineering, Training |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for loading and formatting training data through composable read functions, alignment transformations, and template-based field mapping.
Description
MultiMappingDataset is a PyTorch Dataset class that implements a composable data-loading pipeline. It reads raw data files using pluggable reader functions, applies alignment transformations, and formats output using string templates. The class supports:
- Pluggable read_fn for different file formats (JSONL, JSON, HuggingFace datasets)
- Pluggable aligner for data transformation (ID assignment, field restructuring)
- Template-based formatting with key-value mapping for flexible field composition (see the sketch after this list)
- Distributed splitting via split_size/split_id for multi-GPU training
- Data deduplication via flush_file to skip already-processed records
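The template step is the core of the "multi-mapping" idea: each output key gets its own format string, filled from fields of the aligned record. A minimal sketch of that step, using a hypothetical `format_sample` helper rather than the class's actual method:

```python
from typing import Dict

def format_sample(record: Dict, template: Dict[str, str]) -> Dict[str, str]:
    # Fill each "output key -> format string" pair from the record's fields;
    # {field} placeholders follow standard str.format semantics.
    sample = {key: fmt.format(**record) for key, fmt in template.items()}
    sample["meta_data"] = record  # the original record travels with the sample
    return sample

record = {"id": 0, "text": "The quick brown fox."}
print(format_sample(record, {"input": "{text}"}))
# -> {'input': 'The quick brown fox.', 'meta_data': {'id': 0, 'text': ...}}
```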
Usage
Import this class when setting up data loading for GLIDE draft-model training. It is typically instantiated via a Hydra config rather than constructed directly:
```yaml
# In Hydra YAML config:
dataset:
  _target_: data.combine_dataset.MultiMappingDataset
  file_path: /data/SlimPajama-6B/train_data.jsonl
  read_fn:
    _target_: data.input_utils.jsonl_read_fn
  aligner:
    _target_: data.input_aligner.add_id_aligner
  template:
    input: "{text}"
```
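Given such a config, the dataset is typically built with Hydra's `instantiate` helper. A sketch, assuming the block above is saved under a `dataset` key in `train_config.yaml` (a hypothetical path); runtime-only arguments such as the tokenizer are passed as overrides:

```python
import hydra
from omegaconf import OmegaConf
from transformers import AutoTokenizer

cfg = OmegaConf.load("train_config.yaml")  # hypothetical config path
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

# Nested _target_ entries (read_fn, aligner) are instantiated recursively;
# the tokenizer is supplied at call time since it is not in the YAML.
dataset = hydra.utils.instantiate(cfg.dataset, tokenizer=tokenizer)
```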
Code Reference
Source Location
- Repository: LongSpec
- File: longspec/train/data/combine_dataset.py
- Lines: L202-290
Signature
```python
class MultiMappingDataset(Dataset):
    def __init__(
        self,
        file_path: str,
        tokenizer: PreTrainedTokenizer,
        template: Dict[str, str] = None,
        aligner: Callable = empty_aligner,
        instruction: str = "",
        few_shot_prompt: str = "",
        api_based: bool = False,
        service_based: bool = False,
        service_processor: Callable = None,
        flush_file: str = None,
        split_size: int = -1,
        split_id: int = 0,
        index_field: str = "id",
        max_data_num: int = -1,
        read_fn: Callable = json_read_fn,
        kv_mapping: Dict[str, str] = None,
    ) -> None:
        """
        Args:
            file_path: Path to training data file (JSONL, JSON, etc.)
            tokenizer: HuggingFace tokenizer for text processing
            template: Dict mapping output keys to format strings with {field} placeholders
            aligner: Callable transforming raw data records (default: identity)
            instruction: Optional instruction prefix for all samples
            few_shot_prompt: Optional few-shot examples prefix
            api_based: If True, delegates to api_getitem (not used in GLIDE training)
            service_based: If True, delegates to service_getitem (not used)
            service_processor: Callable for service-based processing
            flush_file: Path to file listing already-processed record IDs
            split_size: Number of distributed splits (-1 = no splitting)
            split_id: Which split to use (0-indexed)
            index_field: Field name for record identification (default: "id")
            max_data_num: Maximum records to load (-1 = all)
            read_fn: Callable factory returning a data loader function
            kv_mapping: Optional key remapping dict for output fields
        """
```
Import
```python
from longspec.train.data.combine_dataset import MultiMappingDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to raw training data file |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer instance (stored but not used in __getitem__) |
| template | Dict[str, str] | No | Format strings mapping output keys to data fields |
| aligner | Callable | No | Data transformation function (default: identity) |
| read_fn | Callable | No | Factory returning a file reader (default: json_read_fn) |
| kv_mapping | Dict[str, str] | No | Output key remapping dictionary |
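As an illustration of `kv_mapping`, output keys can be renamed after template formatting. The mapping direction shown here (old key to new key) is an assumption; the source defines the actual semantics:

```python
sample = {"input": "The quick brown fox.", "meta_data": {"id": 0}}
kv_mapping = {"input": "prompt"}  # hypothetical remap: "input" -> "prompt"
sample = {kv_mapping.get(k, k): v for k, v in sample.items()}
# -> {"prompt": "The quick brown fox.", "meta_data": {"id": 0}}
```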
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ returns | Dict[str, str] | Dictionary with template-formatted fields and meta_data containing original record |
| __len__ returns | int | Number of samples (respecting max_data_num limit) |
Usage Examples
Basic JSONL Loading
```python
from longspec.train.data.combine_dataset import MultiMappingDataset
from longspec.train.data.input_utils import jsonl_read_fn
from longspec.train.data.input_aligner import add_id_aligner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

dataset = MultiMappingDataset(
    file_path="/data/SlimPajama-6B/train_data.jsonl",
    tokenizer=tokenizer,
    template={"input": "{text}"},
    aligner=add_id_aligner(),
    read_fn=jsonl_read_fn(),
)

# Access a sample
sample = dataset[0]
# Returns: {"input": "<formatted text>", "meta_data": {...}}
```
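Because the class subclasses `torch.utils.data.Dataset` and returns plain dicts, it composes with a standard PyTorch `DataLoader`; the batch size below is illustrative. The default collate turns each string field into a list:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=8, shuffle=True)
batch = next(iter(loader))
# batch["input"] is a list of 8 formatted strings; tokenization happens
# downstream, since the tokenizer is stored but not used in __getitem__.
```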
With Distributed Splitting
```python
import os

# Split data across 8 GPUs. The rank comes from the launcher; torchrun,
# for example, sets the RANK environment variable.
rank = int(os.environ.get("RANK", 0))

dataset = MultiMappingDataset(
    file_path="/data/train.jsonl",
    tokenizer=tokenizer,
    template={"input": "{text}"},
    read_fn=jsonl_read_fn(),
    split_size=8,        # Total number of splits
    split_id=rank,       # Current GPU rank (0-7)
    max_data_num=10000,  # Limit for debugging
)
```
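The exact partitioning scheme (contiguous chunks versus strided assignment) is determined by the source; conceptually, `split_size`/`split_id` give each rank a disjoint shard that covers the dataset exactly once. A strided illustration:

```python
def strided_split(records, split_size, split_id):
    # Rank split_id keeps every split_size-th record; the shards are
    # disjoint and together cover all records exactly once.
    return records[split_id::split_size]

shards = [strided_split(list(range(100)), 8, r) for r in range(8)]
assert sum(len(s) for s in shards) == 100
```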