Principle:Snorkel team Snorkel Dict Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Preparation, Multi_Task_Learning, Deep_Learning |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
A data organization pattern that structures multi-field features and multi-task labels as dictionary-indexed datasets for flexible multi-task training.
Description
Dict Dataset Preparation organizes data for multi-task learning using dictionary-based indexing. Unlike standard single-task datasets where features are a single tensor and labels are a single vector, multi-task datasets require:
- X_dict: Multiple named feature fields (e.g., tokens, embeddings, metadata)
- Y_dict: Multiple named label sets (one per task)
This dictionary structure allows different tasks to share the same feature fields while having independent label spaces, which is essential for multi-task learning.
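The layout described above can be sketched with plain Python dictionaries. This is only an illustration of the data structure, not Snorkel's implementation: the field names, task names, toy values, and the helpers `check_aligned` and `get_example` are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical toy data: three examples, two feature fields, two tasks.
X_dict: Dict[str, List[Any]] = {
    "tokens": [["good", "movie"], ["bad", "plot"], ["fine", "cast"]],
    "metadata": [{"year": 2019}, {"year": 2020}, {"year": 2021}],
}
Y_dict: Dict[str, List[int]] = {
    "sentiment": [1, 0, 1],  # binary label space
    "genre": [2, 0, 1],      # independent three-class label space
}


def check_aligned(X: Dict[str, List[Any]], Y: Dict[str, List[int]]) -> int:
    """Return the shared example count, or raise if any field or task disagrees."""
    lengths = [len(v) for v in X.values()] + [len(v) for v in Y.values()]
    if len(set(lengths)) != 1:
        raise ValueError(f"Misaligned field/label lengths: {lengths}")
    return lengths[0]


def get_example(
    X: Dict[str, List[Any]], Y: Dict[str, List[int]], i: int
) -> Tuple[Dict[str, Any], Dict[str, int]]:
    """Slice every field and every task at index i."""
    x = {field: values[i] for field, values in X.items()}
    y = {task: labels[i] for task, labels in Y.items()}
    return x, y


n_examples = check_aligned(X_dict, Y_dict)
x0, y0 = get_example(X_dict, Y_dict, 0)
```

Both tasks read the same `tokens` and `metadata` entries for a given example, while each task keeps its own label list, so adding a task means adding one key to `Y_dict` without touching the features.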
Usage
Use this principle when preparing data for any Snorkel MultitaskClassifier. Organize features into named fields (X_dict) and labels into named task labels (Y_dict), wrap both in a DictDataset, then pass that dataset to a DictDataLoader for batch iteration.
Theoretical Basis
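In multi-task learning, a dataset of $N$ examples consists of:

$$\mathcal{D} = \left\{ \left( x_i^{(1)}, \ldots, x_i^{(K)},\; y_i^{(1)}, \ldots, y_i^{(T)} \right) \right\}_{i=1}^{N}$$

where $K$ is the number of feature fields and $T$ the number of tasks. The dictionary structure maps:

$$\texttt{X\_dict}: \text{field}_k \mapsto \left( x_1^{(k)}, \ldots, x_N^{(k)} \right), \quad k = 1, \ldots, K$$

$$\texttt{Y\_dict}: \text{task}_t \mapsto \left( y_1^{(t)}, \ldots, y_N^{(t)} \right), \quad t = 1, \ldots, T$$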