Principle:Snorkel team Snorkel Dict Dataset Preparation

Knowledge Sources	An Overview of Multi-Task Learning in Deep Neural Networks
Domains	Data_Preparation, Multi_Task_Learning, Deep_Learning
Last Updated	2026-02-14 20:00 GMT

Overview

A data organization pattern that structures multi-field features and multi-task labels as dictionary-indexed datasets for flexible multi-task training.

Description

Dict Dataset Preparation organizes data for multi-task learning using dictionary-based indexing. Unlike standard single-task datasets where features are a single tensor and labels are a single vector, multi-task datasets require:

X_dict: Multiple named feature fields (e.g., tokens, embeddings, metadata)
Y_dict: Multiple named label sets (one per task)

This dictionary structure allows different tasks to share the same feature fields while having independent label spaces, which is essential for multi-task learning.
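As a minimal sketch of this layout, the snippet below builds an X_dict and a Y_dict from plain Python lists and checks that every field and label set covers the same examples. The field names ("tokens", "length"), the task names ("sentiment_task", "topic_task"), and the helper check_aligned are hypothetical and chosen only for illustration; they are not part of Snorkel, and real datasets would typically hold tensors rather than lists.

```python
# Hypothetical field and task names, chosen only for illustration.
X_dict = {
    "tokens": [["good", "movie"], ["bad", "plot"], ["fine", "cast"]],
    "length": [2, 2, 2],
}
Y_dict = {
    "sentiment_task": [1, 0, 1],  # binary label space
    "topic_task": [2, 0, 1],      # three-class label space
}


def check_aligned(X_dict, Y_dict):
    """Return the shared example count n; raise if any field or label set differs."""
    sizes = {name: len(values) for name, values in {**X_dict, **Y_dict}.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"Misaligned fields: {sizes}")
    return next(iter(sizes.values()))


n = check_aligned(X_dict, Y_dict)
```

Both tasks read the same feature fields, while each keeps its own label set with its own number of classes.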

Usage

Use this principle when preparing data for any Snorkel MultitaskClassifier. Organize features into named fields and labels into named per-task label sets, then wrap the resulting dataset in a DictDataLoader for batch iteration.
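The batching idea can be illustrated without any library. The sketch below is a pure-Python stand-in, not Snorkel's DictDataLoader: the function iter_batches and the names "tokens", "task_a", and "task_b" are assumptions made for this example. It shows only the key property that every field and every label set is sliced over the same index range, so each batch stays aligned across fields and tasks.

```python
def iter_batches(X_dict, Y_dict, batch_size):
    """Yield (X_batch, Y_batch) dict pairs, slicing every entry over the same range."""
    n = len(next(iter(Y_dict.values())))
    for start in range(0, n, batch_size):
        stop = start + batch_size
        X_batch = {field: values[start:stop] for field, values in X_dict.items()}
        Y_batch = {task: labels[start:stop] for task, labels in Y_dict.items()}
        yield X_batch, Y_batch


# Hypothetical data: one feature field shared by two tasks.
X_dict = {"tokens": [["a"], ["b"], ["c"], ["d"], ["e"]]}
Y_dict = {"task_a": [0, 1, 0, 1, 1], "task_b": [2, 2, 0, 1, 0]}

batches = list(iter_batches(X_dict, Y_dict, batch_size=2))
```

With five examples and a batch size of two, the last batch is smaller than the others; a real loader would usually also handle shuffling and tensor collation, which this sketch leaves out.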

Theoretical Basis

In multi-task learning, a dataset consists of:

$\mathcal{D} = \{(x_{i}^{(1)}, \dots, x_{i}^{(F)}, y_{i}^{(1)}, \dots, y_{i}^{(T)})\}_{i = 1}^{n}$

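where $n$ is the number of examples, $F$ the number of feature fields, and $T$ the number of tasks. The dictionary structure maps each field name $f_{k}$ to the values $X^{(k)}$ of that field across all examples, and each task name $t_{j}$ to the labels $Y^{(j)}$ of that task: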

$X_{\text{dict}} = \{f_{k} \to X^{(k)}\}_{k = 1}^{F}, \quad Y_{\text{dict}} = \{t_{j} \to Y^{(j)}\}_{j = 1}^{T}$
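The mapping above amounts to transposing per-example records into per-field and per-task columns. The sketch below shows this under stated assumptions: the helpers to_dicts and example_at, and the names "tokens", "length", "task_a", and "task_b", are hypothetical and used only to make the correspondence with the formulas concrete (here $F = 2$, $T = 2$, $n = 3$).

```python
def to_dicts(records, field_names, task_names):
    """Transpose per-example records into column-wise X_dict and Y_dict."""
    X_dict = {f: [] for f in field_names}
    Y_dict = {t: [] for t in task_names}
    for record in records:
        for f in field_names:
            X_dict[f].append(record[f])
        for t in task_names:
            Y_dict[t].append(record[t])
    return X_dict, Y_dict


def example_at(X_dict, Y_dict, i):
    """Recover the i-th example: its F field values and its T labels."""
    x_i = {f: column[i] for f, column in X_dict.items()}
    y_i = {t: column[i] for t, column in Y_dict.items()}
    return x_i, y_i


# Hypothetical records: each one carries F = 2 fields and T = 2 labels.
records = [
    {"tokens": ["a", "b"], "length": 2, "task_a": 0, "task_b": 1},
    {"tokens": ["c"], "length": 1, "task_a": 1, "task_b": 2},
    {"tokens": ["d", "e", "f"], "length": 3, "task_a": 0, "task_b": 0},
]

X_dict, Y_dict = to_dicts(records, ["tokens", "length"], ["task_a", "task_b"])
x_1, y_1 = example_at(X_dict, Y_dict, 1)
```

Indexing every column at the same position $i$ recovers the original tuple, which is why all entries of both dictionaries must have length $n$.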

Related Pages

Implemented By

Implementation:Snorkel_team_Snorkel_DictDataset_Init
