Principle:Snorkel team Snorkel Dict Dataset Preparation

Knowledge Sources
Domains: Data_Preparation, Multi_Task_Learning, Deep_Learning
Last Updated: 2026-02-14 20:00 GMT

Overview

A data organization pattern that structures multi-field features and multi-task labels as dictionary-indexed datasets for flexible multi-task training.

Description

Dict Dataset Preparation organizes data for multi-task learning using dictionary-based indexing. Unlike standard single-task datasets, where the features are a single tensor and the labels are a single vector, multi-task datasets require:

  • X_dict: Multiple named feature fields (e.g., tokens, embeddings, metadata)
  • Y_dict: Multiple named label sets (one per task)

This dictionary structure allows different tasks to share the same feature fields while having independent label spaces, which is essential for multi-task learning.
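To make the sharing of feature fields concrete, here is a minimal pure-Python sketch. It is not Snorkel code: the field names (`tokens`, `metadata`), the task names (`sentiment`, `topic`), the `TASK_INPUTS` routing table, and both helper functions are hypothetical and exist only to illustrate the pattern. Plain lists stand in for tensors.

```python
# Hypothetical field and task names, chosen only for illustration.
X_dict = {
    "tokens": [[3, 7, 1], [4, 4, 9], [2, 8, 5], [6, 0, 1]],
    "metadata": [[0.1], [0.7], [0.3], [0.9]],
}
Y_dict = {
    "sentiment": [0, 1, 1, 0],  # binary label space
    "topic": [2, 0, 3, 1],      # four-class label space
}

# Hypothetical routing table: which feature fields each task reads.
TASK_INPUTS = {
    "sentiment": ["tokens"],
    "topic": ["tokens", "metadata"],
}


def validate(X_dict, Y_dict):
    """Check that every field and label set covers the same n examples."""
    sizes = {len(v) for v in X_dict.values()} | {len(v) for v in Y_dict.values()}
    if len(sizes) != 1:
        raise ValueError(f"Inconsistent example counts: {sorted(sizes)}")
    return sizes.pop()


def task_view(task, index):
    """Return (features, label) for one example as seen by one task."""
    features = {field: X_dict[field][index] for field in TASK_INPUTS[task]}
    return features, Y_dict[task][index]


n = validate(X_dict, Y_dict)
sentiment_example = task_view("sentiment", 0)
topic_example = task_view("topic", 0)
```

Both tasks read the same `tokens` entry for example 0, yet each draws its label from its own entry in `Y_dict`, so the label spaces never need to agree in size or meaning.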

Usage

Use this principle when preparing data for any Snorkel MultitaskClassifier. Organize features into named fields and labels into named per-task label sets, wrap both dictionaries in a DictDataset, then pass the dataset to a DictDataLoader for batch iteration.
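The batch iteration step can be sketched without any library. The generator below is a hypothetical stand-in, not Snorkel's DictDataLoader: it only shows that a batch is itself a pair of dictionaries, with every feature field and every label set sliced over the same range of examples. The names `iter_batches`, `tokens`, `task_a`, and `task_b` are assumptions made for the example.

```python
def iter_batches(X_dict, Y_dict, batch_size):
    """Yield (X_batch, Y_batch) dict pairs, slicing every entry the same way."""
    n = len(next(iter(Y_dict.values())))
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        X_batch = {field: values[start:stop] for field, values in X_dict.items()}
        Y_batch = {task: labels[start:stop] for task, labels in Y_dict.items()}
        yield X_batch, Y_batch


# Five examples, one feature field, two tasks (all names are illustrative).
X_dict = {"tokens": [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]}
Y_dict = {"task_a": [0, 1, 0, 1, 1], "task_b": [2, 2, 0, 1, 0]}

batches = list(iter_batches(X_dict, Y_dict, batch_size=2))
```

Because the keys survive batching, a model can look up the fields it needs by name inside each batch. The final batch is shorter when the example count is not a multiple of the batch size.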

Theoretical Basis

In multi-task learning, a dataset consists of:

$$\mathcal{D} = \left\{ \left( x_i^{(1)}, \ldots, x_i^{(F)}, y_i^{(1)}, \ldots, y_i^{(T)} \right) \right\}_{i=1}^{n}$$

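where $n$ is the number of examples, $F$ is the number of feature fields, and $T$ is the number of tasks. The dictionary structure maps each field name $f_k$ to the values of that field across all examples, and each task name $t_j$ to that task's labels: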

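$$X_{\text{dict}} = \left\{ f_k \mapsto X^{(k)} \right\}_{k=1}^{F}, \qquad Y_{\text{dict}} = \left\{ t_j \mapsto Y^{(j)} \right\}_{j=1}^{T}$$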
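Read this way, the mapping is a transposition: n row-wise examples become F feature columns and T label columns, each stored under its name. The following sketch assumes each example arrives as a flat tuple with the F feature values first and the T labels after them. The function `to_dicts` and all field and task names are hypothetical.

```python
def to_dicts(rows, field_names, task_names):
    """Transpose row-wise examples into name-indexed feature and label dicts.

    Each row is assumed to be (x^(1), ..., x^(F), y^(1), ..., y^(T)).
    """
    F, T = len(field_names), len(task_names)
    for i, row in enumerate(rows):
        if len(row) != F + T:
            raise ValueError(f"Row {i} has {len(row)} entries, expected {F + T}")
    # Column k of the feature part is stored under field name f_k.
    X_dict = {name: [row[k] for row in rows] for k, name in enumerate(field_names)}
    # Column j of the label part is stored under task name t_j.
    Y_dict = {name: [row[F + j] for row in rows] for j, name in enumerate(task_names)}
    return X_dict, Y_dict


# n = 3 examples, F = 2 feature fields, T = 2 tasks (illustrative values).
rows = [
    ([1, 2], 0.5, 0, 2),
    ([3, 4], 0.1, 1, 0),
    ([5, 6], 0.9, 1, 1),
]
X_dict, Y_dict = to_dicts(rows, ["tokens", "score"], ["task_a", "task_b"])
```

Every value in either dictionary has length n, which is the invariant that lets a loader index all fields and tasks with the same example index.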
