Principle:Fastai Fastbook DataBlock Construction

From Leeroopedia


Knowledge Sources
Domains Computer_Vision, Data_Engineering, Deep_Learning
Last Updated 2026-02-09 17:00 GMT

Overview

A data pipeline blueprint is a declarative specification that defines how raw data files are discovered, labeled, split, and transformed into model-ready tensors, without immediately loading any data.

Description

Training a supervised model requires answering five questions about the data:

  1. What are the input and output types? -- For image classification, the input type is an image and the output type is a categorical label.
  2. How do we get the raw items? -- A function that returns a list of file paths from a directory.
  3. How do we label each item? -- A function that extracts the class name from each file path (e.g., from the parent folder name or via a regex on the filename).
  4. How do we split into training and validation sets? -- A splitter function that assigns each item to one set, typically at random with a fixed seed for reproducibility.
  5. What transforms should be applied? -- Item-level transforms (applied per image, e.g., resize) and batch-level transforms (applied to a GPU batch, e.g., augmentation).

A data pipeline blueprint encodes the answers to all five questions as a reusable object. It does not touch any data until explicitly asked to materialize into data loaders. This separation of specification from execution allows practitioners to inspect, debug, and modify the pipeline independently of the actual data.
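The separation of specification from execution can be sketched framework-agnostically. The class and helper names below are hypothetical stand-ins for what fastai's `DataBlock` provides; the key point is that the blueprint stores *functions*, not data, and nothing runs until `materialize()` is called:

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import random

@dataclass
class PipelineBlueprint:
    get_items: Callable[[str], Sequence[str]]  # Q2: how to find raw items
    get_label: Callable[[str], str]            # Q3: how to label each item
    valid_pct: float = 0.2                     # Q4: fraction held out for validation
    seed: int = 42                             # Q4: fixed seed for a reproducible split

    def materialize(self, source: str):
        """Only here is any data actually touched."""
        items = list(self.get_items(source))
        rng = random.Random(self.seed)
        rng.shuffle(items)
        n_valid = int(len(items) * self.valid_pct)
        valid, train = items[:n_valid], items[n_valid:]
        return ([(p, self.get_label(p)) for p in train],
                [(p, self.get_label(p)) for p in valid])

# The spec is inspectable and reusable before any file is opened:
fake_listing = [f"pets/{cls}/{i}.jpg" for cls in ("cat", "dog") for i in range(5)]
bp = PipelineBlueprint(
    get_items=lambda src: fake_listing,   # stand-in for a directory walk
    get_label=lambda p: p.split("/")[1],  # parent-folder name as the class label
)
train, valid = bp.materialize("pets/")
```

Swapping the labeling strategy or split ratio means replacing one field of the blueprint, not rewriting the loading code.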

Usage

Construct a data pipeline blueprint whenever you are setting up a new image classification task. The blueprint pattern is especially valuable when you need to experiment with different labeling strategies, augmentation parameters, or train/validation splits without rewriting the data-loading code each time.

Theoretical Basis

The Five-Question Framework

The blueprint answers each question with a composable component:

Question          | Component          | Typical Choice (Image Classification)
------------------|--------------------|------------------------------------------
What types?       | Block types        | ImageBlock (input), CategoryBlock (output)
How to get items? | Item getter        | Function returning a list of file paths
How to label?     | Label getter       | Parent folder name, regex, or CSV lookup
How to split?     | Splitter           | Random split with fixed seed (e.g., 80/20)
What transforms?  | Transform pipeline | Resize + random augmentation

Presizing Strategy

A critical detail for image classification is the presizing technique:

  1. Item transform: Resize every image to a larger intermediate size (e.g., 460 pixels) using a random crop. This preserves more information than directly resizing to the final training size.
  2. Batch transform: Apply data augmentation (rotation, flipping, warping, lighting changes) and resize to the final training size (e.g., 224 pixels) on the GPU as a single interpolation step.

This two-step approach avoids stacking lossy interpolations: each geometric augmentation would otherwise require its own interpolation pass, so composing all geometric transforms with the final resize into a single operation preserves image quality.

Augmentation Rationale

Data augmentation synthetically expands the training set by applying random transformations that preserve the semantic label. For example, a horizontally flipped photo of a cat is still a cat. Common augmentations include:

  • Random horizontal flip
  • Rotation up to 10 degrees
  • Zoom between 1.0x and 1.1x
  • Brightness and contrast jitter
  • Perspective warping

The goal is to make the model invariant to these transformations so it generalizes better to unseen images.

Related Pages
