Principle:Fastai Fastbook Data Collection

From Leeroopedia


Knowledge Sources
Domains Computer_Vision, Data_Engineering, Deep_Learning
Last Updated 2026-02-09 17:00 GMT

Overview

Data collection is the process of gathering, downloading, and validating raw image samples that will serve as the training and validation corpus for a supervised image classification model.

Description

Before any model can be trained, a practitioner must assemble a dataset of labeled images that represents the categories the model should distinguish. Data collection for image classification encompasses three sub-tasks:

  1. Discovery -- locating candidate images via web search APIs, existing datasets, or manual photography.
  2. Download -- transferring image files from remote URLs to local storage in an organized folder structure.
  3. Verification -- checking that every downloaded file is a valid, non-corrupted image that can be opened and decoded by the training framework.
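The download step can be sketched as follows. This is a minimal illustration, not the fastbook implementation: the `download_images` helper, its `fetch` parameter, and the numbered `.jpg` filenames are all assumptions made for the example. It writes each URL's bytes into a subfolder named after the class, and skips URLs that cannot be reached.

```python
import urllib.request
from pathlib import Path


def download_images(urls_by_label, root, fetch=None):
    """Save each URL under root/<label>/ and return the paths written.

    `fetch` is a hypothetical hook (url -> bytes) so the transfer
    mechanism can be swapped out or faked in tests.
    """
    if fetch is None:
        # Default fetcher: a plain HTTP GET with a timeout and no retries.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()

    written = []
    for label, urls in urls_by_label.items():
        dest = Path(root) / label
        dest.mkdir(parents=True, exist_ok=True)
        for i, url in enumerate(urls, start=1):
            try:
                data = fetch(url)
            except OSError:
                # Unreachable URL: skip it. Bad bytes are caught later,
                # in the verification step.
                continue
            # The .jpg suffix is a placeholder; the real format is not
            # known until the file is decoded.
            path = dest / f"{i:03d}.jpg"
            path.write_bytes(data)
            written.append(path)
    return written
```

Keeping the fetcher injectable means the folder layout can be exercised without network access. A production downloader would likely add retries, parallelism, and content-type checks.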

The quality and diversity of the collected data set the ceiling on model performance. No amount of architectural sophistication or training tricks can compensate for a dataset that is too small, mislabeled, or unrepresentative of the real-world distribution.

Usage

Use this technique at the very start of any image classification project when you do not already possess a curated dataset. It is also appropriate when augmenting an existing dataset with additional categories, or when refreshing stale training data with more recent images.

Theoretical Basis

Minimum Sample Size

While no universal formula exists, practical guidance from the fastai course suggests that 150 images per category is a reasonable starting point for transfer learning with a pretrained convolutional neural network. The pretrained backbone already encodes general visual features; the custom head only needs enough examples to learn the decision boundary between categories.
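One way to check a collected dataset against this rule of thumb is to count the image files in each class folder. The sketch below is illustrative only: the helper names, the list of file suffixes, and the use of 150 as the default threshold are assumptions chosen to match the figure quoted above.

```python
from pathlib import Path

# File suffixes treated as images (an illustrative, incomplete list).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images_per_class(root):
    """Map each class folder name under root to its number of image files."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def undersized_classes(counts, minimum=150):
    """Return the classes that fall short of the chosen per-class target."""
    return {name: n for name, n in counts.items() if n < minimum}
```

Classes reported by `undersized_classes` are candidates for another round of discovery and download before training.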

Label-by-Folder Convention

The simplest labeling scheme stores images in subdirectories named after their class:

dataset/
  grizzly/
    001.jpg
    002.jpg
  black/
    001.jpg
    002.jpg
  teddy/
    001.jpg
    002.jpg

This convention allows the label to be derived from the parent folder name, eliminating the need for a separate labels file.
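A minimal sketch of deriving labels this way, using only the standard library, might look like the following. The function names are hypothetical; fastai offers a comparable helper, `parent_label`, for the same purpose.

```python
from pathlib import Path


def label_from_path(path):
    """Return the class label, i.e. the name of the file's parent folder."""
    return Path(path).parent.name


def labeled_files(root):
    """Yield (path, label) pairs for every file one level below root."""
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file():
            yield path, label_from_path(path)
```

For the layout shown above, `label_from_path("dataset/grizzly/001.jpg")` returns `"grizzly"`.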

Verification Logic

Downloaded images may be corrupted (truncated transfers, server errors returning HTML instead of image bytes). Verification opens each file with an image decoder and marks any file that raises an exception. The pseudocode is:

for each file in image_folder:
    try:
        open_image(file)
        verify_dimensions(file)
    except DecodingError:
        mark_as_failed(file)
remove all marked files

Removing corrupted files before training prevents cryptic data-loading errors during the training loop.
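The pseudocode above could be made concrete along these lines. To stay dependency-free, this sketch swaps the full image decoder for a simple check of the file's leading bytes, which is an assumption of the example and a weaker test: it catches HTML served in place of an image but not a truncated transfer. The `decode` parameter is a hypothetical hook where a real decoder (for instance one built on Pillow's `Image.open` and `verify`) can be passed instead.

```python
from pathlib import Path

# Leading bytes of a few common formats (an illustrative, incomplete list).
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")


def looks_like_image(path):
    """Cheap stand-in for a decoder: raise ValueError on an unknown header."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    if not header.startswith(SIGNATURES):
        raise ValueError(f"unrecognized image header: {path}")


def find_failed(folder, decode=looks_like_image):
    """Return the files under folder for which decode() raises an exception."""
    failed = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        try:
            decode(path)
        except Exception:  # any decoding failure marks the file
            failed.append(path)
    return failed


def remove_failed(folder, decode=looks_like_image):
    """Delete every file that fails decoding and return the deleted paths."""
    failed = find_failed(folder, decode)
    for path in failed:
        path.unlink()
    return failed
```

Separating `find_failed` from `remove_failed` allows the list of marked files to be inspected before anything is deleted.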

Search Diversity

When using web search APIs, issuing multiple query variants for the same category (e.g., "grizzly bear", "grizzly bear photo", "grizzly bear wildlife") increases visual diversity and reduces the chance of downloading near-duplicate images that would inflate apparent dataset size without adding information.
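As a rough sketch of this idea, the helper below builds several query strings for one category and merges the results while dropping repeated URLs. The `search` argument is a hypothetical callable (query string in, list of URLs out) standing in for whichever search API is used, and the default suffixes simply mirror the example queries above. Filtering on URL only removes exact repeats; near-duplicate images hosted at different addresses would need a content-based comparison.

```python
def query_variants(category, suffixes=("", "photo", "wildlife")):
    """Build several search strings for one category."""
    return [f"{category} {s}".strip() for s in suffixes]


def collect_unique_urls(category, search, suffixes=("", "photo", "wildlife")):
    """Run every query variant and merge the results, dropping repeated URLs.

    `search` is an assumed callable: query string -> iterable of URLs.
    """
    seen = set()
    unique = []
    for query in query_variants(category, suffixes):
        for url in search(query):
            if url not in seen:
                seen.add(url)
                unique.append(url)  # first-seen order is preserved
    return unique
```

With the default suffixes, `query_variants("grizzly bear")` yields the three example queries listed above.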
