

Principle:Cleanlab Cleanlab CIFAR CNN Architecture

From Leeroopedia
Revision as of 18:14, 16 February 2026 by Admin (Auto-imported from principles/Cleanlab_Cleanlab_CIFAR_CNN_Architecture.md)


Knowledge Sources
Domains: Deep Learning, Image Classification, Convolutional Neural Networks
Last Updated: 2026-02-09 00:00 GMT

Overview

A multi-block convolutional neural network architecture using batch normalization, leaky ReLU activations, and progressive channel scaling, designed as a robust baseline for image classification on small-resolution datasets like CIFAR-10.

Description

The CIFAR CNN architecture follows a well-established pattern in deep learning for image classification on 32x32 pixel images. The design organizes convolutional layers into three blocks that progressively transform spatial and channel dimensions. The first two blocks maintain spatial resolution through padding while increasing feature depth, then reduce spatial dimensions via max pooling. The third block shrinks the spatial dimensions further and brings the channel count down to 128, after which global average pooling collapses the spatial axes entirely.
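
The spatial bookkeeping described above can be traced with the standard convolution output-size formula. The following is a minimal sketch, not the library's code: the 3x3 kernels, stride 1, three convolutions per block, 2x2 max pooling, and the use of unpadded convolutions in the third block are all assumptions made for illustration, and the function names are hypothetical.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Side length after a square convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Side length after max pooling without padding."""
    return (size - kernel) // stride + 1


def trace_spatial(size=32, convs_per_block=3):
    """Record the feature-map side length at each stage of the three-block layout."""
    trace = [size]
    # Blocks 1 and 2 (assumed): padded 3x3 convs keep the resolution,
    # then 2x2 max pooling halves it.
    for _ in range(2):
        for _ in range(convs_per_block):
            size = conv_out(size, kernel=3, padding=1)
        size = pool_out(size)
        trace.append(size)
    # Block 3 (assumed): unpadded 3x3 convs shrink each side by 2 pixels.
    for _ in range(convs_per_block):
        size = conv_out(size, kernel=3, padding=0)
        trace.append(size)
    # Global average pooling collapses whatever remains to 1x1.
    trace.append(1)
    return trace


print(trace_spatial())  # [32, 16, 8, 6, 4, 2, 1]
```

Under these assumptions a 32x32 input reaches the global average pooling layer as a 2x2 map; a different padding or layer count would change the intermediate sizes but not the final 1x1 result.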

Key architectural choices include:

  • Leaky ReLU activation: Unlike standard ReLU, which zeroes out all negative values, leaky ReLU (with negative slope 0.01) keeps a small gradient for negative inputs. This helps prevent the "dying ReLU" problem, where neurons become permanently inactive during training, which is especially important when training on noisy data where loss landscapes may be more complex.
  • Batch normalization: Applied after every convolution and before the activation function, batch normalization stabilizes training by normalizing intermediate activations. This allows higher learning rates and reduces sensitivity to weight initialization, both critical for reliable training convergence on noisy datasets.
  • Dropout regularization: Applied only between blocks (not within), 2D spatial dropout randomly zeroes entire feature maps during training. This prevents co-adaptation of features and provides implicit ensembling, improving generalization especially when some training labels are incorrect.
  • Global average pooling: Instead of flattening the final feature maps and using a large fully-connected layer, global average pooling computes the mean of each feature map. This dramatically reduces parameter count, lowering overfitting risk and enforcing a stronger correspondence between feature maps and output categories.
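
The leaky ReLU and global average pooling points above can be checked with a few lines of plain Python. This is a minimal sketch rather than the library's code: the 0.01 slope and the 128 channels come from the text and 10 is the CIFAR-10 class count, while the 8x8 feature-map size and all helper names are hypothetical values chosen for illustration.

```python
def leaky_relu(x, negative_slope=0.01):
    """Pass positive inputs through; scale negative inputs by a small slope."""
    return x if x >= 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of leaky ReLU: never exactly zero, unlike standard ReLU."""
    return 1.0 if x >= 0 else negative_slope


def linear_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


# Hypothetical final feature-map shape, for illustration only.
channels, height, width, num_classes = 128, 8, 8, 10

# Flattening feeds every spatial position of every channel to the classifier.
flatten_head = linear_params(channels * height * width, num_classes)
# Global average pooling feeds one mean value per channel.
gap_head = linear_params(channels, num_classes)

print(leaky_relu_grad(-5.0))   # 0.01, so a negative unit still receives gradient
print(flatten_head, gap_head)  # 81930 1290
```

The saving in the classifier head scales with the spatial size of the final feature map, so the larger the map that would otherwise be flattened, the more global average pooling reduces the parameter count.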

Usage

This architecture is the right choice when:

  • Working with small-resolution image classification tasks (32x32 pixels), particularly CIFAR-10 or similar datasets.
  • Training on datasets known or suspected to contain label noise, where a robust yet not overly complex architecture is needed.
  • Using the co-teaching training algorithm, which requires two identical model instances to be trained simultaneously.
  • Needing a well-tested baseline model that balances capacity and regularization for noisy label research.

Theoretical Basis

The architecture follows the principle of progressive feature abstraction, where early layers capture low-level features (edges, textures) and deeper layers combine these into higher-level semantic representations. The channel progression pattern (128 -> 256 -> 512 -> 256 -> 128) forms a bottleneck structure:

  • Expansion phase: Channels increase from 128 to 512, expanding the feature representation to capture diverse patterns.
  • Compression phase: Channels decrease from 512 back to 128, forcing the network to distill the most discriminative features.
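
The two phases can be made concrete by walking the channel progression and counting the weights each transition would need. This is a rough sketch: the progression is taken from the text, but the 3x3 kernel size and the idea of one convolution per transition are assumptions, and the helper names are hypothetical.

```python
CHANNELS = [128, 256, 512, 256, 128]  # progression stated in the text


def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of a square convolution (kernel size is an assumption)."""
    return kernel * kernel * c_in * c_out + c_out


def describe(channels):
    """Label each channel transition and report its parameter count."""
    rows = []
    for c_in, c_out in zip(channels, channels[1:]):
        phase = "expansion" if c_out > c_in else "compression"
        rows.append((c_in, c_out, phase, conv_params(c_in, c_out)))
    return rows


for c_in, c_out, phase, n_params in describe(CHANNELS):
    print(f"{c_in:>3} -> {c_out:<3} {phase:<11} {n_params:>9,} parameters")
```

Under these assumptions most of the parameters sit in the two transitions touching the 512-channel layer, which is where the representation is widest.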

The final fully-connected layer maps the compressed 128-dimensional feature vector directly to class logits:

logit = W * h + b, where h is the globally averaged feature vector of dimension 128, W is the weight matrix with one row of 128 weights per class, and b is the per-class bias vector.
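
The path from feature maps to logits can be illustrated on a toy example. The sketch below uses 3 channels and 2 classes instead of 128 channels and the real class count so that the numbers are easy to follow; the values and function names are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce each 2D feature map (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def linear(W, h, b):
    """Compute W * h + b with one output per row of W."""
    return [
        sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
        for row, b_i in zip(W, b)
    ]


# Toy input: 3 channels, each a 2x2 feature map.
feature_maps = [
    [[1, 3], [5, 7]],
    [[0, 0], [0, 4]],
    [[2, 2], [2, 2]],
]
W = [[1, 0, 0], [0, 1, 1]]  # 2 classes x 3 channels
b = [0.5, -1.0]

h = global_average_pool(feature_maps)
logits = linear(W, h, b)
print(h)       # [4.0, 1.0, 2.0]
print(logits)  # [4.5, 2.0]
```

Because each entry of h summarizes exactly one feature map, each weight in W ties one channel to one class, which is the correspondence between feature maps and categories that global average pooling encourages.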

This bottleneck design is particularly effective for noisy label settings because the compression forces the network to learn robust, generalizable features rather than memorizing individual noisy examples. The co-teaching training procedure further exploits this by training two identical architectures that cross-select clean examples for each other.
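
The cross-selection step can be sketched as follows. This assumes the commonly described small-loss rule for co-teaching, in which each network keeps the fraction of a batch on which its own loss is lowest and hands those examples to its peer; the forget_rate parameter, the function names, and the loss values are hypothetical and not taken from this page.

```python
def small_loss_indices(losses, keep_fraction):
    """Indices of the examples with the lowest loss, in ascending index order."""
    n_keep = int(len(losses) * keep_fraction)
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def cross_select(losses_a, losses_b, forget_rate):
    """Each model picks its small-loss examples for the other model's update."""
    keep = 1.0 - forget_rate
    for_b = small_loss_indices(losses_a, keep)  # chosen by model A, trains model B
    for_a = small_loss_indices(losses_b, keep)  # chosen by model B, trains model A
    return for_a, for_b


# Hypothetical per-example losses from two identical networks on one batch.
losses_a = [0.2, 2.5, 0.1, 0.9]
losses_b = [0.3, 0.4, 3.0, 0.2]

for_a, for_b = cross_select(losses_a, losses_b, forget_rate=0.5)
print(for_a)  # [0, 3]
print(for_b)  # [0, 2]
```

In a full training loop each model would then compute its gradient step only on the indices selected by its peer, which is why two instances of the same architecture are needed.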
