
Principle:LaurentMazare Tch rs YOLO Object Detection

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Computer Vision, Object Detection
Last Updated 2026-02-08 00:00 GMT

Overview

Single-pass object detection divides an image into a grid and simultaneously predicts bounding boxes, objectness scores, and class probabilities for all cells in one forward pass.

Description

YOLO (You Only Look Once) reformulates object detection as a single regression problem rather than the traditional two-stage approach of region proposal followed by classification. The input image is divided into an S×S grid. Each grid cell is responsible for predicting objects whose center falls within that cell.

For each grid cell, the network predicts B bounding boxes, each consisting of:

  • Center coordinates (x, y) relative to the grid cell
  • Width and height (w, h) relative to the full image, often predicted as offsets from anchor boxes (pre-defined aspect ratios)
  • An objectness score indicating confidence that the box contains an object
  • Class probabilities for each of the C object categories

The predictions are made in a single forward pass through the network, making YOLO significantly faster than two-stage detectors. The output tensor has shape S×S×B×(5+C), where 5 accounts for the four box coordinates plus objectness.
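The output size described above can be computed directly. A minimal sketch in Rust, using S=13, B=3, C=80 purely as illustrative values (a COCO-style head at its coarsest scale); the function name is hypothetical:

```rust
// Sketch: number of values in the YOLO output tensor S×S×B×(5+C).
// The 5 covers the four box coordinates plus the objectness score.
fn yolo_output_len(s: usize, b: usize, c: usize) -> usize {
    s * s * b * (5 + c)
}

fn main() {
    // Illustrative values: 13×13 grid, 3 boxes per cell, 80 classes.
    let n = yolo_output_len(13, 3, 80);
    println!("output tensor has {} values", n); // 13*13*3*85 = 43095
}
```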

Non-maximum suppression (NMS) is applied as a post-processing step to remove duplicate detections. When multiple bounding boxes overlap significantly (measured by intersection over union), only the box with the highest confidence score is retained.

Usage

Apply the YOLO detection principle when:

  • Real-time object detection is required (video streams, robotics, autonomous driving)
  • Speed is prioritized over maximum accuracy on small or overlapping objects
  • Detecting objects across multiple scales using feature pyramid approaches
  • A single unified architecture is preferred over multi-stage pipelines

Theoretical Basis

Grid-Based Prediction

The image is divided into an S×S grid. Each cell predicts B bounding boxes. Each box prediction includes:

(t_x, t_y, t_w, t_h, t_o)

These raw predictions are transformed using anchor box priors (p_w, p_h):

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

b_w = p_w · e^{t_w}

b_h = p_h · e^{t_h}

where (c_x, c_y) is the top-left corner of the grid cell and σ is the sigmoid function.
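The four transforms above can be sketched as a single decoding step. A minimal Rust sketch; coordinates are in grid units, and the function names and sample values are illustrative, not part of any library:

```rust
// Sketch: decode raw predictions (t_x, t_y, t_w, t_h) into a box
// (b_x, b_y, b_w, b_h) using the grid-cell offset (c_x, c_y) and
// anchor priors (p_w, p_h), following the equations above.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn decode_box(
    t: (f64, f64, f64, f64),      // (t_x, t_y, t_w, t_h)
    cell: (f64, f64),             // (c_x, c_y): cell's top-left corner
    anchor: (f64, f64),           // (p_w, p_h): anchor prior
) -> (f64, f64, f64, f64) {
    let (tx, ty, tw, th) = t;
    let (cx, cy) = cell;
    let (pw, ph) = anchor;
    (
        sigmoid(tx) + cx, // b_x: sigmoid keeps the center inside the cell
        sigmoid(ty) + cy, // b_y
        pw * tw.exp(),    // b_w: exponential scales the anchor width
        ph * th.exp(),    // b_h
    )
}

fn main() {
    // Zero offsets place the center in the middle of cell (3, 4)
    // and leave the anchor size unchanged.
    let b = decode_box((0.0, 0.0, 0.0, 0.0), (3.0, 4.0), (1.5, 2.0));
    println!("{:?}", b); // (3.5, 4.5, 1.5, 2.0)
}
```

Note the asymmetry: the center uses a sigmoid (bounded offset within the cell), while width and height use an exponential (unbounded positive scaling of the anchor).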

Objectness and Class Prediction

The objectness score is:

P(object) = σ(t_o)

Class probabilities are predicted per cell and combined with objectness:

P(class_i | object) · P(object) = P(class_i)
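This combination is a per-class multiplication by the objectness score. A minimal sketch, assuming the conditional class probabilities are already normalized (the function name and sample values are illustrative):

```rust
// Sketch: combine objectness with per-class conditional probabilities
// to obtain class-specific confidences P(class_i) = P(class_i | object) · σ(t_o).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn class_confidences(t_o: f64, cond_probs: &[f64]) -> Vec<f64> {
    let p_object = sigmoid(t_o); // P(object) = σ(t_o)
    cond_probs.iter().map(|p| p * p_object).collect()
}

fn main() {
    // t_o = 0 gives P(object) = 0.5; conditional probabilities are made up.
    let conf = class_confidences(0.0, &[0.6, 0.3, 0.1]);
    println!("{:?}", conf); // [0.3, 0.15, 0.05]
}
```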

Intersection over Union (IoU)

IoU measures the overlap between predicted box B_p and ground truth box B_gt:

IoU = |B_p ∩ B_gt| / |B_p ∪ B_gt|
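For axis-aligned boxes this ratio reduces to simple corner arithmetic. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```rust
// Sketch: IoU of two axis-aligned boxes (x1, y1, x2, y2).
// Intersection is the overlap rectangle (clamped at zero);
// union is area(a) + area(b) - intersection.
fn iou(a: (f64, f64, f64, f64), b: (f64, f64, f64, f64)) -> f64 {
    let ix = (a.2.min(b.2) - a.0.max(b.0)).max(0.0); // overlap width
    let iy = (a.3.min(b.3) - a.1.max(b.1)).max(0.0); // overlap height
    let inter = ix * iy;
    let area_a = (a.2 - a.0) * (a.3 - a.1);
    let area_b = (b.2 - b.0) * (b.3 - b.1);
    inter / (area_a + area_b - inter)
}

fn main() {
    // Two unit squares overlapping in a 0.5×1 strip:
    // intersection 0.5, union 1.5, so IoU = 1/3.
    println!("{}", iou((0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 1.5, 1.0)));
}
```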

Non-Maximum Suppression

After prediction, NMS filters redundant boxes:

  1. Sort all detections by confidence score
  2. Select the highest-scoring detection
  3. Remove all remaining detections whose IoU with the selected detection exceeds a threshold (e.g., 0.5)
  4. Repeat until no detections remain

Multi-Scale Detection

YOLOv3 predicts at three different scales by extracting features from different depths of the network. This enables detection of objects at varying sizes, with deeper features detecting larger objects and shallower features detecting smaller ones.
