
Principle:LaurentMazare Tch rs YOLO Object Detection

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Computer Vision, Object Detection
Last Updated 2026-02-08 00:00 GMT

Overview

Single-pass object detection divides an image into a grid and simultaneously predicts bounding boxes, objectness scores, and class probabilities for all cells in one forward pass.

Description

YOLO (You Only Look Once) reformulates object detection as a single regression problem rather than the traditional two-stage approach of region proposal followed by classification. The input image is divided into an S×S grid. Each grid cell is responsible for predicting objects whose center falls within that cell.

For each grid cell, the network predicts B bounding boxes, each consisting of:

  • Center coordinates (x, y) relative to the grid cell
  • Width and height (w, h) relative to the full image, often predicted as offsets from anchor boxes (pre-defined aspect ratios)
  • An objectness score indicating confidence that the box contains an object
  • Class probabilities for each of the C object categories

The predictions are made in a single forward pass through the network, making YOLO significantly faster than two-stage detectors. The output tensor has shape S×S×B×(5+C), where 5 accounts for the four box coordinates plus objectness.
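The output size described above can be computed directly. A minimal sketch in Rust, using S=13, B=3, C=80 purely as illustrative values (a COCO-style head at its coarsest scale); the function name is hypothetical:

```rust
// Sketch: number of values in the YOLO output tensor S×S×B×(5+C).
// The 5 covers the four box coordinates plus the objectness score.
fn yolo_output_len(s: usize, b: usize, c: usize) -> usize {
    s * s * b * (5 + c)
}

fn main() {
    // Illustrative values: 13×13 grid, 3 boxes per cell, 80 classes.
    let n = yolo_output_len(13, 3, 80);
    println!("output tensor has {} values", n); // 13*13*3*85 = 43095
}
```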

Non-maximum suppression (NMS) is applied as a post-processing step to remove duplicate detections. When multiple bounding boxes overlap significantly (measured by intersection over union), only the box with the highest confidence score is retained.

Usage

Apply the YOLO detection principle when:

  • Real-time object detection is required (video streams, robotics, autonomous driving)
  • Speed is prioritized over maximum accuracy on small or overlapping objects
  • Detecting objects across multiple scales using feature pyramid approaches
  • A single unified architecture is preferred over multi-stage pipelines

Theoretical Basis

Grid-Based Prediction

The image is divided into an S×S grid. Each cell predicts B bounding boxes. Each box prediction includes:

(t_x, t_y, t_w, t_h, t_o)

These raw predictions are transformed using anchor box priors (p_w, p_h):

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

b_w = p_w · e^{t_w}

b_h = p_h · e^{t_h}

where (c_x, c_y) is the top-left corner of the grid cell and σ is the sigmoid function.
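The four transforms above can be sketched as a single decoding step. A minimal Rust sketch; coordinates are in grid units, and the function names and sample values are illustrative, not part of any library:

```rust
// Sketch: decode raw predictions (t_x, t_y, t_w, t_h) into a box
// (b_x, b_y, b_w, b_h) using the grid-cell offset (c_x, c_y) and
// anchor priors (p_w, p_h), following the equations above.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn decode_box(
    t: (f64, f64, f64, f64),      // (t_x, t_y, t_w, t_h)
    cell: (f64, f64),             // (c_x, c_y): cell's top-left corner
    anchor: (f64, f64),           // (p_w, p_h): anchor prior
) -> (f64, f64, f64, f64) {
    let (tx, ty, tw, th) = t;
    let (cx, cy) = cell;
    let (pw, ph) = anchor;
    (
        sigmoid(tx) + cx, // b_x: sigmoid keeps the center inside the cell
        sigmoid(ty) + cy, // b_y
        pw * tw.exp(),    // b_w: exponential scales the anchor width
        ph * th.exp(),    // b_h
    )
}

fn main() {
    // Zero offsets place the center in the middle of cell (3, 4)
    // and leave the anchor size unchanged.
    let b = decode_box((0.0, 0.0, 0.0, 0.0), (3.0, 4.0), (1.5, 2.0));
    println!("{:?}", b); // (3.5, 4.5, 1.5, 2.0)
}
```

Note the asymmetry: the center uses a sigmoid (bounded offset within the cell), while width and height use an exponential (unbounded positive scaling of the anchor).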

Objectness and Class Prediction

The objectness score is:

P(object) = σ(t_o)

Class probabilities are predicted per cell and combined with objectness:

P(class_i | object) · P(object) = P(class_i)
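This combination is a per-class multiplication by the objectness score. A minimal sketch, assuming the conditional class probabilities are already normalized (the function name and sample values are illustrative):

```rust
// Sketch: combine objectness with per-class conditional probabilities
// to obtain class-specific confidences P(class_i) = P(class_i | object) · σ(t_o).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn class_confidences(t_o: f64, cond_probs: &[f64]) -> Vec<f64> {
    let p_object = sigmoid(t_o); // P(object) = σ(t_o)
    cond_probs.iter().map(|p| p * p_object).collect()
}

fn main() {
    // t_o = 0 gives P(object) = 0.5; conditional probabilities are made up.
    let conf = class_confidences(0.0, &[0.6, 0.3, 0.1]);
    println!("{:?}", conf); // [0.3, 0.15, 0.05]
}
```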

Intersection over Union (IoU)

IoU measures the overlap between predicted box B_p and ground truth box B_gt:

IoU = |B_p ∩ B_gt| / |B_p ∪ B_gt|
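For axis-aligned boxes this ratio reduces to simple corner arithmetic. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```rust
// Sketch: IoU of two axis-aligned boxes (x1, y1, x2, y2).
// Intersection is the overlap rectangle (clamped at zero);
// union is area(a) + area(b) - intersection.
fn iou(a: (f64, f64, f64, f64), b: (f64, f64, f64, f64)) -> f64 {
    let ix = (a.2.min(b.2) - a.0.max(b.0)).max(0.0); // overlap width
    let iy = (a.3.min(b.3) - a.1.max(b.1)).max(0.0); // overlap height
    let inter = ix * iy;
    let area_a = (a.2 - a.0) * (a.3 - a.1);
    let area_b = (b.2 - b.0) * (b.3 - b.1);
    inter / (area_a + area_b - inter)
}

fn main() {
    // Two unit squares overlapping in a 0.5×1 strip:
    // intersection 0.5, union 1.5, so IoU = 1/3.
    println!("{}", iou((0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 1.5, 1.0)));
}
```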

Non-Maximum Suppression

After prediction, NMS filters redundant boxes:

  1. Sort all detections by confidence score
  2. Select the highest-scoring detection
  3. Remove all remaining detections whose IoU with the selected detection exceeds a threshold (e.g., 0.5)
  4. Repeat until no detections remain

Multi-Scale Detection

YOLOv3 predicts at three different scales by extracting features from different depths of the network. This enables detection of objects at varying sizes, with deeper features detecting larger objects and shallower features detecting smaller ones.
