Principle: RAPIDS cuML Support Vector Machines
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Classification, Regression |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
Support Vector Machines are a family of supervised learning methods that find the optimal separating hyperplane in a (possibly kernel-induced) feature space by maximizing the margin between classes, using the SMO algorithm for classification (SVC) and epsilon-insensitive loss for regression (SVR).
Description
Support Vector Machines (SVMs) solve classification and regression problems by mapping data into a high-dimensional feature space and finding the hyperplane that best separates the classes (for classification) or best fits the data within an epsilon tube (for regression). The key concepts are:
Maximum Margin Classification: Given linearly separable data, infinitely many hyperplanes can separate the classes. SVM chooses the hyperplane that maximizes the geometric margin -- the minimum distance from the decision boundary to the nearest training point. This maximum-margin principle provides strong generalization guarantees rooted in statistical learning theory. For non-separable data, a soft-margin formulation allows controlled misclassification via slack variables and a penalty parameter C.
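The effect of the penalty parameter C can be sketched with a small experiment. cuML's SVC mirrors the scikit-learn estimator API, so scikit-learn is used here for a CPU-runnable illustration; the dataset and parameter values are illustrative assumptions, not values from this document:

```python
# Sketch (scikit-learn API, which cuML's SVC mirrors): a smaller C tolerates
# more margin violations, so on overlapping classes more training points
# typically end up as support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two overlapping blobs (cluster_std chosen so the classes are not separable)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # loose margin, many slack points
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # strong penalty on violations

print(soft.n_support_.sum(), hard.n_support_.sum())
```

With the weak penalty, far more points sit inside or beyond the margin and become support vectors; increasing C shrinks the margin and the support set.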
Kernel Methods: When data is not linearly separable in the original feature space, SVM uses the kernel trick to implicitly map data into a higher-dimensional space where linear separation is possible. Common kernels include:
- Linear kernel: K(x, y) = x^T y -- no transformation; efficient for high-dimensional sparse data.
- RBF (Gaussian) kernel: K(x, y) = exp(-gamma * ||x - y||^2) -- maps to an infinite-dimensional space; the default for nonlinear problems.
- Polynomial kernel: K(x, y) = (gamma * x^T y + coef0)^degree -- maps to a finite-dimensional polynomial feature space.
- Sigmoid kernel: K(x, y) = tanh(gamma * x^T y + coef0) -- related to neural network activation functions.
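The four kernels above can be sketched directly in NumPy. The parameter names gamma, r, and d follow the usual gamma/coef0/degree hyperparameters; the values in the calls below are illustrative:

```python
# NumPy sketch of the standard SVM kernel functions.
import numpy as np

def linear(x, y):
    # inner product; no feature-space transformation
    return x @ y

def rbf(x, y, gamma=0.5):
    # Gaussian kernel: 1.0 when x == y, decaying with squared distance
    return np.exp(-gamma * np.sum((x - y) ** 2))

def polynomial(x, y, gamma=0.5, r=1.0, d=3):
    # maps to the space of degree-d monomials
    return (gamma * (x @ y) + r) ** d

def sigmoid(x, y, gamma=0.5, r=0.0):
    # tanh of a scaled inner product
    return np.tanh(gamma * (x @ y) + r)

x = np.array([1.0, 2.0])
print(rbf(x, x))  # identical points: RBF attains its maximum of 1.0
```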
Epsilon-Support Vector Regression (SVR): SVR fits a function within an epsilon-insensitive tube around the training data. Points within the tube incur zero loss; points outside the tube incur a linear penalty proportional to their distance from the tube boundary. This produces a sparse model where only the support vectors (points on or outside the tube boundary) contribute to the prediction.
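The epsilon tube's effect on sparsity can be sketched as follows, again using scikit-learn's SVR (whose interface cuML's SVR mirrors); the toy data is an assumption for illustration:

```python
# Sketch: epsilon-SVR on exact linear data. Points strictly inside the
# epsilon tube incur zero loss and need not become support vectors, so a
# wider tube tends to give a sparser model.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0   # noiseless line y = 2x + 1

narrow = SVR(kernel="linear", C=10.0, epsilon=0.01).fit(X, y)
wide = SVR(kernel="linear", C=10.0, epsilon=0.5).fit(X, y)

print(len(narrow.support_), len(wide.support_))
```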
Sequential Minimal Optimization (SMO): The SMO algorithm solves the SVM dual optimization problem by iteratively selecting pairs of Lagrange multipliers and analytically optimizing them while holding all others fixed. This decomposes the large quadratic programming problem into a sequence of minimal two-variable subproblems with closed-form solutions. Working set selection heuristics determine which pair to optimize at each step to achieve fast convergence.
Linear SVM: For linear kernels, specialized solvers can operate directly in the primal formulation, avoiding the quadratic kernel matrix entirely. This approach scales linearly with the number of samples and is highly efficient for large, sparse datasets.
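A primal linear solver avoids materializing the n-by-n kernel matrix entirely. The sketch below uses scikit-learn's LinearSVC for a CPU-runnable illustration (cuML ships an analogous cuml.svm.LinearSVC); dataset parameters are illustrative assumptions:

```python
# Sketch: training a linear SVM in the primal on a dataset where forming
# the full 5000 x 5000 kernel matrix would be wasteful.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

clf = LinearSVC(C=1.0).fit(X, y)   # primal/coordinate solver, no kernel matrix
print(clf.score(X, y))
```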
Usage
Support Vector Machines are the right choice when:
- The dataset is small to medium sized (up to tens of thousands of samples), as the kernel matrix scales quadratically.
- A strong theoretical guarantee on generalization is desired (margin theory).
- The decision boundary is nonlinear and a kernel can capture the relevant structure.
- The number of features is large relative to the number of samples (SVMs handle high-dimensional spaces well).
- Sparse models are desired, where only support vectors influence predictions.
For very large datasets, linear SVM or SGD-based linear classifiers are preferred over kernel SVM due to computational constraints. For problems requiring probability estimates, SVM can be augmented with Platt scaling, though native probabilistic models may be more appropriate.
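Platt scaling as mentioned above can be sketched with the standard probability flag, which fits a sigmoid to the decision values via internal cross-validation. Shown with scikit-learn's SVC; cuML's SVC exposes the same option:

```python
# Sketch: obtaining calibrated class probabilities from an SVM via
# Platt scaling (probability=True).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:5])   # one (p_class0, p_class1) row per sample
print(proba)
```

Note that the fitted sigmoid is a post-hoc calibration; the underlying decision function is unchanged.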
Theoretical Basis
Primal Formulation (Soft-Margin SVC):
minimize over w, b, xi:  (1/2) * ||w||^2 + C * sum_i xi_i
subject to:  y_i * (w^T phi(x_i) + b) >= 1 - xi_i  and  xi_i >= 0  for all i
Dual Formulation:
maximize over alpha:  sum_i alpha_i - (1/2) * sum_{i,j} alpha_i * alpha_j * y_i * y_j * K(x_i, x_j)
subject to:  0 <= alpha_i <= C  for all i,  and  sum_i alpha_i * y_i = 0
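The dual constraints can be checked on a fitted model. scikit-learn's SVC (and cuML's, which exposes the same attribute) stores the signed products y_i * alpha_i of the support vectors in dual_coef_, so both the box constraint and the equality constraint are directly observable; the dataset here is an illustrative assumption:

```python
# Sketch: verifying the dual constraints 0 <= alpha_i <= C and
# sum_i alpha_i * y_i = 0 on a fitted SVC.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

C = 1.0
clf = SVC(kernel="rbf", C=C).fit(X, y)

signed_alpha = clf.dual_coef_.ravel()   # y_i * alpha_i for each support vector
print(signed_alpha.sum(), np.abs(signed_alpha).max())
```

The sum is zero (to floating-point tolerance) because every SMO step preserves the equality constraint, and no |y_i * alpha_i| exceeds C because each multiplier is clipped to its box.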
SMO Two-Variable Update:
Select working set (i, j) using maximal violating pair heuristic:
i = argmax { -y_i * grad(f)_i : alpha_i not at upper bound }
j = argmin { -y_j * grad(f)_j : alpha_j not at lower bound }
Compute unclipped new value for alpha_j:
eta = K(x_i, x_i) + K(x_j, x_j) - 2 * K(x_i, x_j)
alpha_j_new = alpha_j_old + y_j * (E_i - E_j) / eta
Clip alpha_j_new to [L, H] bounds
Update alpha_i to maintain the linear constraint
Update the gradient cache
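The two-variable update above can be sketched in plain Python. This is a hypothetical helper for illustration, not cuML's internal kernel: E_i and E_j denote the prediction errors f(x_i) - y_i, and the box bounds [L, H] are derived from C and the labels so that both multipliers stay feasible:

```python
def smo_bounds(alpha_i, alpha_j, y_i, y_j, C):
    # bounds on alpha_j that keep both multipliers in [0, C] while
    # preserving alpha_i * y_i + alpha_j * y_j
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    return max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)

def smo_step(alpha_i, alpha_j, y_i, y_j, E_i, E_j, K_ii, K_jj, K_ij, C):
    L, H = smo_bounds(alpha_i, alpha_j, y_i, y_j, C)
    eta = K_ii + K_jj - 2.0 * K_ij      # curvature along the update direction
    if eta <= 0.0:                       # degenerate pair: skip the update
        return alpha_i, alpha_j
    # unclipped analytic optimum for alpha_j, then clip to the box
    alpha_j_new = alpha_j + y_j * (E_i - E_j) / eta
    alpha_j_new = min(H, max(L, alpha_j_new))
    # move alpha_i to preserve the equality constraint sum_k alpha_k * y_k
    alpha_i_new = alpha_i + y_i * y_j * (alpha_j - alpha_j_new)
    return alpha_i_new, alpha_j_new

# toy step with opposite labels and unit kernel diagonal (values assumed)
print(smo_step(0.2, 0.3, 1.0, -1.0, 0.5, -0.5, 1.0, 1.0, 0.2, 1.0))
```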
Epsilon-SVR Formulation:
minimize over w, b, xi, xi*:  (1/2) * ||w||^2 + C * sum_i (xi_i + xi*_i)
subject to:  y_i - (w^T phi(x_i) + b) <= epsilon + xi_i,  (w^T phi(x_i) + b) - y_i <= epsilon + xi*_i,  and  xi_i >= 0, xi*_i >= 0  for all i