Quasi-Newton methods are a class of optimization algorithms that approximate the Hessian matrix (or its inverse) to achieve faster convergence than first-order methods while avoiding the computational cost of computing and inverting the exact Hessian as in Newton’s method.
Motivation
Newton’s method has excellent convergence properties but requires:
- Computing second derivatives to form the Hessian matrix
- Solving a linear system or inverting the Hessian at each iteration
For large-scale problems, these operations can be prohibitively expensive. Quasi-Newton methods address this by:
- Using only gradient information to build an approximation of the Hessian or its inverse
- Updating this approximation at each iteration using the observed changes in gradients
General Framework
Quasi-Newton methods follow the update rule:

$$\theta_{k+1} = \theta_k - \alpha_k B_k^{-1} \nabla L(\theta_k)$$

where:
- $\theta_k$ is the parameter vector at iteration $k$
- $\alpha_k$ is the step size determined by line search
- $B_k$ is an approximation to the Hessian matrix $\nabla^2 L(\theta_k)$
- $B_k^{-1}$ is an approximation to the inverse Hessian
Alternatively, some methods directly approximate the inverse Hessian, denoted as $H_k \approx B_k^{-1}$:

$$\theta_{k+1} = \theta_k - \alpha_k H_k \nabla L(\theta_k)$$
Secant Condition and Curvature Information
Quasi-Newton methods are based on the secant equation, which captures curvature information from successive gradient evaluations.
Define:
- $s_k = \theta_{k+1} - \theta_k$ (change in parameters)
- $y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k)$ (change in gradients)
For a quadratic function, the exact Hessian would satisfy $\nabla^2 L \, s_k = y_k$.
Quasi-Newton methods update the approximate Hessian (or its inverse) to satisfy:

$$B_{k+1} s_k = y_k \qquad \text{(equivalently, } H_{k+1} y_k = s_k\text{)}$$

This is known as the secant condition or quasi-Newton condition.
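To see why this is a natural requirement: for a quadratic objective the exact Hessian already satisfies the secant condition, which the short NumPy check below (with an arbitrary illustrative matrix $A$ and points) confirms.

```python
import numpy as np

# Quadratic L(θ) = ½ θᵀAθ with an arbitrary symmetric positive-definite A (illustrative values).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda theta: A @ theta            # ∇L(θ) = Aθ

theta_k, theta_k1 = np.array([1.0, -2.0]), np.array([0.5, 0.3])
s_k = theta_k1 - theta_k                  # change in parameters
y_k = grad(theta_k1) - grad(theta_k)      # change in gradients

print(np.allclose(y_k, A @ s_k))          # True: the exact Hessian satisfies B s_k = y_k
```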
Major Quasi-Newton Methods
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
The most popular quasi-Newton method, which directly approximates the inverse Hessian:

$$H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k}$$
BFGS Algorithm
- Initialize $\theta_0$ and an inverse Hessian approximation $H_0$ (often $H_0 = I$)
- For each iteration $k$:
  - Compute search direction: $d_k = -H_k \nabla L(\theta_k)$
  - Determine step size $\alpha_k$ via line search
  - Update parameters: $\theta_{k+1} = \theta_k + \alpha_k d_k$
  - Compute $s_k = \theta_{k+1} - \theta_k$ and $y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k)$
  - Update the inverse Hessian approximation to $H_{k+1}$ using the BFGS formula above
- Repeat until convergence
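As a concrete illustration, a minimal dense-BFGS loop might look like the sketch below; the objective `f`, gradient `grad`, and the simple Armijo backtracking are placeholder choices, not a production implementation.

```python
import numpy as np

def bfgs(f, grad, theta0, max_iter=100, tol=1e-6):
    """Minimal BFGS sketch: dense inverse-Hessian update with Armijo backtracking."""
    theta = np.asarray(theta0, dtype=float)
    n = theta.size
    H, g = np.eye(n), grad(theta)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                                   # search direction d_k = -H_k ∇L(θ_k)
        alpha = 1.0                                  # backtracking: sufficient decrease only
        while alpha > 1e-12 and f(theta + alpha * d) > f(theta) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        theta_new = theta + alpha * d
        g_new = grad(theta_new)
        s, y = theta_new - theta, g_new - g
        if y @ s > 1e-10:                            # only update if the curvature condition holds
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS inverse-Hessian update
        theta, g = theta_new, g_new
    return theta
```

Only Armijo sufficient decrease is enforced here, so the curvature check guards the update; a Wolfe line search (see the Line Search subsection below) would guarantee $y_k^\top s_k > 0$ directly.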
L-BFGS (Limited-memory BFGS)
A memory-efficient variant of BFGS for large-scale problems:
- Instead of storing the full $n \times n$ approximation matrix, L-BFGS stores only the last $m$ pairs of $(s_i, y_i)$ vectors
- The matrix-vector product $H_k \nabla L(\theta_k)$ is computed implicitly using these vectors
- Typically $m$ is between 3 and 20, regardless of the problem dimension
DFP (Davidon-Fletcher-Powell)
An earlier quasi-Newton method that also approximates the inverse Hessian:

$$H_{k+1} = H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top}{y_k^\top s_k}$$
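A direct transcription of this formula, as a sketch in the same notation as the BFGS example above:

```python
import numpy as np

def dfp_update(H, s, y):
    """DFP update of the inverse-Hessian approximation H, given s = Δθ and y = Δ∇L."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```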
SR1 (Symmetric Rank-One)
A simpler update that satisfies the secant condition:

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^\top}{(y_k - B_k s_k)^\top s_k}$$
- SR1 can produce indefinite Hessian approximations
- Often used in trust region methods rather than line search methods
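A sketch of the SR1 update with the usual skip rule: the denominator $(y_k - B_k s_k)^\top s_k$ can be arbitrarily small, so the update is skipped when it nearly vanishes (the $10^{-8}$ threshold here is a conventional choice, not mandated by the method).

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """SR1 update of the Hessian approximation B; skipped when the denominator is near zero."""
    v = y - B @ s
    denom = v @ s
    if abs(denom) < eps * np.linalg.norm(v) * np.linalg.norm(s):
        return B                       # standard safeguard: skip the update this iteration
    return B + np.outer(v, v) / denom
```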
Broyden’s Method
A generalization for non-symmetric matrices, useful in solving systems of nonlinear equations:

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^\top}{s_k^\top s_k}$$
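In the nonlinear-equations setting, $B_k$ approximates the Jacobian of $F$ in the system $F(x) = 0$ and $y_k = F(x_{k+1}) - F(x_k)$. A minimal sketch of one Broyden iteration under those assumptions:

```python
import numpy as np

def broyden_step(F, x, B):
    """One Broyden iteration for F(x) = 0: solve, step, then rank-one Jacobian update."""
    s = np.linalg.solve(B, -F(x))                   # quasi-Newton step: B s = -F(x)
    x_new = x + s
    y = F(x_new) - F(x)
    B_new = B + np.outer(y - B @ s, s) / (s @ s)    # Broyden ("good") update
    return x_new, B_new
```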
Properties of Quasi-Newton Methods
Convergence Properties
- BFGS and DFP: Locally superlinear convergence for smooth, strongly convex functions
- L-BFGS: Typically only a linear rate, slower than full BFGS, but with much lower memory requirements
- SR1: Can converge quickly because its Hessian approximations are often very accurate, but the update may encounter numerical difficulties
Positive Definiteness
- BFGS: Maintains positive definiteness of $H_k$ if the initial approximation is positive definite and $y_k^\top s_k > 0$ at every iteration
- DFP: Similar to BFGS but more sensitive to line search accuracy
- SR1: Does not guarantee positive definiteness
Advantages
- Efficiency: Avoids the cost of computing and inverting the Hessian
- Superlinear Convergence: Faster than first-order methods
- Robustness: Less sensitive to poor scaling than gradient descent
- Adaptivity: Automatically builds curvature information during optimization
Limitations
- Memory Requirements: Standard BFGS requires $O(n^2)$ storage for the dense approximation matrix
- Initialization Sensitivity: Performance can depend on the initial approximation
- Non-Convex Problems: May struggle with highly non-convex functions
- Numerical Stability: Updates can lead to ill-conditioned approximations
Implementation Considerations
Initial Hessian Approximation
Common choices for the initial inverse Hessian approximation:
- Identity matrix: $H_0 = I$
- Scaled identity: $H_0 = \gamma I$, commonly with $\gamma = \dfrac{y_k^\top s_k}{y_k^\top y_k}$ computed from the most recent $(s_k, y_k)$ pair
- Diagonal approximation based on early curvature information
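As a concrete example, the scaled-identity choice is typically recomputed each iteration from the most recent pair; in code it is a one-liner (this is the `gamma` used for $H_0$ in the two-loop recursion shown later):

```python
def initial_scaling(s, y):
    """Scaled-identity initialization H_0 = gamma * I, with gamma = sᵀy / yᵀy (NumPy arrays assumed)."""
    return (s @ y) / (y @ y)
```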
Curvature Condition
The condition $y_k^\top s_k > 0$ is necessary for positive definiteness of the BFGS update. If this condition is violated:
- Skip the update for that iteration
- Use a damped (Powell-style) update: replace $y_k$ with $\bar{y}_k = \phi_k y_k + (1 - \phi_k) B_k s_k$ for a suitable $\phi_k \in (0, 1]$ (see the sketch after this list)
- Restart with a new initial approximation
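One common formulation of the damped update is Powell's damping with the conventional $0.2\, s_k^\top B_k s_k$ threshold; the sketch below assumes `B` is the Hessian approximation (not its inverse).

```python
import numpy as np

def powell_damped_y(B, s, y, threshold=0.2):
    """Return a damped ȳ with ȳᵀs ≥ threshold · sᵀBs, so the BFGS update stays well defined."""
    sBs = s @ B @ s
    if s @ y >= threshold * sBs:
        return y                                     # curvature condition already strong enough
    phi = (1.0 - threshold) * sBs / (sBs - s @ y)    # interpolation weight in (0, 1)
    return phi * y + (1.0 - phi) * (B @ s)           # ȳ = φ y + (1 − φ) B s
```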
Line Search
Quasi-Newton methods rely on accurate line searches to ensure the curvature information is valid:
- Wolfe conditions are typically used; the curvature (second Wolfe) condition guarantees $y_k^\top s_k > 0$
- Line searches that enforce only sufficient decrease can significantly degrade performance, particularly for DFP
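For example, SciPy's `scipy.optimize.line_search` implements a Wolfe-condition line search that can be dropped into a quasi-Newton loop; a minimal usage sketch (the quadratic objective and starting point here are made up for illustration):

```python
import numpy as np
from scipy.optimize import line_search

# Illustrative quadratic objective and gradient.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x

xk = np.array([1.0, 1.0])
pk = -g(xk)                    # e.g. a quasi-Newton direction -H_k ∇L(θ_k)

# Returns a step length satisfying the Wolfe conditions, or None on failure.
alpha, *_ = line_search(f, g, xk, pk)
print(alpha)
```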
L-BFGS Implementation
An efficient “two-loop recursion” algorithm computes $H_k \nabla L(\theta_k)$ without explicitly forming $H_k$, shown here as a runnable NumPy sketch:
```python
import numpy as np

def two_loop_recursion(grad, s_hist, y_hist, gamma):
    """Compute r = H_k @ grad from the m stored (s_i, y_i) pairs, with H_0 = gamma * I."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # first loop: newest pair to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((rho, alpha))
    r = gamma * q                                           # apply the initial approximation H_0
    for (s, y), (rho, alpha) in zip(zip(s_hist, y_hist), reversed(alphas)):  # second loop: oldest to newest
        beta = rho * (y @ r)
        r += s * (alpha - beta)
    return r                                                # this is H_k ∇L(θ_k)
```
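In practice one rarely hand-rolls this loop; for example, SciPy's L-BFGS-B implementation can be called directly (the Rosenbrock objective below is just a standard illustrative test problem):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its gradient, a standard smooth test problem.
def rosen(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosen_grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_grad, method="L-BFGS-B")
print(res.x)   # ≈ [1., 1.]
```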
Applications
Quasi-Newton methods are widely used in:
- Machine Learning: Training models with moderate numbers of parameters
- Nonlinear Optimization: Engineering design, control systems
- Statistics: Maximum likelihood estimation
- Structural Optimization: Shape and topology optimization
- Energy Minimization: Molecular dynamics, computational chemistry