Quasi-Newton methods are a class of optimization algorithms that approximate the Hessian matrix (or its inverse) to achieve faster convergence than first-order methods while avoiding the computational cost of computing and inverting the exact Hessian as in Newton’s method.

Motivation

Newton’s method has excellent convergence properties but requires:

  1. Computing second derivatives to form the Hessian matrix
  2. Solving a linear system or inverting the Hessian at each iteration

For large-scale problems, these operations can be prohibitively expensive. Quasi-Newton methods address this by:

  • Using only gradient information to build an approximation of the Hessian or its inverse
  • Updating this approximation at each iteration using the observed changes in gradients

General Framework

Quasi-Newton methods follow the update rule:

θ_{k+1} = θ_k - α_k B_k^{-1} ∇L(θ_k)

where:

  • θ_k is the parameter vector at iteration k
  • α_k is the step size determined by line search
  • B_k is an approximation to the Hessian matrix
  • H_k ≈ B_k^{-1} is an approximation to the inverse Hessian

Alternatively, some methods directly approximate the inverse Hessian H_k, giving the update:

θ_{k+1} = θ_k - α_k H_k ∇L(θ_k)

Secant Condition and Curvature Information

Quasi-Newton methods are based on the secant equation, which captures curvature information. Define:

  • s_k = θ_{k+1} - θ_k (change in parameters)
  • y_k = ∇L(θ_{k+1}) - ∇L(θ_k) (change in gradients)

For a quadratic function with Hessian A, the gradient changes satisfy A s_k = y_k exactly.

Quasi-Newton methods update the approximate Hessian (or its inverse) to satisfy:

B_{k+1} s_k = y_k    (equivalently, H_{k+1} y_k = s_k)

This is known as the secant condition or quasi-Newton condition.
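
As a quick numerical check, here is a minimal sketch assuming NumPy; the quadratic L(θ) = 0.5 θ^T A θ and the two points are arbitrary illustrations. It verifies that gradient differences of a quadratic satisfy the secant condition exactly and that the curvature condition holds:

import numpy as np
# Illustrative quadratic L(θ) = 0.5 θ^T A θ with a symmetric positive definite A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
def grad(theta):
    return A @ theta                       # ∇L(θ) = A θ
theta_k  = np.array([1.0, -2.0])
theta_k1 = np.array([0.5, -1.0])
s_k = theta_k1 - theta_k                   # change in parameters
y_k = grad(theta_k1) - grad(theta_k)       # change in gradients
print(np.allclose(A @ s_k, y_k))           # True: the exact Hessian satisfies A s_k = y_k
print(y_k @ s_k > 0)                       # True: curvature condition y_k^T s_k > 0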

Major Quasi-Newton Methods

BFGS (Broyden-Fletcher-Goldfarb-Shanno)

The most popular quasi-Newton method, which directly approximates the inverse Hessian. With ρ_k = 1/(y_k^T s_k), the BFGS update is:

H_{k+1} = (I - ρ_k s_k y_k^T) H_k (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T
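
A minimal NumPy sketch of this update (the function name bfgs_inverse_update is illustrative, not from any particular library). Applying it after each accepted step, starting from H = np.eye(n), gives the algorithm outlined below:

import numpy as np
def bfgs_inverse_update(H, s, y):
    # One BFGS update of the inverse Hessian approximation H using the pair (s, y)
    rho = 1.0 / (y @ s)                    # assumes the curvature condition y^T s > 0
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)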

BFGS Algorithm

  1. Initialize θ_0 and H_0 (often H_0 = I)
  2. For each iteration k:
    • Compute search direction: p_k = -H_k ∇L(θ_k)
    • Determine step size α_k via line search
    • Update parameters: θ_{k+1} = θ_k + α_k p_k
    • Compute s_k = θ_{k+1} - θ_k and y_k = ∇L(θ_{k+1}) - ∇L(θ_k)
    • Update the inverse Hessian approximation H_k using the BFGS formula above
  3. Repeat until convergence
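
In practice a library implementation is usually preferred. For example, SciPy exposes BFGS through scipy.optimize.minimize; the Rosenbrock test function below is only an illustration:

from scipy.optimize import minimize, rosen, rosen_der
result = minimize(rosen, x0=[0.0, 0.0], jac=rosen_der, method="BFGS")
print(result.x, result.nit)   # minimizer (≈ [1, 1]) and number of iterations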

L-BFGS (Limited-memory BFGS)

A memory-efficient variant of BFGS for large-scale problems:

  • Instead of storing the full n×n approximation matrix, L-BFGS stores only the last m pairs (s_i, y_i)
  • The matrix-vector product H_k ∇L(θ_k) is computed implicitly from these vectors (see the two-loop recursion below)
  • Typically m is between 3 and 20, regardless of the problem dimension
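
SciPy's L-BFGS-B implementation exposes the history length m as the maxcor option; the 50-dimensional Rosenbrock problem below is only illustrative:

from scipy.optimize import minimize, rosen, rosen_der
result = minimize(rosen, x0=[0.0] * 50, jac=rosen_der,
                  method="L-BFGS-B", options={"maxcor": 10})
print(result.fun)   # ≈ 0 at the minimizer (1, 1, ..., 1)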

DFP (Davidon-Fletcher-Powell)

An earlier quasi-Newton method that also approximates the inverse Hessian:

H_{k+1} = H_k - (H_k y_k y_k^T H_k) / (y_k^T H_k y_k) + (s_k s_k^T) / (y_k^T s_k)
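
A minimal NumPy sketch of the DFP update (the function name is illustrative):

import numpy as np
def dfp_inverse_update(H, s, y):
    # One DFP update of the inverse Hessian approximation H using the pair (s, y)
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)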

SR1 (Symmetric Rank-One)

A simpler update that satisfies the secant condition. Applied to the Hessian approximation B_k:

B_{k+1} = B_k + ((y_k - B_k s_k)(y_k - B_k s_k)^T) / ((y_k - B_k s_k)^T s_k)

  • SR1 can produce indefinite Hessian approximations
  • Often used in trust region methods rather than line search methods
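
Because the denominator (y_k - B_k s_k)^T s_k can vanish, implementations typically skip the update when it is tiny. A minimal NumPy sketch with this standard safeguard (the function name and the tolerance tol are illustrative):

import numpy as np
def sr1_update(B, s, y, tol=1e-8):
    # One SR1 update of the Hessian approximation B; skipped when the denominator is negligible
    v = y - B @ s
    denom = v @ s
    if abs(denom) < tol * np.linalg.norm(s) * np.linalg.norm(v):
        return B                           # skip the update to avoid numerical blow-up
    return B + np.outer(v, v) / denom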

Broyden’s Method

A generalization for non-symmetric matrices, useful in solving systems of nonlinear equations F(x) = 0:

B_{k+1} = B_k + ((y_k - B_k s_k) s_k^T) / (s_k^T s_k)

Here B_k approximates the Jacobian of F and y_k = F(x_{k+1}) - F(x_k).
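
For root finding, SciPy provides this rank-one ("good Broyden") update as scipy.optimize.broyden1; the toy system F below is purely illustrative:

import numpy as np
from scipy.optimize import broyden1
def F(x):
    # Illustrative nonlinear system F(x) = 0
    return np.array([x[0] - 0.5 * np.cos(x[1]),
                     x[1] - 0.5 * np.sin(x[0])])
root = broyden1(F, [0.0, 0.0])             # Broyden rank-one Jacobian updates
print(F(root))                             # ≈ [0, 0]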

Properties of Quasi-Newton Methods

Convergence Properties

  • BFGS and DFP: Superlinear convergence rate for smooth, strongly convex functions
  • L-BFGS: Slightly slower convergence than BFGS but much lower memory requirements
  • SR1: Can have faster convergence but may encounter numerical difficulties

Positive Definiteness

  • BFGS: Maintains positive definiteness of H_k provided the initial approximation is positive definite and the curvature condition y_k^T s_k > 0 holds at every iteration
  • DFP: Similar to BFGS but more sensitive to line search accuracy
  • SR1: Does not guarantee positive definiteness

Advantages

  1. Efficiency: Avoids the cost of computing and inverting the Hessian
  2. Superlinear Convergence: Faster than first-order methods
  3. Robustness: Less sensitive to poor scaling than gradient descent
  4. Adaptivity: Automatically builds curvature information during optimization

Limitations

  1. Memory Requirements: Standard BFGS requires O(n²) storage for the dense approximation matrix
  2. Initialization Sensitivity: Performance can depend on the initial approximation
  3. Non-Convex Problems: May struggle with highly non-convex functions
  4. Numerical Stability: Updates can lead to ill-conditioned approximations

Implementation Considerations

Initial Hessian Approximation

Common choices for the initial inverse Hessian approximation H_0:

  • Identity matrix: H_0 = I
  • Scaled identity: H_0 = γ I, where γ = (y_k^T s_k) / (y_k^T y_k) is estimated from an early (s_k, y_k) pair
  • Diagonal approximation based on early curvature information

Curvature Condition

The curvature condition y_k^T s_k > 0 is necessary for positive definiteness of the BFGS update. If this condition is violated:

  • Skip the update for that iteration
  • Use a damped update: replace y_k with a mixture of y_k and B_k s_k (Powell's damping; see the sketch after this list)
  • Restart with a new initial approximation
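
A minimal sketch of Powell's damping for the direct Hessian approximation B_k; the constant 0.2 is the conventional threshold and the function name is illustrative:

import numpy as np
def powell_damped_y(B, s, y, c=0.2):
    # Mix y with B s so that the damped vector keeps s^T y positive (B assumed positive definite)
    sBs = s @ B @ s
    sy = s @ y
    if sy >= c * sBs:
        theta = 1.0                        # curvature is acceptable, keep y unchanged
    else:
        theta = (1.0 - c) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)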

Line Search

Quasi-Newton methods rely on accurate line searches to ensure the curvature information is valid:

  • Wolfe conditions are typically used; the curvature (second Wolfe) condition guarantees y_k^T s_k > 0
  • Excessively inexact line searches can significantly degrade performance

L-BFGS Implementation

An efficient “two-loop recursion” algorithm computes the product H_k ∇L(θ_k) without explicitly forming H_k:

q = ∇L(θk)
for i = k-1, k-2, ..., k-m:        # first loop: newest to oldest pair
    ρi = 1/(yi^T si)
    αi = ρi si^T q
    q = q - αi yi
r = H0 q                           # apply the initial approximation H0
for i = k-m, k-m+1, ..., k-1:      # second loop: oldest to newest pair
    β = ρi yi^T r
    r = r + si(αi - β)
return r  # This is Hk ∇L(θk)
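
A direct NumPy translation of the recursion above (the function name lbfgs_direction and the gamma argument for the H0 = γ I scaling are illustrative; s_list and y_list hold the last m pairs, oldest first):

import numpy as np
def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    # Compute Hk @ grad via the two-loop recursion, without forming Hk
    q = np.array(grad, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)              # first loop: newest to oldest pair
        alphas.append(alpha)
        q -= alpha * y
    r = gamma * q                          # apply H0 = gamma * I
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ r)               # second loop: oldest to newest pair
        r += s * (alpha - beta)
    return r                               # equals Hk ∇L(θk); negate it for the search direction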

Applications

Quasi-Newton methods are widely used in:

  • Machine Learning: Training models with moderate numbers of parameters
  • Nonlinear Optimization: Engineering design, control systems
  • Statistics: Maximum likelihood estimation
  • Structural Optimization: Shape and topology optimization
  • Energy Minimization: Molecular dynamics, computational chemistry