Quasi-Newton methods are a class of optimization algorithms that approximate the Hessian matrix (or its inverse) to achieve faster convergence than first-order methods while avoiding the computational cost of computing and inverting the exact Hessian as in Newton’s method.
Motivation
Newton’s method has excellent convergence properties but requires:
- Computing second derivatives to form the Hessian matrix
- Solving a linear system or inverting the Hessian at each iteration
For large-scale problems, these operations can be prohibitively expensive. Quasi-Newton methods address this by:
- Using only gradient information to build an approximation of the Hessian or its inverse
- Updating this approximation at each iteration using the observed changes in gradients
General Framework
Quasi-Newton methods follow the update rule:

$$\theta_{k+1} = \theta_k - \alpha_k B_k^{-1} \nabla L(\theta_k)$$

where:
- $\theta_k$ is the parameter vector at iteration $k$
- $\alpha_k$ is the step size determined by line search
- $B_k$ is an approximation to the Hessian matrix $\nabla^2 L(\theta_k)$
- $B_k^{-1}$ is an approximation to the inverse Hessian
Alternatively, some methods directly approximate the inverse Hessian, denoted as $H_k \approx B_k^{-1}$:

$$\theta_{k+1} = \theta_k - \alpha_k H_k \nabla L(\theta_k)$$
Secant Condition and Curvature Information
Quasi-Newton methods are based on the secant equation, which captures curvature information from successive gradient evaluations.
Define:
- $s_k = \theta_{k+1} - \theta_k$ (change in parameters)
- $y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k)$ (change in gradients)
For a quadratic function, the exact Hessian would satisfy $\nabla^2 L \, s_k = y_k$.
Quasi-Newton methods update the approximate Hessian (or its inverse) to satisfy:

$$B_{k+1} s_k = y_k \qquad \text{(equivalently, } H_{k+1} y_k = s_k\text{)}$$

This is known as the secant condition or quasi-Newton condition.
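To see why this is a natural requirement: for a quadratic objective the exact Hessian already satisfies the secant condition, which the short NumPy check below (with an arbitrary illustrative matrix $A$ and points) confirms.

```python
import numpy as np

# Quadratic L(θ) = ½ θᵀAθ with an arbitrary symmetric positive-definite A (illustrative values).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda theta: A @ theta            # ∇L(θ) = Aθ

theta_k, theta_k1 = np.array([1.0, -2.0]), np.array([0.5, 0.3])
s_k = theta_k1 - theta_k                  # change in parameters
y_k = grad(theta_k1) - grad(theta_k)      # change in gradients

print(np.allclose(y_k, A @ s_k))          # True: the exact Hessian satisfies B s_k = y_k
```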
Major Quasi-Newton Methods
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
The most popular quasi-Newton method, which directly approximates the inverse Hessian:

$$H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k}$$
BFGS Algorithm
- Initialize $\theta_0$ and an inverse Hessian approximation $H_0$ (often $H_0 = I$)
- For each iteration $k$:
  - Compute search direction: $d_k = -H_k \nabla L(\theta_k)$
  - Determine step size $\alpha_k$ via line search
  - Update parameters: $\theta_{k+1} = \theta_k + \alpha_k d_k$
  - Compute $s_k = \theta_{k+1} - \theta_k$ and $y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k)$
  - Update the inverse Hessian approximation to $H_{k+1}$ using the BFGS formula above
- Repeat until convergence
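As a concrete illustration, a minimal dense-BFGS loop might look like the sketch below; the objective `f`, gradient `grad`, and the simple Armijo backtracking are placeholder choices, not a production implementation.

```python
import numpy as np

def bfgs(f, grad, theta0, max_iter=100, tol=1e-6):
    """Minimal BFGS sketch: dense inverse-Hessian update with Armijo backtracking."""
    theta = np.asarray(theta0, dtype=float)
    n = theta.size
    H, g = np.eye(n), grad(theta)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                                   # search direction d_k = -H_k ∇L(θ_k)
        alpha = 1.0                                  # backtracking: sufficient decrease only
        while alpha > 1e-12 and f(theta + alpha * d) > f(theta) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        theta_new = theta + alpha * d
        g_new = grad(theta_new)
        s, y = theta_new - theta, g_new - g
        if y @ s > 1e-10:                            # only update if the curvature condition holds
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS inverse-Hessian update
        theta, g = theta_new, g_new
    return theta
```

Only Armijo sufficient decrease is enforced here, so the curvature check guards the update; a Wolfe line search (see the Line Search subsection below) would guarantee $y_k^\top s_k > 0$ directly.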
L-BFGS (Limited-memory BFGS)
A memory-efficient variant of BFGS for large-scale problems:
- Instead of storing the full $n \times n$ approximation matrix, L-BFGS stores only the last $m$ pairs of $(s_i, y_i)$ vectors
- The matrix-vector product $H_k \nabla L(\theta_k)$ is computed implicitly using these vectors
- Typically $m$ is between 3 and 20, regardless of the problem dimension
DFP (Davidon-Fletcher-Powell)
An earlier quasi-Newton method that also approximates the inverse Hessian:

$$H_{k+1} = H_k - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \frac{s_k s_k^\top}{y_k^\top s_k}$$
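A direct transcription of this formula, as a sketch in the same notation as the BFGS example above:

```python
import numpy as np

def dfp_update(H, s, y):
    """DFP update of the inverse-Hessian approximation H, given s = Δθ and y = Δ∇L."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```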
SR1 (Symmetric Rank-One)
A simpler update that satisfies the secant condition:

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^\top}{(y_k - B_k s_k)^\top s_k}$$
- SR1 can produce indefinite Hessian approximations
- Often used in trust region methods rather than line search methods
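A sketch of the SR1 update with the usual skip rule: the denominator $(y_k - B_k s_k)^\top s_k$ can be arbitrarily small, so the update is skipped when it nearly vanishes (the $10^{-8}$ threshold here is a conventional choice, not mandated by the method).

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """SR1 update of the Hessian approximation B; skipped when the denominator is near zero."""
    v = y - B @ s
    denom = v @ s
    if abs(denom) < eps * np.linalg.norm(v) * np.linalg.norm(s):
        return B                       # standard safeguard: skip the update this iteration
    return B + np.outer(v, v) / denom
```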
Broyden’s Method
A generalization for non-symmetric matrices, useful in solving systems of nonlinear equations:

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^\top}{s_k^\top s_k}$$
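In the nonlinear-equations setting, $B_k$ approximates the Jacobian of $F$ in the system $F(x) = 0$ and $y_k = F(x_{k+1}) - F(x_k)$. A minimal sketch of one Broyden iteration under those assumptions:

```python
import numpy as np

def broyden_step(F, x, B):
    """One Broyden iteration for F(x) = 0: solve, step, then rank-one Jacobian update."""
    s = np.linalg.solve(B, -F(x))                   # quasi-Newton step: B s = -F(x)
    x_new = x + s
    y = F(x_new) - F(x)
    B_new = B + np.outer(y - B @ s, s) / (s @ s)    # Broyden ("good") update
    return x_new, B_new
```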
Properties of Quasi-Newton Methods
Convergence Properties
- BFGS and DFP: Locally superlinear convergence for smooth, strongly convex functions
- L-BFGS: Typically only a linear rate, slower than full BFGS, but with much lower memory requirements
- SR1: Can converge quickly because its Hessian approximations are often very accurate, but the update may encounter numerical difficulties
Positive Definiteness
- BFGS: Maintains positive definiteness of $H_k$ if the initial approximation is positive definite and $y_k^\top s_k > 0$ at every iteration
- DFP: Similar to BFGS but more sensitive to line search accuracy
- SR1: Does not guarantee positive definiteness
Advantages
- Efficiency: Avoids the cost of computing and inverting the Hessian
- Superlinear Convergence: Faster than first-order methods
- Robustness: Less sensitive to poor scaling than gradient descent
- Adaptivity: Automatically builds curvature information during optimization
Limitations
- Memory Requirements: Standard BFGS requires $O(n^2)$ storage for the dense approximation matrix
- Initialization Sensitivity: Performance can depend on the initial approximation
- Non-Convex Problems: May struggle with highly non-convex functions
- Numerical Stability: Updates can lead to ill-conditioned approximations
Implementation Considerations
Initial Hessian Approximation
Common choices for the initial inverse Hessian approximation:
- Identity matrix: $H_0 = I$
- Scaled identity: $H_0 = \gamma I$, commonly with $\gamma = \dfrac{y_k^\top s_k}{y_k^\top y_k}$ computed from the most recent $(s_k, y_k)$ pair
- Diagonal approximation based on early curvature information
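As a concrete example, the scaled-identity choice is typically recomputed each iteration from the most recent pair; in code it is a one-liner (this is the `gamma` used for $H_0$ in the two-loop recursion shown later):

```python
def initial_scaling(s, y):
    """Scaled-identity initialization H_0 = gamma * I, with gamma = sᵀy / yᵀy (NumPy arrays assumed)."""
    return (s @ y) / (y @ y)
```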
Curvature Condition
The condition $y_k^\top s_k > 0$ is necessary for positive definiteness of the BFGS update. If this condition is violated:
- Skip the update for that iteration
- Use a damped (Powell-style) update: replace $y_k$ with $\bar{y}_k = \phi_k y_k + (1 - \phi_k) B_k s_k$ for a suitable $\phi_k \in (0, 1]$ (see the sketch after this list)
- Restart with a new initial approximation
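One common formulation of the damped update is Powell's damping with the conventional $0.2\, s_k^\top B_k s_k$ threshold; the sketch below assumes `B` is the Hessian approximation (not its inverse).

```python
import numpy as np

def powell_damped_y(B, s, y, threshold=0.2):
    """Return a damped ȳ with ȳᵀs ≥ threshold · sᵀBs, so the BFGS update stays well defined."""
    sBs = s @ B @ s
    if s @ y >= threshold * sBs:
        return y                                     # curvature condition already strong enough
    phi = (1.0 - threshold) * sBs / (sBs - s @ y)    # interpolation weight in (0, 1)
    return phi * y + (1.0 - phi) * (B @ s)           # ȳ = φ y + (1 − φ) B s
```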
Line Search
Quasi-Newton methods rely on accurate line searches to ensure the curvature information is valid:
- Wolfe conditions are typically used; the curvature (second Wolfe) condition guarantees $y_k^\top s_k > 0$
- Line searches that enforce only sufficient decrease can significantly degrade performance, particularly for DFP
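For example, SciPy's `scipy.optimize.line_search` implements a Wolfe-condition line search that can be dropped into a quasi-Newton loop; a minimal usage sketch (the quadratic objective and starting point here are made up for illustration):

```python
import numpy as np
from scipy.optimize import line_search

# Illustrative quadratic objective and gradient.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x

xk = np.array([1.0, 1.0])
pk = -g(xk)                    # e.g. a quasi-Newton direction -H_k ∇L(θ_k)

# Returns a step length satisfying the Wolfe conditions, or None on failure.
alpha, *_ = line_search(f, g, xk, pk)
print(alpha)
```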
L-BFGS Implementation
An efficient “two-loop recursion” algorithm computes $H_k \nabla L(\theta_k)$ without explicitly forming $H_k$, shown here as a runnable NumPy sketch:
```python
import numpy as np

def two_loop_recursion(grad, s_hist, y_hist, gamma):
    """Compute r = H_k @ grad from the m stored (s_i, y_i) pairs, with H_0 = gamma * I."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # first loop: newest pair to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((rho, alpha))
    r = gamma * q                                           # apply the initial approximation H_0
    for (s, y), (rho, alpha) in zip(zip(s_hist, y_hist), reversed(alphas)):  # second loop: oldest to newest
        beta = rho * (y @ r)
        r += s * (alpha - beta)
    return r                                                # this is H_k ∇L(θ_k)
```
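In practice one rarely hand-rolls this loop; for example, SciPy's L-BFGS-B implementation can be called directly (the Rosenbrock objective below is just a standard illustrative test problem):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its gradient, a standard smooth test problem.
def rosen(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosen_grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0]**2)])

res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_grad, method="L-BFGS-B")
print(res.x)   # ≈ [1., 1.]
```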
Applications
Quasi-Newton methods are widely used in:
- Machine Learning: Training models with moderate numbers of parameters
- Nonlinear Optimization: Engineering design, control systems
- Statistics: Maximum likelihood estimation
- Structural Optimization: Shape and topology optimization
- Energy Minimization: Molecular dynamics, computational chemistry