The Newton-Raphson method (often simply called Newton’s method) is a powerful optimization technique that uses both first and second derivatives to find the minimum or maximum of a function. It is named after Isaac Newton and Joseph Raphson.
Principle
Newton’s method is based on using a quadratic approximation of the objective function at each iteration. For a function $f(\mathbf{x})$, the quadratic approximation around the current point $\mathbf{x}_k$ is:

$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_k)^T H(\mathbf{x}_k) (\mathbf{x} - \mathbf{x}_k)$$

where $\nabla f(\mathbf{x}_k)$ is the gradient and $H(\mathbf{x}_k)$ is the Hessian matrix, both evaluated at $\mathbf{x}_k$.
Algorithm
Univariate Case (D=1)
For a function $f(x)$ with scalar variable $x$:
- Start with an initial guess $x_0$
- At each iteration $k$:
  - Compute the first derivative $f'(x_k)$
  - Compute the second derivative $f''(x_k)$
  - Update: $x_{k+1} = x_k - \dfrac{f'(x_k)}{f''(x_k)}$
- Stop when $|f'(x_k)| < \epsilon$ or $|x_{k+1} - x_k| < \epsilon$
The geometric interpretation is that we find the minimum of the parabola that approximates the function at the current point.
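A minimal sketch of the univariate iteration is shown below. It assumes the first and second derivatives are available as callables; the function name, tolerance, and iteration cap are illustrative choices rather than any standard API.

```python
import math

def newton_univariate(fprime, fsecond, x0, tol=1e-8, max_iter=100):
    """Minimize a univariate function via x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        g = fprime(x)
        if abs(g) < tol:              # stationary point reached: stop
            break
        x_new = x - g / fsecond(x)    # jump to the vertex of the local parabola
        if abs(x_new - x) < tol:      # step is negligible: stop
            return x_new
        x = x_new
    return x

# Example: f(x) = exp(x) - 2x has its minimizer at x = ln 2
x_star = newton_univariate(lambda x: math.exp(x) - 2.0,   # f'(x)
                           lambda x: math.exp(x),         # f''(x)
                           x0=0.0)
print(x_star, math.log(2))  # both approximately 0.6931
```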
Multivariate Case (D≥2)
For a function $f(\mathbf{x})$ with vector variable $\mathbf{x} \in \mathbb{R}^D$:
- Start with an initial guess $\mathbf{x}_0$
- At each iteration $k$:
  - Compute the gradient $\nabla f(\mathbf{x}_k)$
  - Compute the Hessian $H(\mathbf{x}_k)$
  - Solve the linear system $H(\mathbf{x}_k)\,\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$ for the Newton direction $\mathbf{d}_k$
  - Update: $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}_k$
- Stop when $\|\nabla f(\mathbf{x}_k)\| < \epsilon$ or $\|\mathbf{x}_{k+1} - \mathbf{x}_k\| < \epsilon$
This can also be written as $\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1}\nabla f(\mathbf{x}_k)$, though directly computing the inverse is generally avoided in practice due to numerical considerations.
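A corresponding multivariate sketch, assuming NumPy and user-supplied gradient and Hessian callables (the names `grad` and `hess` are placeholders, not a fixed interface):

```python
import numpy as np

def newton_multivariate(grad, hess, x0, tol=1e-8, max_iter=100):
    """Pure Newton iteration: solve H(x_k) d_k = -grad f(x_k), then step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # gradient small enough: stop
            break
        d = np.linalg.solve(hess(x), -g)     # Newton direction, no explicit inverse
        x = x + d
    return x

# Example: f(x, y) = exp(x) + exp(y) - x - 2y has its minimum at (0, ln 2)
grad = lambda v: np.array([np.exp(v[0]) - 1.0, np.exp(v[1]) - 2.0])
hess = lambda v: np.diag([np.exp(v[0]), np.exp(v[1])])
print(newton_multivariate(grad, hess, x0=[1.0, 0.0]))  # ~[0.0, 0.6931]
```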
Derivation
The derivation comes from setting the gradient of the quadratic approximation to zero:

$$\nabla f(\mathbf{x}_k) + H(\mathbf{x}_k)(\mathbf{x} - \mathbf{x}_k) = 0$$

Solving for $\mathbf{x}$ gives:

$$\mathbf{x} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1}\nabla f(\mathbf{x}_k)$$

This becomes the update rule for the next iterate $\mathbf{x}_{k+1}$.
Convergence Properties
Newton’s method has excellent convergence properties when started sufficiently close to a solution:
- Quadratic convergence: the error approximately squares at each iteration, i.e. $\|\mathbf{x}_{k+1} - \mathbf{x}^*\| \le C\,\|\mathbf{x}_k - \mathbf{x}^*\|^2$ near the minimizer $\mathbf{x}^*$ (illustrated numerically below)
- Typically requires fewer iterations than gradient-based methods
- Invariant to linear transformations of parameters (unlike gradient descent)
However, convergence is only guaranteed under certain conditions:
- The Hessian must be positive definite at each iteration (for minimization)
- The initial point must be sufficiently close to a local minimum
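The quadratic convergence claim can be checked numerically. The snippet below is a small demo under the assumption that $f(x) = e^x - 2x$, with minimizer $\ln 2$, is an acceptable test function; the printed errors roughly square from one iteration to the next.

```python
import math

x, x_star = 0.0, math.log(2)    # start at 0; the true minimizer is ln 2
for k in range(5):
    print(f"iteration {k}: error = {abs(x - x_star):.2e}")
    x -= (math.exp(x) - 2.0) / math.exp(x)   # Newton update for f(x) = exp(x) - 2x
```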
Modified Newton Methods
To address potential issues with the basic Newton method:
Line Search
Introduce a step size parameter $\alpha_k$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\,\mathbf{d}_k, \qquad H(\mathbf{x}_k)\,\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$$

The step size $\alpha_k$ is chosen to ensure:
- Descent property: $f(\mathbf{x}_{k+1}) < f(\mathbf{x}_k)$
- Appropriate step length via the Wolfe conditions or the Armijo rule
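A sketch of Newton’s method with a backtracking Armijo line search follows. It assumes the Hessian is positive definite so the Newton direction is a descent direction; the constants `c1` and `shrink` are conventional but otherwise arbitrary choices.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=100, c1=1e-4, shrink=0.5):
    """Newton direction plus backtracking line search (Armijo sufficient decrease)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)          # Newton direction
        alpha = 1.0                               # always try the full step first
        # Shrink alpha until f decreases by at least c1 * alpha * g.d
        while f(x + alpha * d) > f(x) + c1 * alpha * (g @ d) and alpha > 1e-12:
            alpha *= shrink
        x = x + alpha * d
    return x
```

Trying $\alpha_k = 1$ first matters: near the solution the full Newton step is eventually accepted, which preserves the quadratic convergence rate.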
Hessian Modification
When the Hessian is not positive definite:
- Add a positive diagonal matrix: use $H(\mathbf{x}_k) + \lambda I$ with $\lambda > 0$, as in the Levenberg-Marquardt method (sketched after this list)
- Perform eigenvalue modification to ensure positive definiteness
- Use trust region approaches
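The first strategy, adding a multiple of the identity until the matrix factorizes, can be sketched as follows; the starting shift and growth factor are illustrative.

```python
import numpy as np

def make_positive_definite(H, beta=1e-3, max_tries=60):
    """Return H + lam*I for the smallest lam in a geometric schedule
    for which the Cholesky factorization succeeds."""
    lam = 0.0
    I = np.eye(H.shape[0])
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(H + lam * I)   # succeeds iff H + lam*I is positive definite
            return H + lam * I
        except np.linalg.LinAlgError:
            lam = beta if lam == 0.0 else 10.0 * lam   # increase the shift and retry
    return H + lam * I
```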
Advantages
- Rapid convergence near the solution
- Scale-invariant (unlike gradient descent)
- Typically requires fewer iterations than first-order methods
- Particularly effective for quadratic or nearly quadratic functions
Disadvantages
- Requires computation of second derivatives
- Expensive for large-scale problems (O(D³) for the linear solve)
- May diverge if started far from the solution
- Not guaranteed to converge for non-convex functions
Example: Newton’s Method for a Quadratic Function
For a quadratic function $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x} + c$:
- Gradient: $\nabla f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$
- Hessian: $H(\mathbf{x}) = A$ (constant)
Newton’s method becomes:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - A^{-1}(A\mathbf{x}_k + \mathbf{b}) = -A^{-1}\mathbf{b}$$

For a positive definite $A$, Newton’s method finds the exact minimum in a single step!
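The one-step behavior is easy to verify numerically; the matrix and starting point below are arbitrary choices.

```python
import numpy as np

# f(x) = 0.5 x^T A x + b^T x with a symmetric positive definite A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x0 = np.array([10.0, -10.0])              # any starting point
x1 = x0 - np.linalg.solve(A, A @ x0 + b)  # one Newton step

print(x1)                      # equals the exact minimizer...
print(np.linalg.solve(A, -b))  # ...namely -A^{-1} b
```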
Implementation Considerations
- For large-scale problems, avoid explicit Hessian computation and inversion
- Use iterative methods such as conjugate gradient to solve the Newton linear system
- Consider quasi-Newton methods such as BFGS that approximate the Hessian (see the sketch after this list)
- Use careful line search to ensure convergence
- Monitor the conditioning of the Hessian
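In practice these considerations are often delegated to a library. Assuming SciPy is available, the sketch below compares Newton-CG, which solves the Newton system iteratively with conjugate gradients, and BFGS, a quasi-Newton method that builds a Hessian approximation from gradient differences, on the Rosenbrock function.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])

# Newton-CG: the Hessian is only used through the Newton linear system, solved by CG.
res_newton_cg = minimize(rosen, x0, method='Newton-CG', jac=rosen_der, hess=rosen_hess)

# BFGS: no second derivatives required; the Hessian is approximated internally.
res_bfgs = minimize(rosen, x0, method='BFGS', jac=rosen_der)

print(res_newton_cg.x)  # both should approach the minimizer [1., 1.]
print(res_bfgs.x)
```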