The Newton-Raphson method (often simply called Newton’s method) is a powerful optimization technique that uses both first and second derivatives to find the minimum or maximum of a function. It is named after Isaac Newton and Joseph Raphson.

Principle

Newton’s method is based on using a quadratic approximation of the objective function at each iteration. For a function $f(\mathbf{x})$, the quadratic approximation around the current point $\mathbf{x}_k$ is:

$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^\top (\mathbf{x} - \mathbf{x}_k) + \tfrac{1}{2}\,(\mathbf{x} - \mathbf{x}_k)^\top \mathbf{H}(\mathbf{x}_k)\,(\mathbf{x} - \mathbf{x}_k)$

where $\nabla f(\mathbf{x}_k)$ is the gradient and $\mathbf{H}(\mathbf{x}_k)$ is the Hessian matrix, both evaluated at $\mathbf{x}_k$.

Algorithm

Univariate Case (D=1)

For a function $f(x)$ with scalar variable $x$:

  1. Start with an initial guess $x_0$
  2. At each iteration $k$:
    • Compute the first derivative $f'(x_k)$
    • Compute the second derivative $f''(x_k)$
    • Update: $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$
  3. Stop when $|f'(x_k)| < \epsilon$ or $|x_{k+1} - x_k| < \epsilon$

The geometric interpretation is that, at each step, we jump to the stationary point (the minimum, when $f''(x_k) > 0$) of the parabola that approximates the function at the current point.
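
The univariate update can be coded directly. Below is a minimal sketch in Python; the test function $f(x) = x^4 - 3x^3 + 2$, its derivatives, the tolerance, and the iteration cap are illustrative assumptions rather than part of the text above.

    def newton_1d(f_prime, f_double_prime, x0, tol=1e-8, max_iter=100):
        """Minimize a univariate function with Newton's method,
        given callables for its first and second derivatives."""
        x = x0
        for _ in range(max_iter):
            g = f_prime(x)          # f'(x_k)
            h = f_double_prime(x)   # f''(x_k)
            x_new = x - g / h       # Newton update
            if abs(g) < tol or abs(x_new - x) < tol:
                return x_new
            x = x_new
        return x

    # Example (assumed): minimize f(x) = x^4 - 3x^3 + 2
    f_prime = lambda x: 4 * x**3 - 9 * x**2
    f_double_prime = lambda x: 12 * x**2 - 18 * x
    x_min = newton_1d(f_prime, f_double_prime, x0=4.0)   # converges to x = 2.25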

Multivariate Case (D≥2)

For a function $f(\mathbf{x})$ with vector variable $\mathbf{x} \in \mathbb{R}^D$:

  1. Start with an initial guess $\mathbf{x}_0$
  2. At each iteration $k$:
    • Compute the gradient $\nabla f(\mathbf{x}_k)$
    • Compute the Hessian $\mathbf{H}(\mathbf{x}_k)$
    • Solve the linear system: $\mathbf{H}(\mathbf{x}_k)\,\mathbf{d}_k = -\nabla f(\mathbf{x}_k)$
    • Update: $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{d}_k$
  3. Stop when $\|\nabla f(\mathbf{x}_k)\| < \epsilon$ or $\|\mathbf{x}_{k+1} - \mathbf{x}_k\| < \epsilon$

This can also be written as $\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{H}(\mathbf{x}_k)^{-1}\,\nabla f(\mathbf{x}_k)$, though directly computing the inverse is generally avoided in practice due to numerical considerations.
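
These steps translate into a short NumPy routine. The following is a minimal sketch; the gradient and Hessian callables, the tolerance, and the test problem are assumptions chosen for illustration.

    import numpy as np

    def newton_method(grad, hess, x0, tol=1e-8, max_iter=50):
        """Multivariate Newton's method: at each step solve H d = -g
        rather than forming the inverse Hessian explicitly."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = np.linalg.solve(hess(x), -g)   # Newton direction
            x = x + d
        return x

    # Example (assumed): minimize f(x, y) = (x - 1)^2 + 10 (y + 2)^2
    grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
    x_star = newton_method(grad, hess, x0=[5.0, 5.0])   # approximately [1, -2]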

Derivation

The derivation comes from setting the gradient of the quadratic approximation to zero:

$\nabla f(\mathbf{x}_k) + \mathbf{H}(\mathbf{x}_k)(\mathbf{x} - \mathbf{x}_k) = \mathbf{0}$

Solving for $\mathbf{x}$ gives:

$\mathbf{x} = \mathbf{x}_k - \mathbf{H}(\mathbf{x}_k)^{-1}\,\nabla f(\mathbf{x}_k)$

This becomes the update rule for the next iterate $\mathbf{x}_{k+1}$.

Convergence Properties

Newton’s method has excellent convergence properties when started sufficiently close to a solution:

  • Quadratic convergence: the error is approximately squared at each iteration (see the relation below)
  • Typically requires fewer iterations than gradient-based methods
  • Invariant to linear transformations of parameters (unlike gradient descent)
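
Concretely, near a local minimizer $\mathbf{x}^*$ with a nonsingular, Lipschitz-continuous Hessian, the iterates satisfy a bound of the following form, where the constant $C$ depends on the problem:

$\|\mathbf{x}_{k+1} - \mathbf{x}^*\| \leq C\,\|\mathbf{x}_k - \mathbf{x}^*\|^2$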

However, convergence is only guaranteed under certain conditions:

  • The Hessian must be positive definite at each iteration (for minimization)
  • The initial point must be sufficiently close to a local minimum

Modified Newton Methods

To address potential issues with the basic Newton method:

Damped Newton Method

Introduce a step size parameter $\eta_k$:

$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta_k\,\mathbf{H}(\mathbf{x}_k)^{-1}\,\nabla f(\mathbf{x}_k)$

The step size $\eta_k$ is chosen to ensure:

  • Descent property: $f(\mathbf{x}_{k+1}) < f(\mathbf{x}_k)$
  • Appropriate step length via the Wolfe conditions or the Armijo rule (a backtracking sketch follows this list)
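
The sketch below pairs the Newton direction with a backtracking (Armijo) line search; the constants c and rho, the tolerance, and the example objective are assumptions made for illustration.

    import numpy as np

    def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=100, c=1e-4, rho=0.5):
        """Newton's method with a backtracking (Armijo) line search."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = np.linalg.solve(hess(x), -g)   # Newton direction
            eta = 1.0
            # Shrink eta until the Armijo sufficient-decrease condition holds
            while f(x + eta * d) > f(x) + c * eta * (g @ d) and eta > 1e-12:
                eta *= rho
            x = x + eta * d
        return x

    # Example (assumed): minimize f(x) = x1^4 + x2^2
    f = lambda x: x[0]**4 + x[1]**2
    grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
    hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
    x_star = damped_newton(f, grad, hess, x0=[2.0, 3.0])   # approaches [0, 0]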

Hessian Modification

When the Hessian is not positive definite:

  • Add a positive diagonal matrix: $\mathbf{H}(\mathbf{x}_k) + \lambda\mathbf{I}$ with $\lambda > 0$ (the damping strategy used in the Levenberg-Marquardt method); see the sketch after this list
  • Perform eigenvalue modification to ensure positive definiteness
  • Use trust region approaches
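
One common way to implement the diagonal modification is to increase a damping parameter until a Cholesky factorization succeeds. The sketch below follows that approach; the initial damping value and growth factor are assumptions.

    import numpy as np

    def modified_newton_step(H, g, lam0=1e-4, growth=10.0):
        """Return a direction d solving (H + lam I) d = -g, where lam is
        increased until H + lam I is positive definite (Cholesky succeeds)."""
        lam = 0.0
        I = np.eye(H.shape[0])
        while True:
            try:
                L = np.linalg.cholesky(H + lam * I)   # raises LinAlgError if not PD
                break
            except np.linalg.LinAlgError:
                lam = lam0 if lam == 0.0 else lam * growth
        # Solve (H + lam I) d = -g via the Cholesky factor: L y = -g, L^T d = y
        y = np.linalg.solve(L, -g)
        return np.linalg.solve(L.T, y)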

Advantages

  • Rapid convergence near the solution
  • Scale-invariant (unlike gradient descent)
  • Typically requires fewer iterations than first-order methods
  • Particularly effective for quadratic or nearly quadratic functions

Disadvantages

  • Requires computation of second derivatives
  • Expensive for large-scale problems (O(D³) for the linear solve)
  • May diverge if started far from the solution
  • Not guaranteed to converge for non-convex functions

Example: Newton’s Method for a Quadratic Function

For a quadratic function $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top\mathbf{A}\mathbf{x} - \mathbf{b}^\top\mathbf{x} + c$:

  • Gradient: $\nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$
  • Hessian: $\mathbf{H}(\mathbf{x}) = \mathbf{A}$ (constant)

Newton’s method becomes:

$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{A}^{-1}(\mathbf{A}\mathbf{x}_k - \mathbf{b}) = \mathbf{A}^{-1}\mathbf{b}$

For a positive definite $\mathbf{A}$, Newton’s method finds the exact minimum $\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$ in a single step!
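
A quick numerical check of the one-step property, using an arbitrarily chosen positive definite matrix and right-hand side:

    import numpy as np

    # Assumed test problem: f(x) = 0.5 x^T A x - b^T x with A positive definite
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    x0 = np.array([10.0, -7.0])            # arbitrary starting point
    g0 = A @ x0 - b                         # gradient at x0
    x1 = x0 + np.linalg.solve(A, -g0)       # one Newton step

    print(np.allclose(x1, np.linalg.solve(A, b)))   # True: x1 is the exact minimizer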

Implementation Considerations

  • For large-scale problems, avoid explicit Hessian computation and inversion
  • Use iterative methods (e.g., conjugate gradient) to solve the linear system (see the sketch after this list)
  • Consider quasi-Newton methods that approximate the Hessian
  • Use careful line search to ensure convergence
  • Monitor the conditioning of the Hessian
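
As an illustration of the first two points, the Newton direction can be computed with the conjugate gradient method using only Hessian-vector products, so the Hessian is never formed explicitly. This is a minimal sketch; the function name, tolerance, and example problem are assumptions.

    import numpy as np

    def newton_cg_direction(hess_vec, g, tol=1e-10, max_iter=None):
        """Approximately solve H d = -g with conjugate gradients, using only
        Hessian-vector products hess_vec(v) = H @ v."""
        n = g.shape[0]
        max_iter = max_iter or 10 * n
        d = np.zeros(n)
        r = -g.copy()          # residual of H d = -g at d = 0
        p = r.copy()
        rs_old = r @ r
        for _ in range(max_iter):
            Hp = hess_vec(p)
            alpha = rs_old / (p @ Hp)
            d += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return d

    # Example (assumed): matrix-free Hessian-vector product for a small quadratic
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    g = np.array([1.0, 2.0])
    d = newton_cg_direction(lambda v: A @ v, g)   # close to np.linalg.solve(A, -g)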