The Gauss-Newton method is a specialized optimization algorithm primarily used for solving nonlinear least squares problems. It’s particularly effective for parameter estimation in curve fitting and model fitting applications, where we need to minimize the sum of squared residuals between observed data and model predictions.

Problem Formulation

Nonlinear Least Squares

Consider a nonlinear model function $f(x; \boldsymbol{\theta})$ with parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^T$, and a set of observed data points $(x_i, y_i)$ for $i = 1, \ldots, m$. The residuals are:

$$r_i(\boldsymbol{\theta}) = y_i - f(x_i; \boldsymbol{\theta}), \qquad i = 1, \ldots, m$$

The nonlinear least squares problem aims to find the parameters that minimize the sum of squared residuals:

$$\min_{\boldsymbol{\theta}} \; S(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{m} r_i(\boldsymbol{\theta})^2 = \frac{1}{2} \|\mathbf{r}(\boldsymbol{\theta})\|^2$$

where $\mathbf{r}(\boldsymbol{\theta}) = (r_1(\boldsymbol{\theta}), \ldots, r_m(\boldsymbol{\theta}))^T$ is the residual vector.
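
As a concrete sketch, the residual vector and objective for a hypothetical exponential-decay model $f(x; \boldsymbol{\theta}) = \theta_1 e^{-\theta_2 x}$ (the model choice here is illustrative, not part of the formulation above) can be coded as:

import numpy as np

def model(x, theta):
    # Hypothetical model: f(x; theta) = theta_1 * exp(-theta_2 * x)
    return theta[0] * np.exp(-theta[1] * x)

def residuals(theta, x_data, y_data):
    # r_i(theta) = y_i - f(x_i; theta)
    return y_data - model(x_data, theta)

def objective(theta, x_data, y_data):
    # S(theta) = 0.5 * sum of squared residuals
    r = residuals(theta, x_data, y_data)
    return 0.5 * np.sum(r**2)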

Algorithm Description

The Gauss-Newton method is an iterative procedure that approximates the nonlinear least squares problem with a linear least squares problem at each iteration.

Linearization of the Residual

At each iteration $k$, we linearize the residual function around the current parameter estimate $\boldsymbol{\theta}^{(k)}$:

$$r_i(\boldsymbol{\theta}) \approx r_i(\boldsymbol{\theta}^{(k)}) + \nabla r_i(\boldsymbol{\theta}^{(k)})^T (\boldsymbol{\theta} - \boldsymbol{\theta}^{(k)})$$

where $\nabla r_i(\boldsymbol{\theta}^{(k)})$ is the gradient of the $i$-th residual with respect to $\boldsymbol{\theta}$, evaluated at $\boldsymbol{\theta}^{(k)}$.

Jacobian Matrix

The gradients of all residuals form the Jacobian matrix $\mathbf{J}(\boldsymbol{\theta}) \in \mathbb{R}^{m \times p}$:

$$\mathbf{J}(\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{\partial r_1}{\partial \theta_1} & \cdots & \dfrac{\partial r_1}{\partial \theta_p} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial r_m}{\partial \theta_1} & \cdots & \dfrac{\partial r_m}{\partial \theta_p} \end{bmatrix}$$

For models of the form $r_i(\boldsymbol{\theta}) = y_i - f(x_i; \boldsymbol{\theta})$, the Jacobian components are:

$$J_{ij} = \frac{\partial r_i}{\partial \theta_j} = -\frac{\partial f(x_i; \boldsymbol{\theta})}{\partial \theta_j}$$

Update Rule

At each iteration, the Gauss-Newton step is:

$$\Delta\boldsymbol{\theta} = -(\mathbf{J}^T \mathbf{J})^{-1} \mathbf{J}^T \mathbf{r}$$

where both $\mathbf{J}$ and $\mathbf{r}$ are evaluated at the current parameter estimate $\boldsymbol{\theta}^{(k)}$.

The new parameter estimate becomes:

$$\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \Delta\boldsymbol{\theta}$$

Algorithm Steps

  1. Choose an initial parameter estimate $\boldsymbol{\theta}^{(0)}$
  2. For iteration $k = 0, 1, 2, \ldots$ until convergence:
    • Compute the residual vector $\mathbf{r}(\boldsymbol{\theta}^{(k)})$
    • Compute the Jacobian matrix $\mathbf{J}(\boldsymbol{\theta}^{(k)})$
    • Solve the normal equations $\mathbf{J}^T \mathbf{J}\, \Delta\boldsymbol{\theta} = -\mathbf{J}^T \mathbf{r}$ for $\Delta\boldsymbol{\theta}$
    • Update: $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \Delta\boldsymbol{\theta}$
    • Check convergence criteria
  3. Return the final parameter estimate
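
A minimal sketch of this loop (a fuller implementation with a numerical Jacobian appears in the Implementation Example section) could solve the linearized subproblem directly with np.linalg.lstsq rather than forming the normal equations, which is numerically gentler when $\mathbf{J}$ is ill-conditioned; the function and argument names are illustrative:

import numpy as np

def gauss_newton_sketch(residual_fn, jacobian_fn, theta0, tol=1e-6, max_iter=50):
    # residual_fn(theta) -> residual vector r(theta), shape (m,)
    # jacobian_fn(theta) -> Jacobian of the residuals, shape (m, p)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        r = residual_fn(theta)
        J = jacobian_fn(theta)
        # Solve the linearized problem min ||r + J * delta||^2
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)
        theta = theta + delta
        if np.linalg.norm(delta) < tol:
            break
    return theta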

Theoretical Justification

Relationship to Newton’s Method

The Gauss-Newton method can be derived as an approximation to Newton's method for minimizing the least squares objective function:

$$S(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{m} r_i(\boldsymbol{\theta})^2$$

The gradient of $S$ is:

$$\nabla S(\boldsymbol{\theta}) = \mathbf{J}^T \mathbf{r}$$

The Hessian of $S$ is:

$$\nabla^2 S(\boldsymbol{\theta}) = \mathbf{J}^T \mathbf{J} + \sum_{i=1}^{m} r_i(\boldsymbol{\theta})\, \nabla^2 r_i(\boldsymbol{\theta})$$

The Gauss-Newton method approximates the Hessian by dropping the second term:

$$\nabla^2 S(\boldsymbol{\theta}) \approx \mathbf{J}^T \mathbf{J}$$
This approximation is accurate when:

  • The residuals $r_i$ are small, or
  • The functions are nearly linear, making the second-derivative terms $\nabla^2 r_i$ small

Interpretation as Linear Least Squares

Each Gauss-Newton iteration solves a linear least squares problem:

$$\min_{\Delta\boldsymbol{\theta}} \; \frac{1}{2} \left\| \mathbf{r}(\boldsymbol{\theta}^{(k)}) + \mathbf{J}(\boldsymbol{\theta}^{(k)})\, \Delta\boldsymbol{\theta} \right\|^2$$

This is obtained by linearizing the residuals and substituting the linearization into the original objective function.

Convergence Properties

Convergence Rate

  • When the starting point is sufficiently close to the solution and the residuals at the solution are zero (or very small), Gauss-Newton exhibits quadratic (or near-quadratic) convergence, similar to Newton's method
  • With larger residuals, convergence is linear
  • May not converge if the starting point is far from the solution or if the problem is ill-conditioned

Comparison with Other Methods

| Method | Convergence Rate | Residual Size | Second Derivatives | Matrix Inversion |
|---|---|---|---|---|
| Gauss-Newton | Quadratic / Linear | Small / Large | Not needed | Required |
| Newton | Quadratic | Any | Required | Required |
| Gradient Descent | Linear | Any | Not needed | Not required |
| Levenberg-Marquardt | Varies | Any | Not needed | Required |

Limitations and Challenges

Singular or Ill-Conditioned Jacobian

If $\mathbf{J}^T \mathbf{J}$ is singular or nearly singular, the update equation becomes numerically unstable. This can occur due to:

  • Overparameterization
  • Highly correlated parameters
  • Lack of sensitivity to certain parameters

Solutions include:

  • Regularization (adding a small value to the diagonal of $\mathbf{J}^T \mathbf{J}$)
  • Using pseudoinverse instead of inverse
  • Singular value decomposition (SVD)
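
As a brief sketch of the regularization option (the shift size eps is an arbitrary illustrative choice; this is essentially a fixed-damping preview of the Levenberg-Marquardt variant described later):

import numpy as np

def regularized_step(J, r, eps=1e-8):
    # Solve (J^T J + eps * I) delta = -J^T r instead of the raw normal equations
    JTJ = J.T @ J
    A = JTJ + eps * np.eye(JTJ.shape[0])
    return np.linalg.solve(A, -J.T @ r)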

Divergence

Gauss-Newton may diverge when:

  • The initial guess is far from the solution
  • The objective function has significant nonlinearity
  • The model is poorly specified relative to the data

Large Residuals

When residuals are large at the solution, the Hessian approximation $\nabla^2 S \approx \mathbf{J}^T \mathbf{J}$ becomes poor, leading to:

  • Slower convergence
  • Potential oscillation or divergence
  • Suboptimal solutions

Enhancements and Variants

Line Search

Incorporate a line search to determine the step size $\alpha_k$:

$$\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \alpha_k\, \Delta\boldsymbol{\theta}$$

where $\alpha_k \in (0, 1]$ is chosen to ensure sufficient decrease in the objective function.
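
A simple backtracking sketch that halves $\alpha$ until the objective decreases (the constants and the objective_fn argument are illustrative assumptions):

import numpy as np

def damped_update(theta, delta, objective_fn, alpha0=1.0, shrink=0.5, max_halvings=20):
    # Backtracking line search along the Gauss-Newton direction delta
    alpha = alpha0
    f0 = objective_fn(theta)
    for _ in range(max_halvings):
        candidate = theta + alpha * delta
        if objective_fn(candidate) < f0:
            return candidate   # accept the first step that reduces the objective
        alpha *= shrink        # step too long: shrink it and try again
    return theta               # no decrease found; keep the current estimate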

Trust Region Methods

Restrict the step to a region where the linear approximation is trusted:

$$\min_{\Delta\boldsymbol{\theta}} \; \frac{1}{2} \|\mathbf{r} + \mathbf{J}\, \Delta\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad \|\Delta\boldsymbol{\theta}\| \le \delta_k$$

where $\delta_k$ is the trust region radius, adjusted based on how well the linear model predicts the actual decrease.

Levenberg-Marquardt Algorithm

A hybrid of Gauss-Newton and gradient descent that solves a damped version of the normal equations:

$$(\mathbf{J}^T \mathbf{J} + \lambda \mathbf{I})\, \Delta\boldsymbol{\theta} = -\mathbf{J}^T \mathbf{r}$$

where $\lambda \ge 0$ is a damping parameter:

  • When $\lambda$ is large, the algorithm behaves like gradient descent (robust but slow)
  • When $\lambda$ is small, the algorithm behaves like Gauss-Newton (fast but potentially unstable)
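
A minimal sketch of one damped step with the customary accept/reject adaptation of $\lambda$ (the factor-of-10 updates and the helper names are illustrative assumptions):

import numpy as np

def lm_step(theta, residual_fn, jacobian_fn, lam):
    # Solve (J^T J + lam * I) delta = -J^T r for the damped step
    r = residual_fn(theta)
    J = jacobian_fn(theta)
    A = J.T @ J + lam * np.eye(len(theta))
    delta = np.linalg.solve(A, -J.T @ r)

    # Accept the step only if it reduces the sum of squared residuals
    if np.sum(residual_fn(theta + delta)**2) < np.sum(r**2):
        return theta + delta, lam / 10.0   # good step: trust the Gauss-Newton model more
    return theta, lam * 10.0               # bad step: lean toward gradient descent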

Practical Implementation

Computing the Jacobian

Three main approaches:

  1. Analytical Jacobian: Derive and implement explicit formulas for derivatives
    • Most accurate
    • Potentially complex for complicated models
    • Error-prone for large models
  2. Numerical Approximation: Use finite differences
    • Forward difference: $\dfrac{\partial r_i}{\partial \theta_j} \approx \dfrac{r_i(\boldsymbol{\theta} + \epsilon \mathbf{e}_j) - r_i(\boldsymbol{\theta})}{\epsilon}$
    • Central difference: $\dfrac{\partial r_i}{\partial \theta_j} \approx \dfrac{r_i(\boldsymbol{\theta} + \epsilon \mathbf{e}_j) - r_i(\boldsymbol{\theta} - \epsilon \mathbf{e}_j)}{2\epsilon}$
    • Simple to implement, but less accurate than analytical derivatives and more computationally expensive
  3. Automatic Differentiation: Leverage specialized software
    • Combines accuracy of analytical with convenience of numerical methods
    • Modern libraries like TensorFlow, PyTorch, JAX support this
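
As a sketch of the third approach, assuming JAX is available, jax.jacfwd can differentiate the residual function of a hypothetical exponential-decay model directly (the model and variable names are illustrative):

import jax
import jax.numpy as jnp

def residuals(theta, x_data, y_data):
    # Residuals of a hypothetical model f(x; theta) = theta_1 * exp(-theta_2 * x)
    return y_data - theta[0] * jnp.exp(-theta[1] * x_data)

def make_jacobian(x_data, y_data):
    # Forward-mode autodiff of the residual vector with respect to theta
    return jax.jacfwd(lambda theta: residuals(theta, x_data, y_data))

x = jnp.linspace(0.0, 5.0, 50)
y = 2.0 * jnp.exp(-1.5 * x)
J = make_jacobian(x, y)(jnp.array([1.0, 1.0]))   # Jacobian of shape (50, 2)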

Solving the Normal Equations

Methods to solve the normal equations $\mathbf{J}^T \mathbf{J}\, \Delta\boldsymbol{\theta} = -\mathbf{J}^T \mathbf{r}$:

  1. Cholesky Decomposition: If $\mathbf{J}^T \mathbf{J}$ is positive definite
    • Decompose $\mathbf{J}^T \mathbf{J} = \mathbf{L} \mathbf{L}^T$
    • Solve $\mathbf{L} \mathbf{z} = -\mathbf{J}^T \mathbf{r}$ for $\mathbf{z}$
    • Solve $\mathbf{L}^T \Delta\boldsymbol{\theta} = \mathbf{z}$ for $\Delta\boldsymbol{\theta}$
  2. QR Decomposition: More stable but more expensive
    • Decompose $\mathbf{J} = \mathbf{Q} \mathbf{R}$
    • Solve $\mathbf{R}\, \Delta\boldsymbol{\theta} = -\mathbf{Q}^T \mathbf{r}$
  3. SVD (Singular Value Decomposition): Most stable but most expensive
    • Decompose $\mathbf{J} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$
    • Compute $\Delta\boldsymbol{\theta} = -\mathbf{V} \boldsymbol{\Sigma}^{+} \mathbf{U}^T \mathbf{r}$
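
A hedged sketch of the three solvers using NumPy and SciPy routines (cho_factor, cho_solve, and solve_triangular are from scipy.linalg); each function returns the Gauss-Newton step given the residual Jacobian J and residual vector r:

import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def step_cholesky(J, r):
    # Factor J^T J = L L^T, then perform the two triangular solves
    factor = cho_factor(J.T @ J)
    return cho_solve(factor, -J.T @ r)

def step_qr(J, r):
    # Reduced QR of J, then solve R * delta = -Q^T r
    Q, R = np.linalg.qr(J)
    return solve_triangular(R, -Q.T @ r)

def step_svd(J, r):
    # Thin SVD of J, then apply the pseudoinverse: delta = -V Sigma^+ U^T r
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    return -Vt.T @ ((U.T @ r) / s)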

Convergence Criteria

Common stopping conditions:

  1. Small parameter change: $\|\Delta\boldsymbol{\theta}\| < \epsilon_1$
  2. Small residual change: $|S(\boldsymbol{\theta}^{(k+1)}) - S(\boldsymbol{\theta}^{(k)})| < \epsilon_2$
  3. Small gradient: $\|\mathbf{J}^T \mathbf{r}\| < \epsilon_3$
  4. Maximum iterations reached: $k \ge k_{\max}$
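
A small sketch combining these checks (the threshold values are illustrative defaults):

import numpy as np

def converged(delta_theta, obj_old, obj_new, gradient, eps1=1e-8, eps2=1e-10, eps3=1e-8):
    # Stop when the step, the objective change, or the gradient norm is small enough
    return (np.linalg.norm(delta_theta) < eps1
            or abs(obj_new - obj_old) < eps2
            or np.linalg.norm(gradient) < eps3)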

Applications

Curve Fitting

Fitting parameterized curves to data points:

  • Exponential decay: $f(x; \boldsymbol{\theta}) = \theta_1 e^{-\theta_2 x}$
  • Power laws: $f(x; \boldsymbol{\theta}) = \theta_1 x^{\theta_2}$
  • Gaussian peaks: $f(x; \boldsymbol{\theta}) = \theta_1 \exp\!\left(-\dfrac{(x - \theta_2)^2}{2 \theta_3^2}\right)$
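
Written as Python model functions (the parameter orderings follow the formulas above and are otherwise illustrative):

import numpy as np

def exponential_decay(x, theta):
    # f(x; theta) = theta_1 * exp(-theta_2 * x)
    return theta[0] * np.exp(-theta[1] * x)

def power_law(x, theta):
    # f(x; theta) = theta_1 * x ** theta_2
    return theta[0] * x ** theta[1]

def gaussian_peak(x, theta):
    # f(x; theta) = theta_1 * exp(-(x - theta_2)^2 / (2 * theta_3^2))
    return theta[0] * np.exp(-(x - theta[1])**2 / (2 * theta[2]**2))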

Model Calibration

Adjusting model parameters to match experimental data:

  • Chemical reaction kinetics
  • Mechanical system parameters
  • Electrical circuit models

System Identification

Determining mathematical models of dynamic systems:

  • Transfer function identification
  • State-space model parameter estimation
  • Time-series modeling

Implementation Example

Python implementation of the Gauss-Newton method for curve fitting:

import numpy as np
 
def gauss_newton(func, x_data, y_data, theta0, jac=None, tol=1e-6, max_iter=100):
    """
    Gauss-Newton algorithm for nonlinear least squares.
    
    Parameters:
    - func: Function that computes the model prediction
    - x_data: Independent variable data
    - y_data: Dependent variable data (observations)
    - theta0: Initial parameter estimate
    - jac: Function to compute the Jacobian (if None, use finite differences)
    - tol: Convergence tolerance
    - max_iter: Maximum number of iterations
    
    Returns:
    - theta: Optimal parameter estimate
    - info: Information about the optimization
    """
    theta = np.asarray(theta0, dtype=float)
    n_params = len(theta)
    n_samples = len(x_data)
    
    # If no Jacobian function is provided, create a numerical approximation
    if jac is None:
        def numerical_jacobian(theta):
            J = np.zeros((n_samples, n_params))
            epsilon = 1e-8  # Small step for finite differences
            
            for j in range(n_params):
                theta_plus = theta.copy()
                theta_plus[j] += epsilon
                
                # Forward difference approximation
                J[:, j] = -(func(x_data, theta_plus) - func(x_data, theta)) / epsilon
            
            return J
        
        jacobian_func = numerical_jacobian
    else:
        jacobian_func = jac
    
    # Gauss-Newton iterations
    for iteration in range(max_iter):
        # Compute residuals
        y_pred = func(x_data, theta)
        residuals = y_data - y_pred
        
        # Compute objective function value
        obj_value = 0.5 * np.sum(residuals**2)
        
        # Compute Jacobian
        J = jacobian_func(theta)
        
        # Gradient of the objective function: grad S = J^T r (not used in the update; kept for inspection)
        gradient = J.T @ residuals
        
        # Compute Gauss-Newton step
        try:
            # Using normal equations: (J^T J) delta_theta = -J^T residuals
            JTJ = J.T @ J
            delta_theta = np.linalg.solve(JTJ, -J.T @ residuals)
        except np.linalg.LinAlgError:
            # If matrix is singular, use pseudoinverse
            delta_theta = -np.linalg.pinv(J) @ residuals
        
        # Update parameters
        theta_new = theta + delta_theta
        
        # Check convergence
        if np.linalg.norm(delta_theta) < tol:
            theta = theta_new
            break
        
        theta = theta_new
    
    # Compute final residuals and objective value
    y_pred = func(x_data, theta)
    residuals = y_data - y_pred
    final_obj_value = 0.5 * np.sum(residuals**2)
    
    info = {
        'iterations': iteration + 1,
        'obj_value': final_obj_value,
        'residuals': residuals,
        'jacobian': jacobian_func(theta)
    }
    
    return theta, info
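
A short usage sketch of the gauss_newton function above on synthetic exponential-decay data (the model form, true parameters, noise level, and starting point are illustrative):

# Example usage: fit f(x; theta) = theta_1 * exp(-theta_2 * x) to noisy synthetic data
def model(x, theta):
    return theta[0] * np.exp(-theta[1] * x)

rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 5.0, 50)
true_theta = np.array([2.5, 1.3])
y_data = model(x_data, true_theta) + 0.02 * rng.standard_normal(x_data.size)

theta_hat, info = gauss_newton(model, x_data, y_data, theta0=[1.0, 1.0])
print("Estimated parameters:", theta_hat)
print("Iterations:", info['iterations'], "Final objective:", info['obj_value'])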