Gradient-based methods are a class of optimization algorithms that use the gradient (first derivative) of the objective function to guide the search for an optimal solution. These methods rely on the intuition that the negative gradient points in the direction of steepest descent.

Mathematical Foundation

For an objective function f(x), where x ∈ ℝⁿ, gradient-based methods utilize the gradient vector:

∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)ᵀ

The negative gradient points in the direction of steepest descent of the function at the current point.

General Update Rule

The general form of gradient-based methods follows the update rule:

x_{k+1} = x_k − α_k ∇f(x_k)

where:

  • x_k is the parameter vector at iteration k
  • α_k is the step size (or learning rate)
  • ∇f(x_k) is the gradient at x_k

Key Gradient-Based Methods

Gradient Descent (Steepest Descent)

The most basic gradient-based method, which directly follows the negative gradient direction:

  1. Initialize x_0
  2. For each iteration k = 0, 1, 2, …:
    • Compute gradient ∇f(x_k)
    • Determine step size α_k
    • Update: x_{k+1} = x_k − α_k ∇f(x_k)
  3. Repeat until convergence criteria are met
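The steps above can be sketched in a few lines of NumPy. This is a minimal illustration with a fixed step size; the test function, tolerance, and step size are arbitrary choices for the example:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Basic gradient descent with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # convergence criterion
            break
        x = x - alpha * g             # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```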

Variants of Gradient Descent

  • Batch Gradient Descent: Uses the entire dataset to compute the gradient
  • Stochastic Gradient Descent (SGD): Uses a single sample to approximate the gradient
  • Mini-batch Gradient Descent: Uses a small subset of samples to approximate the gradient
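As a sketch of the mini-batch variant, the following fits a least-squares model with mini-batch SGD; the synthetic noiseless data, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true (noiseless, for a clear check).
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def minibatch_sgd(X, y, batch_size=16, alpha=0.05, epochs=200):
    """Mini-batch SGD on the least-squares loss (1/2n)||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                     # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]        # mini-batch indices
            g = X[b].T @ (X[b] @ w - y[b]) / len(b)  # mini-batch gradient estimate
            w -= alpha * g
    return w

w_hat = minibatch_sgd(X, y)
```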

Momentum Method

Accelerates convergence by incorporating past gradients:

v_{k+1} = β v_k + ∇f(x_k)
x_{k+1} = x_k − α v_{k+1}

where β ∈ [0, 1) is the momentum parameter that determines how much past gradients influence the current update.
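A minimal heavy-ball sketch of this update; the ill-conditioned quadratic test function and the values of α and β are assumptions for illustration:

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.01, beta=0.9, max_iter=2000):
    """Gradient descent with momentum:
    v_{k+1} = beta * v_k + grad f(x_k);  x_{k+1} = x_k - alpha * v_{k+1}."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(max_iter):
        v = beta * v + grad(x)   # accumulate past gradients
        x = x - alpha * v
    return x

# Ill-conditioned quadratic f(x) = 0.5 * (100*x1^2 + x2^2), gradient A @ x.
A = np.diag([100.0, 1.0])
x_star = momentum_descent(lambda x: A @ x, x0=[1.0, 1.0])
```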

Conjugate Gradient Method

Utilizes conjugate directions to achieve faster convergence:

  1. Initialize x_0, g_0 = ∇f(x_0), d_0 = −g_0
  2. For each iteration k:
    • Determine step size α_k via line search
    • Update: x_{k+1} = x_k + α_k d_k
    • Compute new gradient: g_{k+1} = ∇f(x_{k+1})
    • Compute scaling factor (Fletcher-Reeves formula): β_{k+1} = (g_{k+1}ᵀ g_{k+1}) / (g_kᵀ g_k)
    • Update direction: d_{k+1} = −g_{k+1} + β_{k+1} d_k
  3. Repeat until convergence
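For a quadratic objective f(x) = 0.5 xᵀAx − bᵀx, the algorithm above can be sketched directly, since the exact line-search step α_k = (g_kᵀ g_k)/(d_kᵀ A d_k) is available in closed form; the matrix A and vector b below are illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Fletcher-Reeves conjugate gradient for f(x) = 0.5 x^T A x - b^T x,
    with A symmetric positive definite, using exact line search."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b          # gradient g_0
    d = -g                 # initial direction d_0 = -g_0
    for _ in range(max_iter or len(b)):
        alpha = (g @ g) / (d @ A @ d)      # exact line search for a quadratic
        x = x + alpha * d                  # x_{k+1} = x_k + alpha_k d_k
        g_new = A @ x - b                  # new gradient
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves scaling factor
        d = -g_new + beta * d              # new conjugate direction
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = conjugate_gradient(A, b, x0=[0.0, 0.0])  # at most 2 steps in exact arithmetic
```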

Adaptive Gradient Methods

Adapt the learning rate for each parameter based on historical gradients:

  • AdaGrad: Divides the learning rate by the square root of the sum of squared past gradients
  • RMSProp: Uses an exponentially weighted average of squared gradients
  • Adam: Combines momentum with RMSProp
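As a compact sketch, the Adam update combines the momentum (first-moment) and RMSProp-style (second-moment) exponential averages with bias correction; the test function and hyperparameter values are illustrative defaults:

```python
import numpy as np

def adam(grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: per-parameter adaptive steps from gradient history."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first-moment (momentum) estimate
    v = np.zeros_like(x)   # second-moment (squared-gradient) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g         # exponential average of gradients
        v = beta2 * v + (1 - beta2) * g * g     # exponential average of squares
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = x1^2 + 10*x2^2.
x_star = adam(lambda x: np.array([2 * x[0], 20 * x[1]]), x0=[3.0, -2.0])
```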

Step Size Determination

The performance of gradient-based methods heavily depends on the choice of step size α_k:

Fixed Step Size

Using a constant α for all iterations. Simple but often inefficient, requiring careful tuning.

Line Search Methods

Determine the optimal step size by solving a one-dimensional optimization problem at each iteration:

α_k = argmin_{α > 0} f(x_k + α d_k)

Exact line search is often computationally expensive, so approximate methods such as backtracking are used:

  1. Start with an initial step size α = α_0
  2. If f(x_k + α d_k) > f(x_k) + c α ∇f(x_k)ᵀ d_k, then reduce α ← ρ α and repeat
  3. Where c ∈ (0, 1) and ρ ∈ (0, 1) are parameters
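The backtracking procedure above can be sketched as a small function; the quadratic test problem and the values c = 1e-4, ρ = 0.5 are conventional but assumed here:

```python
import numpy as np

def backtracking(f, grad, x, d, alpha0=1.0, c=1e-4, rho=0.5):
    """Backtracking line search: shrink alpha until the Armijo
    sufficient-decrease condition holds along direction d."""
    alpha = alpha0
    fx, gx = f(x), grad(x)
    while f(x + alpha * d) > fx + c * alpha * (gx @ d):
        alpha *= rho    # reduce the step and try again
    return alpha

# f(x) = 0.5 ||x||^2, steepest-descent direction d = -grad f(x).
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([2.0, 0.0])
alpha = backtracking(f, grad, x, -grad(x))
```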

Strong Wolfe Conditions

These conditions ensure sufficient decrease and a reduction in gradient magnitude:

  1. f(x_k + α d_k) ≤ f(x_k) + c_1 α ∇f(x_k)ᵀ d_k (Armijo condition)
  2. |∇f(x_k + α d_k)ᵀ d_k| ≤ c_2 |∇f(x_k)ᵀ d_k| (curvature condition)

where 0 < c_1 < c_2 < 1.
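As a small illustration, the two conditions can be checked directly for a candidate step; the quadratic test problem and the c₁, c₂ values are assumptions:

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for step size alpha along d."""
    g0 = grad(x) @ d
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * g0        # sufficient decrease
    curvature = abs(grad(x + alpha * d) @ d) <= c2 * abs(g0)   # curvature condition
    return armijo and curvature

# f(x) = 0.5 ||x||^2 with a steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([2.0, 0.0])
ok = satisfies_strong_wolfe(f, grad, x, d=-grad(x), alpha=0.5)
```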

Adaptive Step Sizes

Methods like AdaGrad, RMSProp, and Adam automatically adjust step sizes for each parameter based on the history of gradients.

Convergence Properties

Convergence Rate

For convex functions with Lipschitz continuous gradients:

  • Gradient descent with fixed step size: sublinear convergence, O(1/k)
  • Gradient descent with optimal step size: sublinear convergence, O(1/k)
  • Accelerated methods (e.g., Nesterov momentum): sublinear convergence, O(1/k²)
  • Conjugate gradient: Linear convergence; guaranteed to converge in at most n steps for quadratic functions in ℝⁿ

Convergence Guarantees

  • For convex functions: Convergence to global minimum
  • For non-convex functions: Convergence to local minimum or saddle point
  • For strongly convex functions with appropriate step size: Linear convergence rate

Challenges and Limitations

Zig-Zagging

Gradient descent can exhibit zig-zagging behavior in narrow valleys, where the negative gradient points across the valley rather than along it.

[Figure: Zig-zagging behavior]

Ill-Conditioning

When the Hessian has a high condition number (ratio of largest to smallest eigenvalue), gradient descent converges very slowly. The contours of the objective function become elongated ellipses rather than circles.

Saddle Points

In high-dimensional non-convex problems, saddle points (where the gradient is zero but not a minimum) can significantly slow down gradient-based methods.

Local Minima

Gradient-based methods can get trapped in local minima, missing the global optimum in non-convex problems.

Implementation Considerations

Gradient Evaluation

  • Analytical gradients: Most accurate but may be complex to derive
  • Numerical gradients: Approximated using finite differences, less accurate but universal
  • Automatic differentiation: Combines accuracy of analytical with convenience of numerical methods
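A central-difference sketch of numerical gradient evaluation, checked against an analytical gradient; the test function and step h are illustrative:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference, O(h^2) error
    return g

# f(x) = x1^2 + 3*x1*x2 has analytical gradient (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
g_num = numerical_gradient(f, np.array([1.0, 2.0]))
```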

Preconditioning

Transforming the optimization problem to improve conditioning:

x_{k+1} = x_k − α_k P ∇f(x_k)

where P is a preconditioning matrix that approximates the inverse Hessian.
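As a sketch, a simple Jacobi (diagonal) preconditioner applied to an ill-conditioned quadratic; the Hessian, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 x^T H x with H = diag(100, 1).
H = np.diag([100.0, 1.0])
grad = lambda x: H @ x

# Jacobi preconditioner: approximate the inverse Hessian by its diagonal.
P = np.diag(1.0 / np.diag(H))

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - 0.5 * P @ grad(x)   # preconditioned update x_{k+1} = x_k - alpha * P * grad
```

Here P H = I exactly, so the preconditioned iteration contracts both coordinates at the same rate instead of being limited by the smallest curvature direction.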

Mini-Batch Processing

For large datasets, using mini-batches to approximate gradients can significantly speed up computation:

∇f(x) ≈ (1/|B|) Σ_{i ∈ B} ∇f_i(x)

where B is a mini-batch of samples.

Applications

Gradient-based methods are widely used in:

  • Machine Learning: Training neural networks and other parametric models
  • Image and Signal Processing: Image reconstruction, signal enhancement
  • Control Theory: Optimal control strategies
  • Economics: Portfolio optimization, utility maximization
  • Engineering Design: Parameter tuning, shape optimization