Gradient-based methods are a class of optimization algorithms that use the gradient (first derivative) of the objective function to guide the search for an optimal solution. These methods rely on the intuition that the negative gradient points in the direction of steepest descent.
Mathematical Foundation
For an objective function $f: \mathbb{R}^n \to \mathbb{R}$, gradient-based methods utilize the gradient vector:

$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$

The negative gradient $-\nabla f(x)$ points in the direction of steepest descent of the function at the current point.
General Update Rule
The general form of gradient-based methods follows the update rule:

$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$

where:
- $x_k$ is the parameter vector at iteration $k$
- $\alpha_k$ is the step size (or learning rate)
- $\nabla f(x_k)$ is the gradient at $x_k$
Key Gradient-Based Methods
Gradient Descent (Steepest Descent)
The most basic gradient-based method, which directly follows the negative gradient direction:
- Initialize $x_0$
- For each iteration $k = 0, 1, 2, \ldots$:
- Compute gradient $\nabla f(x_k)$
- Determine step size $\alpha_k$
- Update: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
- Repeat until convergence criteria are met
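As a concrete illustration, here is a minimal NumPy sketch of this loop with a fixed step size and a gradient-norm stopping test; the function names, tolerances, and the example objective are illustrative choices, not part of any standard API:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Minimize f by following the negative gradient with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                 # compute the gradient at the current point
        if np.linalg.norm(g) < tol:   # convergence criterion: small gradient norm
            break
        x = x - alpha * g             # update: x_{k+1} = x_k - alpha * grad f(x_k)
    return x

# Example: minimize f(x, y) = x^2 + 4y^2, whose minimum is at the origin.
grad = lambda x: np.array([2 * x[0], 8 * x[1]])
print(gradient_descent(grad, [3.0, -2.0]))  # ~[0, 0]
```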
Variants of Gradient Descent
- Batch Gradient Descent: Uses the entire dataset to compute the gradient
- Stochastic Gradient Descent (SGD): Uses a single sample to approximate the gradient
- Mini-batch Gradient Descent: Uses a small subset of samples to approximate the gradient
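The three variants differ only in which samples enter the gradient computation. A minimal sketch, assuming a least-squares loss; the loss function and all names here are illustrative:

```python
import numpy as np

def lsq_gradient(w, X, y):
    """Gradient of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)

g_batch = lsq_gradient(w, X, y)               # batch: all 100 samples
i = rng.integers(len(y))
g_sgd = lsq_gradient(w, X[i:i+1], y[i:i+1])   # stochastic: a single sample
idx = rng.choice(len(y), size=16, replace=False)
g_mini = lsq_gradient(w, X[idx], y[idx])      # mini-batch: 16 samples
```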
Momentum Method
Accelerates convergence by incorporating past gradients:

$v_{k+1} = \beta v_k + \nabla f(x_k)$

$x_{k+1} = x_k - \alpha v_{k+1}$

where $\beta \in [0, 1)$ is the momentum parameter that determines how much past gradients influence the current update.
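A minimal sketch of this update, matching the two formulas above; the names and hyperparameter values are illustrative:

```python
import numpy as np

def momentum_descent(grad_f, x0, alpha=0.1, beta=0.9, tol=1e-6, max_iter=1000):
    """Gradient descent with momentum: the velocity accumulates past gradients."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)              # velocity: running combination of past gradients
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        v = beta * v + g              # v_{k+1} = beta * v_k + grad f(x_k)
        x = x - alpha * v             # x_{k+1} = x_k - alpha * v_{k+1}
    return x
```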
Conjugate Gradient Method
Utilizes conjugate directions to achieve faster convergence:
- Initialize $x_0$, $g_0 = \nabla f(x_0)$, $d_0 = -g_0$
- For each iteration $k = 0, 1, 2, \ldots$:
- Determine step size $\alpha_k$ via line search
- Update: $x_{k+1} = x_k + \alpha_k d_k$
- Compute new gradient: $g_{k+1} = \nabla f(x_{k+1})$
- Compute scaling factor (Fletcher-Reeves formula): $\beta_{k+1} = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$
- Update direction: $d_{k+1} = -g_{k+1} + \beta_{k+1} d_k$
- Repeat until convergence
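For a quadratic objective $f(x) = \frac{1}{2} x^T A x - b^T x$ the exact line search has the closed form $\alpha_k = g_k^T g_k / (d_k^T A d_k)$, so that special case (the linear conjugate gradient method) is easy to sketch; variable names are illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite),
    i.e. solve Ax = b, using conjugate directions with exact line search."""
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, dtype=float)
    g = A @ x - b                         # gradient of f at x
    d = -g                                # initial direction: steepest descent
    for _ in range(len(b)):               # at most n steps in exact arithmetic
        if np.linalg.norm(g) < tol:
            break
        Ad = A @ d
        alpha = (g @ g) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        g_new = g + alpha * Ad            # updated gradient
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves scaling factor
        d = -g_new + beta * d             # new conjugate direction
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))           # matches np.linalg.solve(A, b)
```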
Adaptive Gradient Methods
Adapt the learning rate for each parameter based on historical gradients:
- AdaGrad: Divides the learning rate by the square root of the sum of squared past gradients
- RMSProp: Uses an exponentially weighted average of squared gradients
- Adam: Combines momentum with RMSProp
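As an illustration of the adaptive family, here is a minimal sketch of the Adam update, combining a momentum-style first moment with an RMSProp-style second moment; the hyperparameter defaults shown are the commonly used ones, and the function name is illustrative:

```python
import numpy as np

def adam(grad_f, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam: momentum on the gradient plus per-parameter scaling by squared gradients."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first moment: exponentially weighted average of gradients
    v = np.zeros_like(x)   # second moment: exponentially weighted average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)        # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return x
```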
Step Size Determination
The performance of gradient-based methods heavily depends on the choice of step size $\alpha_k$:
Fixed Step Size
Using a constant $\alpha_k = \alpha$ for all iterations. Simple but often inefficient, requiring careful tuning.
Line Search Methods
Determine the optimal step size by solving a one-dimensional optimization problem at each iteration:

$\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)$

where $d_k$ is the search direction (for steepest descent, $d_k = -\nabla f(x_k)$). Exact line search is often computationally expensive, so approximate methods are used:
Backtracking Line Search
- Start with initial step size $\alpha = \alpha_0$
- If $f(x_k + \alpha d_k) > f(x_k) + c\,\alpha \nabla f(x_k)^T d_k$, then reduce $\alpha \leftarrow \rho \alpha$ and repeat
- Where $c \in (0, 1)$ and $\rho \in (0, 1)$ are parameters (typical values are $c = 10^{-4}$ and $\rho = 0.5$)
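A direct transcription of this procedure as a sketch; the names and default values are illustrative:

```python
import numpy as np

def backtracking(f, grad_fx, x, d, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the sufficient-decrease (Armijo) condition holds.
    Assumes d is a descent direction, so the loop terminates."""
    alpha = alpha0
    slope = grad_fx @ d               # directional derivative along d (negative for descent)
    while f(x + alpha * d) > f(x) + c * alpha * slope:
        alpha *= rho                  # reduce the step and test again
    return alpha
```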
Strong Wolfe Conditions
Ensures sufficient decrease and gradient magnitude reduction:
- $f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k$ (Armijo condition)
- $\left| \nabla f(x_k + \alpha_k d_k)^T d_k \right| \le c_2 \left| \nabla f(x_k)^T d_k \right|$ (curvature condition)
where $0 < c_1 < c_2 < 1$
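A small helper that checks whether a candidate step size satisfies both conditions makes them concrete; this is only a checker, not a full Wolfe line search, and the names are illustrative:

```python
import numpy as np

def satisfies_strong_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the two strong Wolfe conditions for a candidate step size alpha."""
    g0_d = grad_f(x) @ d                                          # initial slope along d
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * g0_d         # sufficient decrease
    curvature = abs(grad_f(x + alpha * d) @ d) <= c2 * abs(g0_d)  # slope has flattened
    return armijo and curvature
```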
Adaptive Step Sizes
Methods like AdaGrad, RMSProp, and Adam automatically adjust step sizes for each parameter based on the history of gradients.
Convergence Properties
Convergence Rate
For convex functions with Lipschitz continuous gradients:
- Gradient descent with fixed step size: $O(1/k)$ sublinear convergence
- Gradient descent with optimal step size: $O(1/k)$ sublinear convergence
- Accelerated methods (e.g., Nesterov momentum): $O(1/k^2)$ sublinear convergence
- Conjugate gradient: linear convergence in general; guaranteed to converge in at most $n$ steps for quadratic functions in $\mathbb{R}^n$
Convergence Guarantees
- For convex functions: Convergence to global minimum
- For non-convex functions: Convergence to local minimum or saddle point
- For strongly convex functions with appropriate step size: Linear convergence rate
Challenges and Limitations
Zig-Zagging
Gradient descent can exhibit zig-zagging behavior in narrow valleys, where the negative gradient points across the valley rather than along it.
Ill-Conditioning
When the Hessian has a high condition number (ratio of largest to smallest eigenvalue), gradient descent converges very slowly. The contours of the objective function become elongated ellipses rather than circles.
Saddle Points
In high-dimensional non-convex problems, saddle points (where the gradient is zero but not a minimum) can significantly slow down gradient-based methods.
Local Minima
Gradient-based methods can get trapped in local minima, missing the global optimum in non-convex problems.
Implementation Considerations
Gradient Evaluation
- Analytical gradients: Most accurate but may be complex to derive
- Numerical gradients: Approximated using finite differences, less accurate but universal
- Automatic differentiation: Combines accuracy of analytical with convenience of numerical methods
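A central-difference approximation is easy to sketch and is commonly used to sanity-check analytical gradients; the names and the test function are illustrative:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient with central finite differences, one coordinate at a time."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference: O(h^2) error
    return g

# Check against the analytical gradient of f(x, y) = x^2 + 4y^2 at (3, -2): (6, -16).
f = lambda x: x[0]**2 + 4 * x[1]**2
print(numerical_gradient(f, [3.0, -2.0]))  # ~[6, -16]
```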
Preconditioning
Transforming the optimization problem to improve conditioning:
$x_{k+1} = x_k - \alpha_k P \nabla f(x_k)$

where $P$ is a preconditioning matrix that approximates the inverse Hessian.
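A minimal sketch: for the quadratic $f(x, y) = x^2 + 4y^2$ the Hessian is $\mathrm{diag}(2, 8)$, and taking $P$ as its exact inverse rescales the elongated contours into circles, so a unit step jumps straight to the minimum. The names here are illustrative:

```python
import numpy as np

def preconditioned_step(x, g, P, alpha=1.0):
    """One preconditioned update: x_{k+1} = x_k - alpha * P @ grad f(x_k)."""
    return x - alpha * P @ g

P = np.diag([1 / 2, 1 / 8])            # inverse Hessian of f(x, y) = x^2 + 4y^2
x = np.array([3.0, -2.0])
g = np.array([2 * x[0], 8 * x[1]])     # gradient at x
print(preconditioned_step(x, g, P))    # [0, 0]: the minimum in a single step
```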
Mini-Batch Processing
For large datasets, using mini-batches to approximate gradients can significantly speed up computation:
$\nabla f(x) \approx \frac{1}{|B|} \sum_{i \in B} \nabla f_i(x)$

where $B$ is a mini-batch of samples and $f_i$ is the loss on sample $i$.
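A sketch of the resulting training loop, assuming a hypothetical `grad_fi(x, i)` that returns the gradient of the $i$-th per-sample loss; all names and defaults are illustrative:

```python
import numpy as np

def minibatch_sgd(grad_fi, x0, n_samples, alpha=0.01, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD: each step averages per-sample gradients over a random batch."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)           # reshuffle samples every epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            g = np.mean([grad_fi(x, i) for i in batch], axis=0)  # (1/|B|) sum of grad f_i
            x = x - alpha * g
    return x
```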
Applications
Gradient-based methods are widely used in:
- Machine Learning: Training neural networks and other parametric models
- Image and Signal Processing: Image reconstruction, signal enhancement
- Control Theory: Optimal control strategies
- Economics: Portfolio optimization, utility maximization
- Engineering Design: Parameter tuning, shape optimization