Implementation Considerations
Scaling and Conditioning
Problem Scaling
Scaling is a critical aspect of numerical optimization that can dramatically affect algorithm performance.
- Poorly scaled problems occur when changes in θ in one direction lead to much larger changes in the objective function L than in other directions
- Example: L(θ) = θ₁² + 1000·θ₂²
- Here, changes in θ₂ have a much larger effect on L than changes in θ₁
- This can cause algorithms to progress very slowly in certain directions
Fixing Scaling Issues
- Variable rescaling: Transform variables to make their effects more uniform
  - For the example above, rescaling the second variable as θ̃₂ = √1000·θ₂ (so that L = θ₁² + θ̃₂²) would help balance the scaling; see the sketch after this list
- Unit selection: When variables have physical units, choose units judiciously to improve scaling
- Normalization: Normalize variables to similar ranges (e.g., [0,1] or [-1,1])
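As a concrete illustration of variable rescaling, here is a minimal Python sketch assuming the poorly scaled example L(θ) = θ₁² + 1000·θ₂² used above; the names `L`, `L_scaled`, and `S` are purely illustrative.

```python
import numpy as np

# Poorly scaled objective from the example above: L(theta) = theta_1^2 + 1000 * theta_2^2
def L(theta):
    return theta[0]**2 + 1000.0 * theta[1]**2

# Change of variables theta_tilde = S * theta with S = diag(1, sqrt(1000)).
# In the new variables the objective becomes theta_tilde_1^2 + theta_tilde_2^2,
# so both directions affect L equally.
S = np.array([1.0, np.sqrt(1000.0)])

def L_scaled(theta_tilde):
    return L(theta_tilde / S)   # optimize over theta_tilde, map back with theta = theta_tilde / S
```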
Algorithm Sensitivity to Scaling
- Scale-sensitive algorithms:
  - Steepest descent is highly sensitive to scaling problems
  - May zig-zag and converge very slowly on poorly scaled problems (demonstrated below)
- Scale-insensitive algorithms:
  - Newton’s method is invariant to scaling (and any affine transformation)
  - Quasi-Newton methods like BFGS are less affected by scaling issues
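The following small experiment illustrates the scale sensitivity of steepest descent: it runs fixed-step gradient descent on the poorly scaled quadratic from above and on its rescaled counterpart, each with a reasonable fixed step size. The step sizes and the resulting iteration counts are illustrative, not prescriptive.

```python
import numpy as np

def gradient_descent(grad, theta0, lr, max_iters=100_000, tol=1e-8):
    """Fixed-step steepest descent; returns the final iterate and iteration count."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        theta = theta - lr * g
    return theta, k

# Poorly scaled quadratic L(theta) = theta_1^2 + 1000*theta_2^2, gradient [2*theta_1, 2000*theta_2].
# Even near the best fixed step size, progress along theta_1 is very slow.
grad_bad = lambda th: np.array([2.0 * th[0], 2000.0 * th[1]])
_, k_bad = gradient_descent(grad_bad, [1.0, 1.0], lr=2.0 / (2.0 + 2000.0))

# Rescaled version L = theta_1^2 + theta_2^2: the same method converges almost immediately.
grad_good = lambda th: 2.0 * np.asarray(th, dtype=float)
_, k_good = gradient_descent(grad_good, [1.0, 1.0], lr=0.5)

print(k_bad, k_good)  # thousands of iterations vs. a handful
```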
Convergence Criteria
Determining when to stop an optimization algorithm is crucial for efficiency and accuracy.
Common Stopping Criteria
- Gradient norm: stop when ‖∇L(θ)‖ ≤ ε
  - Intuitive: stop when the gradient is small enough
  - Drawback: sensitive to scaling
- Newton decrement: stop when λ(θ)²/2 ≤ ε, where λ(θ)² = ∇L(θ)ᵀ H(θ)⁻¹ ∇L(θ) and H is the Hessian
  - Based on the quadratic approximation of the objective
  - Advantage: insensitive to scaling, more reliable
- Relative improvement:
  - Stop when the relative change in objective value is small: |L(θₖ₊₁) − L(θₖ)| / |L(θₖ)| ≤ ε
  - Simple but may terminate prematurely
- Maximum iterations:
  - Safety measure to prevent infinite loops
  - Important in practical implementations
- Constraint violation:
  - For constrained problems, check if constraints are satisfied within tolerance
  - Often used alongside other criteria
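In practice several of these tests are usually checked together. Below is a minimal, hypothetical helper combining the gradient-norm, relative-improvement, and maximum-iteration criteria; the tolerance defaults are arbitrary placeholders.

```python
import numpy as np

def should_stop(grad, f_prev, f_curr, k, *, grad_tol=1e-6, rel_tol=1e-9, max_iters=1000):
    """Combine several stopping tests; any one of them can trigger termination."""
    if np.linalg.norm(grad) <= grad_tol:                          # gradient norm is small
        return True, "gradient norm"
    if abs(f_curr - f_prev) <= rel_tol * max(abs(f_prev), 1.0):   # relative improvement is small
        return True, "relative improvement"
    if k >= max_iters:                                            # safety cap on iterations
        return True, "max iterations"
    return False, ""
```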
Setting Tolerance Values
- Balancing accuracy and efficiency:
  - Too tight: wastes computational resources
  - Too loose: leads to inaccurate solutions
- Adaptive tolerance:
  - Start with a loose tolerance and tighten it as optimization progresses (sketched below)
  - Can save computation in early stages
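One way to realize the adaptive-tolerance idea is continuation: solve a sequence of increasingly accurate problems, warm-starting each from the previous solution. The `solve` callable below is a hypothetical stand-in for any iterative optimizer that exposes a tolerance parameter.

```python
def solve_with_continuation(solve, theta0, tols=(1e-2, 1e-4, 1e-6)):
    """Run `solve` repeatedly with progressively tighter tolerances,
    warm-starting each pass from the previous result."""
    theta = theta0
    for tol in tols:
        theta = solve(theta, tol=tol)  # cheap, loose passes first; accurate pass last
    return theta
```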
Computational Complexity
Understanding the computational cost of optimization algorithms helps in algorithm selection.
Cost Components
- Cost per iteration:
  - Function evaluations
  - Gradient evaluations
  - Hessian evaluations or approximations
  - Linear system solutions (for Newton-type methods)
- Number of iterations required for convergence
- Total cost = cost per iteration × number of iterations
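To make the total-cost formula concrete, here is a deliberately rough back-of-the-envelope comparison; the per-iteration costs follow the table below, while the iteration counts are purely illustrative assumptions, not measurements.

```python
# Total cost = cost per iteration x number of iterations (illustrative numbers only)
D = 1000
newton_total = D**3 * 20          # ~O(D^3) per iteration, quadratic convergence -> few iterations
gd_total     = D    * 100_000     # ~O(D)   per iteration, linear convergence    -> many iterations
print(f"Newton ~ {newton_total:.1e} ops, gradient descent ~ {gd_total:.1e} ops")
```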
Algorithm Complexity
| Algorithm | Cost per iteration | Convergence rate | Best for |
|---|---|---|---|
| Gradient Descent | O(D) | Linear | Large-scale, simple problems |
| Newton | O(D³) | Quadratic | Small-scale, complex problems |
| Quasi-Newton (BFGS) | O(D²) | Superlinear | Medium-scale problems |
| L-BFGS | O(D) | Superlinear | Large-scale problems |
| Conjugate Gradient | O(D) | Superlinear | Large-scale, well-conditioned |
Memory Requirements
- Newton’s method: O(D²) for storing the Hessian
- BFGS: O(D²) for storing Hessian approximation
- L-BFGS: O(mD) where m is the memory parameter (typically m << D)
- Gradient Descent and CG: O(D) for storing gradient and a few vectors
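As a concrete example of the O(mD) memory behaviour, SciPy's L-BFGS-B implementation exposes the memory parameter m as the `maxcor` option; the Rosenbrock test problem below is only there to make the snippet runnable.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

D = 1000
x0 = np.zeros(D)
# maxcor is the L-BFGS memory parameter m: only m pairs of D-dimensional vectors
# are stored, so memory grows as O(mD) rather than the O(D^2) of a dense BFGS matrix.
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B", options={"maxcor": 10})
print(res.success, res.nit)
```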
Carbon Footprint Considerations
Comparing algorithms by their computational cost and choosing the most appropriate one with the smallest carbon footprint is an important practical consideration.
Strategies to Reduce Carbon Footprint
- Algorithm selection:
  - Choose the simplest algorithm that can effectively solve your problem
  - Avoid overkill (e.g., using Newton’s method when gradient descent would suffice)
- Early stopping:
  - Use appropriate convergence criteria to avoid unnecessary iterations
  - Consider if a slightly less accurate solution is acceptable
- Warm starting (see the sketch after this list):
  - Use results from previous similar problems to initialize the algorithm
  - Particularly useful in iterative processes where optimization is repeated
- Parallel computing:
  - When appropriate, use parallel algorithms to reduce wall-clock time
  - Note that parallel computing may increase total energy usage
- Hardware selection:
  - GPUs can be more energy-efficient for certain optimization problems
  - Consider specialized hardware for specific types of problems
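A minimal sketch of warm starting, using SciPy's L-BFGS-B on a made-up sequence of related problems; the shifted Rosenbrock objectives are purely illustrative stand-ins for whatever family of similar problems you solve repeatedly.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x = np.zeros(50)                                   # cold start only for the first problem
for shift in np.linspace(0.0, 0.5, 6):             # a sequence of similar problems
    f = lambda v, s=shift: rosen(v - s)            # hypothetical shifted objective
    g = lambda v, s=shift: rosen_der(v - s)
    res = minimize(f, x, jac=g, method="L-BFGS-B")
    print(f"shift={shift:.2f}  iterations={res.nit}")
    x = res.x                                      # warm start the next problem from here
```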
Algorithm Selection Guidelines
Guidelines for choosing the most appropriate optimization algorithm:
Problem Characteristics to Consider
- Problem size (D):
  - Small (D < 100): Newton or Quasi-Newton methods are often preferred
  - Medium (100 < D < 1000): L-BFGS or Conjugate Gradient
  - Large (D > 1000): Stochastic methods, L-BFGS, or specialized algorithms
- Objective function type:
  - Linear: Linear Programming algorithms (Simplex, Interior Point)
  - Quadratic: Conjugate Gradient or specialized QP solvers
  - General nonlinear: BFGS, L-BFGS, or Newton methods
- Availability of derivatives:
  - Gradient available: Use gradient-based methods
  - No gradient: Consider derivative-free methods or finite-difference approximations
  - Hessian available: Consider Newton’s method for small problems
- Constraints:
  - Unconstrained: Direct application of optimization algorithms
  - Constrained: Penalty methods, Augmented Lagrangian, or specialized algorithms
  - Box constraints: L-BFGS-B or active set methods
- Required accuracy:
  - High accuracy: Newton-type methods
  - Moderate accuracy: Quasi-Newton or Conjugate Gradient
  - Low accuracy: Gradient Descent or stochastic methods
By considering these factors, you can select an algorithm that balances computational efficiency and solution quality for your specific optimization problem.
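These guidelines can be distilled into a rough dispatch heuristic. The sketch below maps problem characteristics to SciPy method names; it is a starting point under the thresholds listed above, not a definitive rule.

```python
def choose_method(D, has_grad, has_hess, constrained=False, bounds_only=False):
    """Rough heuristic following the guidelines above (returns a SciPy method name)."""
    if not has_grad:
        return "Nelder-Mead"                    # derivative-free
    if constrained:
        return "L-BFGS-B" if bounds_only else "SLSQP"
    if D < 100 and has_hess:
        return "Newton-CG"                      # Newton-type for small problems with Hessians
    if D < 1000:
        return "BFGS"                           # dense quasi-Newton, O(D^2) memory
    return "L-BFGS-B"                           # limited-memory method for large problems

print(choose_method(D=5000, has_grad=True, has_hess=False))   # -> "L-BFGS-B"
```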