Implementation Considerations
Scaling and Conditioning
Problem Scaling
Scaling is a critical aspect of numerical optimization that can dramatically affect algorithm performance.
- Poorly scaled problems occur when changes in θ in one direction lead to much larger changes in the objective function L than in other directions
- Example: L(θ) = θ₁² + 1000·θ₂²
- Here, changes in θ₂ have a much larger effect on L than changes in θ₁
- This can cause algorithms to progress very slowly in certain directions
Fixing Scaling Issues
- Variable rescaling: Transform variables to make their effects more uniform
  - For the example above, rescaling the second variable as θ̃₂ = √1000·θ₂ (so that L = θ₁² + θ̃₂²) would help balance the scaling; see the sketch after this list
- Unit selection: When variables have physical units, choose units judiciously to improve scaling
- Normalization: Normalize variables to similar ranges (e.g., [0,1] or [-1,1])
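As a concrete illustration of variable rescaling, here is a minimal Python sketch assuming the poorly scaled example L(θ) = θ₁² + 1000·θ₂² used above; the names `L`, `L_scaled`, and `S` are purely illustrative.

```python
import numpy as np

# Poorly scaled objective from the example above: L(theta) = theta_1^2 + 1000 * theta_2^2
def L(theta):
    return theta[0]**2 + 1000.0 * theta[1]**2

# Change of variables theta_tilde = S * theta with S = diag(1, sqrt(1000)).
# In the new variables the objective becomes theta_tilde_1^2 + theta_tilde_2^2,
# so both directions affect L equally.
S = np.array([1.0, np.sqrt(1000.0)])

def L_scaled(theta_tilde):
    return L(theta_tilde / S)   # optimize over theta_tilde, map back with theta = theta_tilde / S
```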
Algorithm Sensitivity to Scaling
- Scale-sensitive algorithms:
  - Steepest descent is highly sensitive to scaling problems
  - May zig-zag and converge very slowly on poorly scaled problems (demonstrated below)
- Scale-insensitive algorithms:
  - Newton’s method is invariant to scaling (and any affine transformation)
  - Quasi-Newton methods like BFGS are less affected by scaling issues
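The following small experiment illustrates the scale sensitivity of steepest descent: it runs fixed-step gradient descent on the poorly scaled quadratic from above and on its rescaled counterpart, each with a reasonable fixed step size. The step sizes and the resulting iteration counts are illustrative, not prescriptive.

```python
import numpy as np

def gradient_descent(grad, theta0, lr, max_iters=100_000, tol=1e-8):
    """Fixed-step steepest descent; returns the final iterate and iteration count."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        theta = theta - lr * g
    return theta, k

# Poorly scaled quadratic L(theta) = theta_1^2 + 1000*theta_2^2, gradient [2*theta_1, 2000*theta_2].
# Even near the best fixed step size, progress along theta_1 is very slow.
grad_bad = lambda th: np.array([2.0 * th[0], 2000.0 * th[1]])
_, k_bad = gradient_descent(grad_bad, [1.0, 1.0], lr=2.0 / (2.0 + 2000.0))

# Rescaled version L = theta_1^2 + theta_2^2: the same method converges almost immediately.
grad_good = lambda th: 2.0 * np.asarray(th, dtype=float)
_, k_good = gradient_descent(grad_good, [1.0, 1.0], lr=0.5)

print(k_bad, k_good)  # thousands of iterations vs. a handful
```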
Convergence Criteria
Determining when to stop an optimization algorithm is crucial for efficiency and accuracy.
Common Stopping Criteria
- Gradient norm: stop when ‖∇L(θ)‖ ≤ ε
  - Intuitive: stop when the gradient is small enough
  - Drawback: sensitive to scaling
- Newton decrement: stop when λ(θ)²/2 ≤ ε, where λ(θ)² = ∇L(θ)ᵀ H(θ)⁻¹ ∇L(θ) and H is the Hessian
  - Based on the quadratic approximation of the objective
  - Advantage: insensitive to scaling, more reliable
- Relative improvement:
  - Stop when the relative change in objective value is small: |L(θₖ₊₁) − L(θₖ)| / |L(θₖ)| ≤ ε
  - Simple but may terminate prematurely
- Maximum iterations:
  - Safety measure to prevent infinite loops
  - Important in practical implementations
- Constraint violation:
  - For constrained problems, check if constraints are satisfied within tolerance
  - Often used alongside other criteria
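In practice several of these tests are usually checked together. Below is a minimal, hypothetical helper combining the gradient-norm, relative-improvement, and maximum-iteration criteria; the tolerance defaults are arbitrary placeholders.

```python
import numpy as np

def should_stop(grad, f_prev, f_curr, k, *, grad_tol=1e-6, rel_tol=1e-9, max_iters=1000):
    """Combine several stopping tests; any one of them can trigger termination."""
    if np.linalg.norm(grad) <= grad_tol:                          # gradient norm is small
        return True, "gradient norm"
    if abs(f_curr - f_prev) <= rel_tol * max(abs(f_prev), 1.0):   # relative improvement is small
        return True, "relative improvement"
    if k >= max_iters:                                            # safety cap on iterations
        return True, "max iterations"
    return False, ""
```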
Setting Tolerance Values
- Balancing accuracy and efficiency:
  - Too tight: wastes computational resources
  - Too loose: leads to inaccurate solutions
- Adaptive tolerance:
  - Start with a loose tolerance and tighten it as optimization progresses (sketched below)
  - Can save computation in early stages
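One way to realize the adaptive-tolerance idea is continuation: solve a sequence of increasingly accurate problems, warm-starting each from the previous solution. The `solve` callable below is a hypothetical stand-in for any iterative optimizer that exposes a tolerance parameter.

```python
def solve_with_continuation(solve, theta0, tols=(1e-2, 1e-4, 1e-6)):
    """Run `solve` repeatedly with progressively tighter tolerances,
    warm-starting each pass from the previous result."""
    theta = theta0
    for tol in tols:
        theta = solve(theta, tol=tol)  # cheap, loose passes first; accurate pass last
    return theta
```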
Computational Complexity
Understanding the computational cost of optimization algorithms helps in algorithm selection.
Cost Components
- Cost per iteration:
  - Function evaluations
  - Gradient evaluations
  - Hessian evaluations or approximations
  - Linear system solutions (for Newton-type methods)
- Number of iterations required for convergence
- Total cost = cost per iteration × number of iterations
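To make the total-cost formula concrete, here is a deliberately rough back-of-the-envelope comparison; the per-iteration costs follow the table below, while the iteration counts are purely illustrative assumptions, not measurements.

```python
# Total cost = cost per iteration x number of iterations (illustrative numbers only)
D = 1000
newton_total = D**3 * 20          # ~O(D^3) per iteration, quadratic convergence -> few iterations
gd_total     = D    * 100_000     # ~O(D)   per iteration, linear convergence    -> many iterations
print(f"Newton ~ {newton_total:.1e} ops, gradient descent ~ {gd_total:.1e} ops")
```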
Algorithm Complexity
| Algorithm | Cost per iteration | Convergence rate | Best for |
|---|---|---|---|
| Gradient Descent | O(D) | Linear | Large-scale, simple problems |
| Newton | O(D³) | Quadratic | Small-scale, complex problems |
| Quasi-Newton (BFGS) | O(D²) | Superlinear | Medium-scale problems |
| L-BFGS | O(D) | Superlinear | Large-scale problems |
| Conjugate Gradient | O(D) | Superlinear | Large-scale, well-conditioned |
Memory Requirements
- Newton’s method: O(D²) for storing the Hessian
- BFGS: O(D²) for storing Hessian approximation
- L-BFGS: O(mD) where m is the memory parameter (typically m << D)
- Gradient Descent and CG: O(D) for storing gradient and a few vectors
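As a concrete example of the O(mD) memory behaviour, SciPy's L-BFGS-B implementation exposes the memory parameter m as the `maxcor` option; the Rosenbrock test problem below is only there to make the snippet runnable.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

D = 1000
x0 = np.zeros(D)
# maxcor is the L-BFGS memory parameter m: only m pairs of D-dimensional vectors
# are stored, so memory grows as O(mD) rather than the O(D^2) of a dense BFGS matrix.
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B", options={"maxcor": 10})
print(res.success, res.nit)
```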
Carbon Footprint Considerations
Comparing algorithms by their computational cost and choosing the most appropriate one with the smallest carbon footprint is an important practical consideration.
Strategies to Reduce Carbon Footprint
- Algorithm selection:
  - Choose the simplest algorithm that can effectively solve your problem
  - Avoid overkill (e.g., using Newton’s method when gradient descent would suffice)
- Early stopping:
  - Use appropriate convergence criteria to avoid unnecessary iterations
  - Consider if a slightly less accurate solution is acceptable
- Warm starting (see the sketch after this list):
  - Use results from previous similar problems to initialize the algorithm
  - Particularly useful in iterative processes where optimization is repeated
- Parallel computing:
  - When appropriate, use parallel algorithms to reduce wall-clock time
  - Note that parallel computing may increase total energy usage
- Hardware selection:
  - GPUs can be more energy-efficient for certain optimization problems
  - Consider specialized hardware for specific types of problems
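A minimal sketch of warm starting, using SciPy's L-BFGS-B on a made-up sequence of related problems; the shifted Rosenbrock objectives are purely illustrative stand-ins for whatever family of similar problems you solve repeatedly.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x = np.zeros(50)                                   # cold start only for the first problem
for shift in np.linspace(0.0, 0.5, 6):             # a sequence of similar problems
    f = lambda v, s=shift: rosen(v - s)            # hypothetical shifted objective
    g = lambda v, s=shift: rosen_der(v - s)
    res = minimize(f, x, jac=g, method="L-BFGS-B")
    print(f"shift={shift:.2f}  iterations={res.nit}")
    x = res.x                                      # warm start the next problem from here
```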
Algorithm Selection Guidelines
Guidelines for choosing the most appropriate optimization algorithm:
Problem Characteristics to Consider
- Problem size (D):
  - Small (D < 100): Newton or Quasi-Newton methods are often preferred
  - Medium (100 < D < 1000): L-BFGS or Conjugate Gradient
  - Large (D > 1000): Stochastic methods, L-BFGS, or specialized algorithms
- Objective function type:
  - Linear: Linear Programming algorithms (Simplex, Interior Point)
  - Quadratic: Conjugate Gradient or specialized QP solvers
  - General nonlinear: BFGS, L-BFGS, or Newton methods
- Availability of derivatives:
  - Gradient available: Use gradient-based methods
  - No gradient: Consider derivative-free methods or finite-difference approximations
  - Hessian available: Consider Newton’s method for small problems
- Constraints:
  - Unconstrained: Direct application of optimization algorithms
  - Constrained: Penalty methods, Augmented Lagrangian, or specialized algorithms
  - Box constraints: L-BFGS-B or active set methods
- Required accuracy:
  - High accuracy: Newton-type methods
  - Moderate accuracy: Quasi-Newton or Conjugate Gradient
  - Low accuracy: Gradient Descent or stochastic methods
By considering these factors, you can select an algorithm that balances computational efficiency and solution quality for your specific optimization problem.
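These guidelines can be distilled into a rough dispatch heuristic. The sketch below maps problem characteristics to SciPy method names; it is a starting point under the thresholds listed above, not a definitive rule.

```python
def choose_method(D, has_grad, has_hess, constrained=False, bounds_only=False):
    """Rough heuristic following the guidelines above (returns a SciPy method name)."""
    if not has_grad:
        return "Nelder-Mead"                    # derivative-free
    if constrained:
        return "L-BFGS-B" if bounds_only else "SLSQP"
    if D < 100 and has_hess:
        return "Newton-CG"                      # Newton-type for small problems with Hessians
    if D < 1000:
        return "BFGS"                           # dense quasi-Newton, O(D^2) memory
    return "L-BFGS-B"                           # limited-memory method for large problems

print(choose_method(D=5000, has_grad=True, has_hess=False))   # -> "L-BFGS-B"
```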