Implementation Considerations

Scaling and Conditioning

Problem Scaling

Scaling is a critical aspect of numerical optimization that can dramatically affect algorithm performance.

  • Poorly scaled problems occur when changes in θ along one direction produce much larger changes in the objective function L than equal-sized changes along other directions
  • Example (illustrative): L(θ) = θ₁² + 1000·θ₂²
    • Here, changes in θ₂ have a much larger effect on L than changes in θ₁
    • This can cause algorithms to progress very slowly along the shallow directions

Fixing Scaling Issues

  • Variable rescaling: Transform variables to make their effects more uniform
    • For the example above, rescaling the more sensitive variable by the inverse of its scale factor would balance the scaling
  • Unit selection: When variables have physical units, choose units judiciously to improve scaling
  • Normalization: Normalize variables to similar ranges (e.g., [0,1] or [-1,1])
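A minimal sketch of the last two options (the data here is made up; min-max normalization and standardization are two common concrete choices):

```python
import numpy as np

# Made-up data: two variables on very different scales
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])

# Min-max normalization to [0, 1], column-wise
lo, hi = X.min(axis=0), X.max(axis=0)
X_minmax = (X - lo) / (hi - lo)

# Standardization (zero mean, unit variance per column) is another common choice
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After either transform, both variables occupy comparable ranges, so a step of a given size affects the objective comparably in each coordinate.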

Algorithm Sensitivity to Scaling

  • Scale-sensitive algorithms:
    • Steepest descent is highly sensitive to scaling problems
    • May zig-zag and converge very slowly on poorly scaled problems
  • Scale-insensitive algorithms:
    • Newton’s method is invariant to scaling (and any affine transformation)
    • Quasi-Newton methods like BFGS are less affected by scaling issues
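To make the slow-progress behaviour concrete, here is a small sketch (the quadratic and the step sizes are illustrative choices, not from the text) comparing steepest descent on a poorly scaled quadratic with the same problem after rescaling:

```python
import numpy as np

def gradient_descent(grad, x0, lr, steps):
    """Plain steepest descent with a fixed step size."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Poorly scaled quadratic L(θ) = 0.5·(θ₁² + 100·θ₂²): curvatures differ by 100×,
# so stability forces the step size below 2/100
grad_bad = lambda t: np.array([t[0], 100.0 * t[1]])

# After rescaling θ₂' = 10·θ₂ the problem is isotropic: L = 0.5·‖θ‖²,
# and a much larger step size is stable
grad_rescaled = lambda t: t

x_bad = gradient_descent(grad_bad, [1.0, 1.0], lr=0.019, steps=50)
x_rescaled = gradient_descent(grad_rescaled, [1.0, 1.0], lr=0.9, steps=50)
# ‖x_bad‖ is still far from the optimum at 0; ‖x_rescaled‖ has essentially converged
```

The ill-conditioned version is stuck crawling along the shallow θ₁ direction because the steep θ₂ direction caps the usable step size; rescaling removes that cap.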

Convergence Criteria

Determining when to stop an optimization algorithm is crucial for efficiency and accuracy.

Common Stopping Criteria

  1. Gradient norm: stop when ‖∇L(θ)‖ ≤ ε

    • Intuitive: stop when the gradient is small enough
    • Drawback: sensitive to scaling
  2. Newton decrement: λ(θ)² = ∇L(θ)ᵀ ∇²L(θ)⁻¹ ∇L(θ); stop when λ(θ)²/2 ≤ ε

    • Based on the quadratic approximation of L at θ
    • Advantage: insensitive to scaling, more reliable
  3. Relative improvement: |L(θₖ₊₁) − L(θₖ)| / |L(θₖ)| ≤ ε

    • Stop when the relative change in the objective value is small
    • Simple but may terminate prematurely (e.g., on a flat plateau)
  4. Maximum iterations:

    • Safety measure to prevent infinite loops
    • Important in practical implementations
  5. Constraint violation:

    • For constrained problems, check if constraints are satisfied within tolerance
    • Often used alongside other criteria
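For an unconstrained problem, the first, third, and fourth criteria above can be combined into a single helper; the following is a sketch (function name and default tolerances are my own, not from the text):

```python
import numpy as np

def should_stop(grad, f_prev, f_curr, n_iter, *,
                gtol=1e-6, ftol=1e-9, max_iter=1000):
    """Check common stopping criteria; returns (stop, reason)."""
    # 1. Gradient norm (note: scale-sensitive)
    if np.linalg.norm(grad) <= gtol:
        return True, "gradient norm"
    # 3. Relative improvement, guarded against division by tiny |f_prev|
    if abs(f_curr - f_prev) <= ftol * max(1.0, abs(f_prev)):
        return True, "relative improvement"
    # 4. Iteration cap as a safety measure
    if n_iter >= max_iter:
        return True, "max iterations"
    return False, ""
```

Returning the triggering reason makes it easy to log why a run terminated, which helps diagnose premature stops.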

Setting Tolerance Values

  • Balancing accuracy and efficiency:
    • Too tight: wastes computational resources
    • Too loose: leads to inaccurate solutions
  • Adaptive tolerance:
    • Start with loose tolerance and tighten as optimization progresses
    • Can save computation in early stages

Computational Complexity

Understanding the computational cost of optimization algorithms helps in algorithm selection.

Cost Components

  1. Cost per iteration:

    • Function evaluations
    • Gradient evaluations
    • Hessian evaluations or approximations
    • Linear system solutions (for Newton-type methods)
  2. Number of iterations required for convergence

  3. Total cost = cost per iteration × number of iterations

Algorithm Complexity

| Algorithm | Cost per iteration | Convergence rate | Best for |
|---|---|---|---|
| Gradient Descent | O(D) | Linear | Large-scale, simple problems |
| Newton | O(D³) | Quadratic | Small-scale, complex problems |
| Quasi-Newton (BFGS) | O(D²) | Superlinear | Medium-scale problems |
| L-BFGS | O(D) | Superlinear | Large-scale problems |
| Conjugate Gradient | O(D) | Superlinear | Large-scale, well-conditioned |

Memory Requirements

  • Newton’s method: O(D²) for storing the Hessian
  • BFGS: O(D²) for storing Hessian approximation
  • L-BFGS: O(mD) where m is the memory parameter (typically m << D)
  • Gradient Descent and CG: O(D) for storing gradient and a few vectors
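As a back-of-the-envelope check (assuming 8-byte float64 entries; the values D = 10,000 and m = 10 are illustrative), the gap between these storage classes can be computed directly:

```python
# Illustrative memory comparison; D and m are made-up example values
D, m = 10_000, 10

hessian_bytes = D * D * 8      # Newton / BFGS: full D×D matrix, O(D²)
lbfgs_bytes = 2 * m * D * 8    # L-BFGS: m (s, y) vector pairs, O(mD)
gradient_bytes = D * 8         # Gradient Descent / CG: O(D) per vector

ratio = hessian_bytes // lbfgs_bytes  # → 500 for these values
```

At this size the full Hessian needs 800 MB while L-BFGS needs about 1.6 MB, which is why O(D²) storage becomes the binding constraint long before O(D²) arithmetic does.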

Carbon Footprint Considerations

The energy use, and hence carbon footprint, of optimization scales with total computational cost, so comparing algorithms by their complexity and choosing the least expensive one that meets the problem's requirements is an important consideration.

Strategies to Reduce Carbon Footprint

  1. Algorithm selection:

    • Choose the simplest algorithm that can effectively solve your problem
    • Avoid overkill (e.g., using Newton’s method when gradient descent would suffice)
  2. Early stopping:

    • Use appropriate convergence criteria to avoid unnecessary iterations
    • Consider if a slightly less accurate solution is acceptable
  3. Warm starting:

    • Use results from previous similar problems to initialize the algorithm
    • Particularly useful in iterative processes where optimization is repeated
  4. Parallel computing:

    • When appropriate, use parallel algorithms to reduce wall-clock time
    • Note that parallel computing may increase total energy usage
  5. Hardware selection:

    • GPUs can be more energy-efficient for certain optimization problems
    • Consider specialized hardware for specific types of problems
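Warm starting (strategy 3 above) can be sketched on a toy problem; everything here is illustrative (a simple gradient-descent loop and a made-up family of quadratics), but the iteration-count comparison is the point:

```python
import numpy as np

def minimize_gd(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent that reports how many iterations it needed."""
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            return x, k
        x = x - lr * g
    return x, max_iter

# Toy family of related problems: L_c(θ) = 0.5·‖θ − c‖², minimized at θ = c
grad_at = lambda c: (lambda t: t - c)

# Cold start from the origin
x_cold, k_cold = minimize_gd(grad_at(np.array([1.0, 1.0])), np.zeros(2))

# Warm start the next, slightly perturbed problem from the previous solution
x_warm, k_warm = minimize_gd(grad_at(np.array([1.01, 1.0])), x_cold)
# k_warm is noticeably smaller than k_cold
```

Because the new problem's minimizer is close to the old one's, the warm-started run begins with a much smaller gradient and saves a large fraction of the iterations (and energy).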

Algorithm Selection Guidelines

Guidelines for choosing the most appropriate optimization algorithm:

Problem Characteristics to Consider

  1. Problem size (D):

    • Small (D < 100): Newton or Quasi-Newton methods are often preferred
    • Medium (100 < D < 1000): L-BFGS or Conjugate Gradient
    • Large (D > 1000): Stochastic methods, L-BFGS, or specialized algorithms
  2. Objective function type:

    • Linear: Linear Programming algorithms (Simplex, Interior Point)
    • Quadratic: Conjugate Gradient or specialized QP solvers
    • General nonlinear: BFGS, L-BFGS, or Newton methods
  3. Availability of derivatives:

    • Gradient available: Use gradient-based methods
    • No gradient: Consider derivative-free methods or finite-difference approximations
    • Hessian available: Consider Newton’s method for small problems
  4. Constraints:

    • Unconstrained: Direct application of optimization algorithms
    • Constrained: Penalty methods, Augmented Lagrangian, or specialized algorithms
    • Box constraints: L-BFGS-B or active set methods
  5. Required accuracy:

    • High accuracy: Newton-type methods
    • Moderate accuracy: Quasi-Newton or Conjugate Gradient
    • Low accuracy: Gradient Descent or stochastic methods
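The decision rules above can be collapsed into a first-guess dispatcher; the following is a heuristic sketch (function name, parameter names, and the exact thresholds mirror the guidelines but are otherwise my own):

```python
def suggest_method(D, has_gradient=True, has_hessian=False, constraints=None):
    """Heuristic first guess at an algorithm.

    constraints is None (unconstrained), 'box', or 'general'.
    """
    if not has_gradient:
        return "derivative-free methods or finite-difference gradients"
    if constraints == "box":
        return "L-BFGS-B or active set methods"
    if constraints == "general":
        return "Augmented Lagrangian or penalty methods"
    if D < 100:
        # Small problems: second-order information pays off
        return "Newton" if has_hessian else "BFGS"
    if D <= 1000:
        return "L-BFGS or Conjugate Gradient"
    return "L-BFGS or stochastic methods"
```

A dispatcher like this only picks a starting point; required accuracy and problem structure (as discussed above) should still override it.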

By considering these factors, you can select an algorithm that balances computational efficiency and solution quality for your specific optimization problem.