Eng Opt Scrapbook

Lecture 2

Notes from Lecture 1:

We can approximate the Hessian with finite differences, e.g. using the secant method.

For the multivariate case we write $L (θ)$ with $θ$ in bold: now $θ \in ℜ^{D}$ and $L : ℜ^{D} \to ℜ$

Lecture 3

L (\tilde{θ}) = θ_{1} - θ_{2} + 2 θ_{1}^{2} + 2 θ_{1} θ_{2} + θ_{2}^{2}

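\tilde{g} (\tilde{θ}) = \begin{bmatrix} \partial L / \partial θ_{1} \\ \partial L / \partial θ_{2} \end{bmatrix} = \begin{bmatrix} 1 + 4 θ_{1} + 2 θ_{2} \\ - 1 + 2 θ_{1} + 2 θ_{2} \end{bmatrix}

Starting Guess: \tilde{θ}_{o} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \Rightarrow \tilde{g}_{o} = \begin{bmatrix} 1 \\ - 1 \end{bmatrix}

Δ \tilde{θ} = - α \tilde{g}_{o} = \begin{bmatrix} - α \\ + α \end{bmatrix}

\tilde{θ}_{1} = \begin{bmatrix} 0 - α \\ 0 + α \end{bmatrix} = \begin{bmatrix} - α \\ + α \end{bmatrix}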

L (α) = L (- α, α) = (- α) - (α) + 2 (- α)^{2} + 2 (- α) (α) + α^{2} = α^{2} - 2 α

Setting $dL / dα = 2 α - 2 = 0$ gives $α = 1$.

Finding $α$ here is called Line Search
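A minimal sketch of this steepest-descent step with an exact line search, in plain Python. It relies on the objective being quadratic, in which case the best step along $-g$ has the closed form $α = g^{T} g / (g^{T} H g)$, where $H$ is the constant matrix of second derivatives. The helper names (`loss`, `grad`, `exact_step_size`) are made up for illustration.

```python
# Objective from the notes: L = t1 - t2 + 2*t1^2 + 2*t1*t2 + t2^2
def loss(t1, t2):
    return t1 - t2 + 2 * t1**2 + 2 * t1 * t2 + t2**2

def grad(t1, t2):
    return (1 + 4 * t1 + 2 * t2, -1 + 2 * t1 + 2 * t2)

# Constant second-derivative matrix of this quadratic.
H = ((4.0, 2.0), (2.0, 2.0))

def exact_step_size(g):
    # For a quadratic, minimising L(theta - alpha * g) over alpha gives
    # alpha = (g . g) / (g . H g).
    Hg = (H[0][0] * g[0] + H[0][1] * g[1], H[1][0] * g[0] + H[1][1] * g[1])
    return (g[0] ** 2 + g[1] ** 2) / (g[0] * Hg[0] + g[1] * Hg[1])

theta0 = (0.0, 0.0)
g0 = grad(*theta0)                      # (1, -1)
alpha = exact_step_size(g0)             # line search along -g0
theta1 = (theta0[0] - alpha * g0[0], theta0[1] - alpha * g0[1])
print(alpha, theta1, loss(*theta1))
```

For a non-quadratic objective there is no such closed form, and $α$ has to be found numerically.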

Chapter 3 SS Rao Book

Hessian:

L (\tilde{θ}) = θ_{1} - θ_{2} + 2 θ_{1}^{2} + 2 θ_{1} θ_{2} + θ_{2}^{2}

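\tilde{g} (\tilde{θ}) = \begin{bmatrix} \partial L / \partial θ_{1} \\ \partial L / \partial θ_{2} \end{bmatrix} = \begin{bmatrix} 1 + 4 θ_{1} + 2 θ_{2} \\ - 1 + 2 θ_{1} + 2 θ_{2} \end{bmatrix}

\tilde{H} (\tilde{θ}) = \begin{bmatrix} \partial^{2} L / \partial θ_{1}^{2} & \partial^{2} L / \partial θ_{2} \partial θ_{1} \\ \partial^{2} L / \partial θ_{1} \partial θ_{2} & \partial^{2} L / \partial θ_{2}^{2} \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}

Starting Guess: \tilde{θ}_{o} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \Rightarrow \tilde{g}_{o} = \begin{bmatrix} 1 \\ - 1 \end{bmatrix}

\tilde{H} (\tilde{θ}_{o}) Δ \tilde{θ} = - \tilde{g}_{o} \Rightarrow \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} Δ θ_{1} \\ Δ θ_{2} \end{bmatrix} = - \begin{bmatrix} 1 \\ - 1 \end{bmatrix}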

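Writing $d_{1} = Δ θ_{1}$ and $d_{2} = Δ θ_{2}$:

4 d_{1} + 2 d_{2} = - 1

2 d_{1} + 2 d_{2} = + 1

Subtracting the second from the first: $2 d_{1} = - 2 \Rightarrow d_{1} = - 1 \Rightarrow d_{2} = 3/2$, so $Δ \tilde{θ} = \begin{bmatrix} Δ θ_{1} \\ Δ θ_{2} \end{bmatrix} = \begin{bmatrix} - 1 \\ 3/2 \end{bmatrix}$

or, by inverting the Hessian:

Δ \tilde{θ} = \begin{bmatrix} Δ θ_{1} \\ Δ θ_{2} \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}^{- 1} \begin{bmatrix} - 1 \\ + 1 \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 2 & - 2 \\ - 2 & 4 \end{bmatrix} \begin{bmatrix} - 1 \\ + 1 \end{bmatrix} = \frac{1}{4} \begin{bmatrix} - 4 \\ 6 \end{bmatrix} = \begin{bmatrix} - 1 \\ 3/2 \end{bmatrix}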
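The Newton step can be checked numerically. This is a small sketch in plain Python that solves $\tilde{H} Δ \tilde{θ} = - \tilde{g}_{o}$ with Cramer's rule; `solve_2x2` is a hypothetical helper written for this example, not something from the lecture.

```python
def solve_2x2(A, b):
    # Cramer's rule for a 2x2 linear system A x = b.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return (x0, x1)

def grad(t1, t2):
    # Gradient of L = t1 - t2 + 2*t1^2 + 2*t1*t2 + t2^2
    return (1 + 4 * t1 + 2 * t2, -1 + 2 * t1 + 2 * t2)

H = ((4.0, 2.0), (2.0, 2.0))   # Hessian, constant for this quadratic

theta0 = (0.0, 0.0)
g0 = grad(*theta0)
delta = solve_2x2(H, (-g0[0], -g0[1]))          # Newton step
theta1 = (theta0[0] + delta[0], theta0[1] + delta[1])
print(delta, grad(*theta1))
```

Because the objective is quadratic, the gradient at the new point is zero: Newton reaches the stationary point in a single step here.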

Lecture 4

From the problem sheet

H = \nabla_{(θ_{1}, θ_{2}, θ_{3}, θ_{4})}^{2} L = \begin{bmatrix}
\frac{\partial ^{2} L}{\partial θ _{1}^{2}} & \frac{\partial ^{2} L}{\partial θ _{2} \partial θ _{1}} & \frac{\partial ^{2} L}{\partial θ _{3} \partial θ _{1}} & \frac{\partial ^{2} L}{\partial θ _{4} \partial θ _{1}} \\
\frac{\partial ^{2} L}{\partial θ _{1} \partial θ _{2}} & \frac{\partial ^{2} L}{\partial θ _{2}^{2}} & \frac{\partial ^{2} L}{\partial θ _{3} \partial θ _{2}} & \frac{\partial ^{2} L}{\partial θ _{4} \partial θ _{2}} \\
\frac{\partial ^{2} L}{\partial θ _{1} \partial θ _{3}} & \frac{\partial ^{2} L}{\partial θ _{2} \partial θ _{3}} & \frac{\partial ^{2} L}{\partial θ _{3}^{2}} & \frac{\partial ^{2} L}{\partial θ _{4} \partial θ _{3}} \\
\frac{\partial ^{2} L}{\partial θ _{1} \partial θ _{4}} & \frac{\partial ^{2} L}{\partial θ _{2} \partial θ _{4}} & \frac{\partial ^{2} L}{\partial θ _{3} \partial θ _{4}} & \frac{\partial ^{2} L}{\partial θ _{4}^{2}}
\end{bmatrix}

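The Hessian is symmetric, so we only need $D (D + 1) /2$ distinct terms: $10$ for $D = 4$. The count $1.5 (D^{2} + D) = 30$ is three times that, which would match three function evaluations per term in a finite-difference estimate (my reading of the problem sheet, not confirmed).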
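A rough sketch of how the symmetry saves work in a finite-difference Hessian: only the upper triangle is estimated and then mirrored. It uses a standard central-difference formula; the function name `fd_hessian`, the step `h`, and the 4-variable test function are all assumptions made for this example.

```python
def fd_hessian(f, x, h=1e-3):
    """Central-difference Hessian estimate, computing only the upper triangle."""
    D = len(x)
    H = [[0.0] * D for _ in range(D)]
    n_entries = 0

    def shifted(i, si, j, sj):
        # Evaluate f with coordinate i moved by si*h and coordinate j by sj*h.
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return f(y)

    for i in range(D):
        for j in range(i, D):
            val = (shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
                   - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
            H[i][j] = H[j][i] = val      # mirror across the diagonal
            n_entries += 1
    return H, n_entries

# Hypothetical 4-variable quadratic with a known Hessian.
def f(t):
    return t[0] ** 2 + 2 * t[0] * t[1] + 3 * t[2] * t[3] + t[3] ** 2

H, n_entries = fd_hessian(f, [0.5, -1.0, 2.0, 0.3])
print(n_entries)   # number of distinct entries actually estimated
```

For $D = 4$ the loop visits $D (D + 1) / 2$ index pairs rather than all $D^{2}$.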

Lecture 5

For least squares we need N (the number of points) to be greater than or equal to D (the dimension) for there to be a unique solution. At N = D we generally get a perfect fit (think of a line through two points in 2D).

Gauss-Newton always descends, even when Newton doesn't, which is interesting given that it only uses an approximation of the Hessian. The approximation $J^{T} J$ is positive semi-definite, so the step is a descent direction whenever $J$ has full column rank.

A combination of Gauss-Newton and steepest descent is called the Levenberg-Marquardt method.

In least squares the “error” is defined through the residuals, and we optimise an objective function built from them:

L (θ) = \frac{1}{2} \sum_{i = 1}^{N} (Y (x_{i}; θ) - y_{i})^{2} = \frac{1}{2} \sum_{i = 1}^{N} r_{i}^{2} = \frac{1}{2} \| \tilde{r} \|^{2}

This is called the L2 norm because we square the components and then sum them. Other options:

L1 norm: $\frac{1}{2} \sum_{i} | r_{i} |$
L$\infty$ norm: $\frac{1}{2} \max_{i} | r_{i} |$

An outlier can significantly distort the “best” result with the L2 norm: because the residuals are squared, a single outlier skews the fit much more than it would under the L1 norm.
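A tiny illustration of the outlier effect, assuming the simplest possible model: fitting a single constant to the data. In that case the L2 fit is the mean and the L1 fit is the median. The data values are invented for the example.

```python
import statistics

# Four points near 1.0 plus one outlier (made-up data).
data = [1.0, 1.1, 0.9, 1.0, 10.0]

def l2_loss(c):
    # Half the sum of squared residuals for the constant model Y = c.
    return 0.5 * sum((c - y) ** 2 for y in data)

def l1_loss(c):
    # Half the sum of absolute residuals for the same model.
    return 0.5 * sum(abs(c - y) for y in data)

l2_fit = statistics.mean(data)     # minimiser of l2_loss
l1_fit = statistics.median(data)   # a minimiser of l1_loss
print(l2_fit, l1_fit)
```

The L2 fit is dragged well away from the cluster by the single outlier, while the L1 fit stays with the bulk of the points.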

Lecture 6

If $N < D$ then $J^{T} J$ is singular ⇒ the solution is not unique, so the result depends on the initial guess.

Even without a perfect fit when N >> D, it will converge in a single step: convergence depends on the gradient, not the residual. That doesn't mean it's a good fit, though…
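A sketch of the single-step behaviour, assuming the remark is about Gauss-Newton applied to a model that is linear in $θ$ (here a straight line $Y = θ_{0} + θ_{1} x$). The data points and function names are invented for illustration.

```python
# Made-up data: N = 4 points, D = 2 parameters, no exact fit possible.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

def residuals(theta):
    return [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]

def gradient(theta):
    # J^T r, where each Jacobian row is [1, x_i].
    r = residuals(theta)
    return (sum(r), sum(x * ri for x, ri in zip(xs, r)))

def gauss_newton_step(theta):
    g = gradient(theta)
    # Entries of J^T J for the straight-line model.
    a, b, d = float(len(xs)), sum(xs), sum(x * x for x in xs)
    det = a * d - b * b
    # Solve (J^T J) delta = -g by Cramer's rule.
    delta0 = (-g[0] * d + b * g[1]) / det
    delta1 = (-a * g[1] + b * g[0]) / det
    return (theta[0] + delta0, theta[1] + delta1)

theta1 = gauss_newton_step((0.0, 0.0))
print(theta1, gradient(theta1), residuals(theta1))
```

After one step the gradient is zero to rounding error, so the iteration has converged, yet the residuals are clearly non-zero.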

Rosenbrock function as an example for testing minimization algorithms

L_p norm is denoted as $\| \tilde{r} \|_{p} = \left( \sum_{i} | r_{i} |^{p} \right)^{1 / p}$

In terms of difficulty: Unconstrained Optimisation < Equality Constraint < Inequality Constraint

Lecture 7

Constrained Optimisation
|
\__ By Reduction? (If Possible)
\__ Lagrange Multiplier (Take it to a higher dimension)

L^{*} = L - \sum_{i = 1}^{n_{c}} λ_{i} c_{i}

Active vs inactive constraints: for an inequality, if our solution lies on the constraint it is active (basically an equality constraint); if it doesn't, the constraint is inactive.

KKT condition is an extension of Lagrange, allowing for uncertainty on whether our constraints are active.

Objective:

L (θ_{1}, θ_{2}) = (θ_{1} - 2)^{2} + (θ_{2} - 1)^{2}

Constraints:

θ_{1}^{2} - θ_{2} ⩽ 0 \Rightarrow θ_{2} - θ_{1}^{2} ⩾ 0

θ_{1} + θ_{2} ⩽ 2 \Rightarrow 2 - θ_{1} - θ_{2} ⩾ 0

Lagrangian:

L^{L} (θ_{1}, θ_{2}, λ_{1}, λ_{2}) = (θ_{1} - 2)^{2} + (θ_{2} - 1)^{2} - λ_{1} (θ_{2} - θ_{1}^{2}) - λ_{2} (2 - θ_{1} - θ_{2})

KKT Conditions:

\begin{cases} \frac{\partial L ^{L}}{\partial θ _{1}} = 0 & \frac{\partial L ^{L}}{\partial θ _{2}} = 0 \\ λ_{1} ⩾ 0 & λ_{2} ⩾ 0 \\ (θ_{2} - θ_{1}^{2}) ⩾ 0 & (2 - θ_{1} - θ_{2}) ⩾ 0 \\ λ_{1} (θ_{2} - θ_{1}^{2}) = 0 & λ_{2} (2 - θ_{1} - θ_{2}) = 0 \end{cases}
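A quick numerical check of these conditions, using the sign convention $L^{*} = L - \sum λ_{i} c_{i}$ from the Lagrange multiplier note above, with both constraints written as $c_{i} ⩾ 0$. The candidate point $θ = (1, 1)$ with $λ_{1} = λ_{2} = 2/3$ is not in the notes; it comes from assuming both constraints are active and solving the stationarity equations by hand.

```python
def kkt_report(t1, t2, lam1, lam2):
    c1 = t2 - t1 ** 2          # first constraint, required >= 0
    c2 = 2 - t1 - t2           # second constraint, required >= 0
    return {
        # Stationarity of L - lam1*c1 - lam2*c2 with respect to t1 and t2.
        "dL_dt1": 2 * (t1 - 2) + 2 * lam1 * t1 + lam2,
        "dL_dt2": 2 * (t2 - 1) - lam1 + lam2,
        # Primal and dual feasibility.
        "c1": c1,
        "c2": c2,
        "lam1": lam1,
        "lam2": lam2,
        # Complementary slackness.
        "lam1_c1": lam1 * c1,
        "lam2_c2": lam2 * c2,
    }

report = kkt_report(1.0, 1.0, 2.0 / 3.0, 2.0 / 3.0)
for name, value in report.items():
    print(name, value)
```

Both stationarity terms and both complementarity products come out as zero, the constraints sit exactly on their boundaries, and the multipliers are non-negative, so the candidate satisfies every condition in the list.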

The augmented Lagrangian method combines the Lagrange and penalty methods.

In unconstrained optimisation, $L : ℜ^{D} \to ℜ$, so we have D “knobs” to turn. The gradient in every direction should be equal to 0 at a minimum ⇒ $\nabla_{θ} L = \tilde{0}$

Variational calculus Functional Optimisation

Lecture 8

Penalty method

Problem: \min_{x} x^{2} \text{ s.t. } x = 1, with true solution x = 1

L_{p} = x^{2} + μ (x - 1)^{2}

\min_{x} L_{p} \Rightarrow Gradient = \frac{\partial L _{p}}{\partial x} = 2 x + 2 μ (x - 1) = 0 \Rightarrow x^{*} = \frac{μ}{μ + 1}

if μ = 1 \Rightarrow x^{*} = 0.5

if μ = 10 \Rightarrow x^{*} = \frac{10}{11} \approx 0.91

As $μ$ increases, $x^{*}$ converges towards 1, but slowly
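To see how slowly the penalty solution approaches the constraint, here is a short sketch that evaluates the closed-form minimiser $x^{*} = μ / (μ + 1)$ for a few penalty weights. The list of $μ$ values is arbitrary.

```python
def penalty_minimiser(mu):
    # Stationary point of x^2 + mu * (x - 1)^2, from 2x + 2*mu*(x - 1) = 0.
    return mu / (mu + 1.0)

for mu in [1, 10, 100, 1000]:
    x_star = penalty_minimiser(mu)
    # The gap to the true solution x = 1 is exactly 1 / (mu + 1).
    print(mu, x_star, 1.0 - x_star)
```

The constraint violation shrinks like $1 / (μ + 1)$, so each extra digit of accuracy costs a tenfold increase in $μ$.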

Penalty Method	Lagrange Method
Inexact Solution	Exact Solution
Choosing $μ$ is non-trivial	Saddle-point problem
$L_{p} (x) = x^{2} + μ (x - 1)^{2}$	$L^{L} (x, λ) = x^{2} - λ (x - 1)$

L_{A L} (x; λ) = x^{2} - λ (x - 1) + μ (x - 1)^{2}

Assume $λ = 0$ and $μ = 1$, solve for $x^{*}$, then update $λ$ with $λ \leftarrow λ - \frac{μ}{2} (x^{*} - 1)$

(1) $λ = 0, μ = 1 \Rightarrow x^{*} = 0.5$

Update $λ = 0 - \frac{1}{2} (0.5 - 1) = 0.25$

(2) $λ = 0.25, μ = 1$

L_{A L} (x) = x^{2} - 0.25 (x - 1) + 1 (x - 1)^{2}

Gradient: 2 x - 0.25 + 2 (x - 1) = 0 \Rightarrow x^{*} = \frac{2 + 0.25}{4} = 0.5625
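The iteration can be continued in a few lines of Python. This sketch keeps $μ$ fixed and uses the multiplier update exactly as written in these notes, $λ \leftarrow λ - \frac{μ}{2} (x - 1)$; other texts scale the update differently. Function names and the iteration count are assumptions for the example.

```python
def argmin_x(lam, mu):
    # Stationary point of x^2 - lam*(x - 1) + mu*(x - 1)^2:
    # 2x - lam + 2*mu*(x - 1) = 0  =>  x = (lam + 2*mu) / (2 + 2*mu)
    return (lam + 2.0 * mu) / (2.0 + 2.0 * mu)

def augmented_lagrangian(lam=0.0, mu=1.0, n_iter=200):
    xs = []
    for _ in range(n_iter):
        x = argmin_x(lam, mu)           # inner minimisation (closed form here)
        xs.append(x)
        lam = lam - 0.5 * mu * (x - 1)  # multiplier update from the notes
    return xs, lam

xs, lam = augmented_lagrangian()
print(xs[:3], xs[-1], lam)
```

With $μ$ held at 1 the minimisers creep up towards the true solution $x = 1$ and $λ$ tends to 2, without needing $μ$ to grow.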

In general, with both equality and inequality constraints:

\begin{align}
\mathcal{L}_A(\theta, \lambda, \sigma, \mu) &= L(\theta) - \sum_{j=1}^{m} \lambda_j \epsilon_j(\theta) + \mu \sum_{j=1}^{m} \epsilon_j^2(\theta) \\
&\quad - \sum_{k=1}^{n} \lambda_{m+k} I_k(\theta) + \mu \sum_{k=1}^{n} \max(0, -I_k(\theta) )^2
\end{align}

where m is the number of equality constraints $\epsilon_j$ and n is the number of inequality constraints $I_k$

Lecture 9

Wednesday Mar 19th talks about exam

A comma in the subscript is used as shorthand for a derivative:

L_{, i} \equiv \frac{\partial L}{\partial θ _{i}}

Backtracking Question Example

f (x) = 2 x_{1}^{2} + x_{2}^{2} - 2 x_{1} x_{2} + x_{1} - x_{2}

Start at point $x_{0} = [2, 1]^{⊺}$ , and calculate $\nabla f (x_{0})$ and steepest descent direction $p_{0}$

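\nabla f (x) = \begin{bmatrix} 4 x_{1} - 2 x_{2} + 1 \\ 2 x_{2} - 2 x_{1} - 1 \end{bmatrix}

\nabla f (x_{0}) = \begin{bmatrix} 8 - 2 + 1 \\ 2 - 4 - 1 \end{bmatrix} = \begin{bmatrix} 7 \\ - 3 \end{bmatrix}

p_{0} = - \nabla f (x_{0}) = \begin{bmatrix} - 7 \\ 3 \end{bmatrix}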

(b) Implement one iteration of backtracking line search with parameters $α_{0} = 1, ρ = 0.5$ , and $c = 0.3$ to find a suitable step size. Show all iterations of the backtracking procedure. [7 marks]

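Armijo (sufficient decrease) condition:

f (x_{0} + α p_{0}) \leq f (x_{0}) + c α \nabla f (x_{0})^{⊺} p_{0}

For $α = α_{0} = 1$ :

$f (x_{0} + α_{0} p_{0}) = f ([- 5, 4]) = 50 + 16 + 40 - 5 - 4 = 97$

$f (x_{0}) = 8 + 1 - 4 + 2 - 1 = 6$

$c α \nabla f (x_{0})^{⊺} p_{0} = 0.3 \cdot 1 \cdot [7, - 3] \cdot [- 7, 3] = 0.3 \cdot (- 58) = - 17.4$

Check Armijo: $97 \leq 6 - 17.4 = - 11.4$ fails. Alpha too large! Reduce: $α = ρ α_{0} = 0.5$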

For $α = 0.5$ :

x_{0} + α p_{0} = [2, 1] + 0.5 \cdot [- 7, 3] = [- 1.5, 2.5]

f ([- 1.5, 2.5]) = 2 (- 1.5)^{2} + (2.5)^{2} - 2 (- 1.5) (2.5) + (- 1.5) - 2.5 = 4.5 + 6.25 + 7.5 - 1.5 - 2.5 = 14.25

Check Armijo: $14.25 \leq 6 + 0.3 \cdot 0.5 \cdot (- 58) = 6 - 8.7 = - 2.7$

The condition is still not satisfied. Reduce $α$ again.

For $α = 0.25$ :

x_{0} + α p_{0} = [2, 1] + 0.25 \cdot [- 7, 3] = [0.25, 1.75]

f ([0.25, 1.75]) = 2 (0.25)^{2} + (1.75)^{2} - 2 (0.25) (1.75) + 0.25 - 1.75 = 0.125 + 3.0625 - 0.875 + 0.25 - 1.75 = 0.8125

Check Armijo: $0.8125 \leq 6 + 0.3 \cdot 0.25 \cdot (- 58) = 6 - 4.35 = 1.65$

Since $0.8125 \leq 1.65$ , the Armijo condition is satisfied. We accept $α = 0.25$ .
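The hand calculation above can be reproduced with a short backtracking routine. This is only a sketch: the function name `backtracking`, the `history` list, and the `max_iter` safeguard are additions for illustration, while $α_{0}$, $ρ$ and $c$ are the values given in the question.

```python
def f(x):
    x1, x2 = x
    return 2 * x1 ** 2 + x2 ** 2 - 2 * x1 * x2 + x1 - x2

def grad_f(x):
    x1, x2 = x
    return (4 * x1 - 2 * x2 + 1, 2 * x2 - 2 * x1 - 1)

def backtracking(x0, alpha0=1.0, rho=0.5, c=0.3, max_iter=50):
    g = grad_f(x0)
    p = (-g[0], -g[1])                       # steepest-descent direction
    slope = g[0] * p[0] + g[1] * p[1]        # grad^T p, negative for descent
    f0 = f(x0)
    alpha = alpha0
    history = []                             # (alpha, f(trial), Armijo bound)
    for _ in range(max_iter):
        trial = (x0[0] + alpha * p[0], x0[1] + alpha * p[1])
        f_trial = f(trial)
        bound = f0 + c * alpha * slope
        history.append((alpha, f_trial, bound))
        if f_trial <= bound:                 # Armijo condition satisfied
            return alpha, history
        alpha *= rho                         # shrink the step and retry
    raise RuntimeError("no acceptable step found")

alpha, history = backtracking((2.0, 1.0))
for row in history:
    print(row)
```

Each row of `history` corresponds to one Armijo check in the worked solution, so it is a convenient way to verify the arithmetic.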

Question 2: BFGS Method [20 marks]

Consider the function $f (x) = (x_{1} - 2)^{2} + 2 (x_{2} - 1)^{2}$ and we are using the BFGS method to minimize it.

(a) Starting with the initial point $x_{0} = [0, 0]^{T}$ and initial Hessian approximation $B_{0} = I$ (the identity matrix), calculate the first search direction $p_{0}$ and verify it is a descent direction. [4 marks]

(b) Assuming an exact line search gives a step size of $α_{0} = 1$ , find the new point $x_{1}$ and calculate the gradient at this point. [4 marks]

(c) Using the BFGS update formula, compute the new Hessian approximation $B_{1}$ . Show all steps in your calculation. [8 marks]

(d) Calculate the second search direction $p_{1}$ using the updated Hessian approximation. [4 marks]

Solution 2:

(a) First search direction

For the function $f (x) = (x_{1} - 2)^{2} + 2 (x_{2} - 1)^{2}$ , the gradient is: $\nabla f (x) = \begin{bmatrix} 2 (x_{1} - 2) \\ 4 (x_{2} - 1) \end{bmatrix}$

At $x_{0} = [0, 0]^{T}$ : $\nabla f (x_{0}) = \begin{bmatrix} 2 (0 - 2) \\ 4 (0 - 1) \end{bmatrix} = \begin{bmatrix} - 4 \\ - 4 \end{bmatrix}$

With $B_{0} = I$ , the search direction is: $p_{0} = - B_{0}^{- 1} \nabla f (x_{0}) = - I^{- 1} \begin{bmatrix} - 4 \\ - 4 \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \end{bmatrix}$

To verify this is a descent direction, we check that $\nabla f (x_{0})^{T} p_{0} < 0$ : $\nabla f (x_{0})^{T} p_{0} = [- 4, - 4] \cdot [4, 4] = - 16 - 16 = - 32 < 0$

Therefore, $p_{0}$ is indeed a descent direction.

(b) New point and gradient

Using step size $α_{0} = 1$ : $x_{1} = x_{0} + α_{0} p_{0} = [0, 0] + 1 \cdot [4, 4] = [4, 4]$

Calculate the gradient at $x_{1}$ : $\nabla f (x_{1}) = \begin{bmatrix} 2 (4 - 2) \\ 4 (4 - 1) \end{bmatrix} = \begin{bmatrix} 4 \\ 12 \end{bmatrix}$

(c) BFGS update

For the BFGS update, we need:

$s_{0} = x_{1} - x_{0} = [4, 4] - [0, 0] = [4, 4]$
$y_{0} = \nabla f (x_{1}) - \nabla f (x_{0}) = [4, 12] - [- 4, - 4] = [8, 16]$

Now we can use the BFGS update formula: $B_{1} = B_{0} - \frac{B _{0} s _{0} s _{0}^{T} B _{0}}{s _{0}^{T} B _{0} s _{0}} + \frac{y _{0} y _{0}^{T}}{y _{0}^{T} s _{0}}$

Calculate each term:

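$B_{0} s_{0} = I \cdot [4, 4]^{T} = [4, 4]^{T}$
$s_{0}^{T} B_{0} s_{0} = [4, 4] \cdot [4, 4] = 16 + 16 = 32$
$B_{0} s_{0} s_{0}^{T} B_{0} = \begin{bmatrix} 4 \\ 4 \end{bmatrix} \begin{bmatrix} 4 & 4 \end{bmatrix} = \begin{bmatrix} 16 & 16 \\ 16 & 16 \end{bmatrix}$
$y_{0}^{T} s_{0} = [8, 16] \cdot [4, 4] = 32 + 64 = 96$
$y_{0} y_{0}^{T} = \begin{bmatrix} 8 \\ 16 \end{bmatrix} \begin{bmatrix} 8 & 16 \end{bmatrix} = \begin{bmatrix} 64 & 128 \\ 128 & 256 \end{bmatrix}$

Substituting: $B_{1} = I - \frac{1}{32} \begin{bmatrix} 16 & 16 \\ 16 & 16 \end{bmatrix} + \frac{1}{96} \begin{bmatrix} 64 & 128 \\ 128 & 256 \end{bmatrix}$

$B_{1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix} + \begin{bmatrix} 0.667 & 1.333 \\ 1.333 & 2.667 \end{bmatrix}$

$B_{1} = \begin{bmatrix} 1.167 & 0.833 \\ 0.833 & 3.167 \end{bmatrix}$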

(d) Second search direction

To find the second search direction, we need to solve the system $B_{1} p_{1} = - \nabla f (x_{1})$ :

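$\begin{bmatrix} 1.167 & 0.833 \\ 0.833 & 3.167 \end{bmatrix} p_{1} = - \begin{bmatrix} 4 \\ 12 \end{bmatrix}$

We can find $p_{1}$ by computing $B_{1}^{- 1}$ : $det (B_{1}) = 1.167 \cdot 3.167 - 0.833 \cdot 0.833 = 3.000$ (exactly $\frac{7}{6} \cdot \frac{19}{6} - \left( \frac{5}{6} \right)^{2} = 3$ )

$B_{1}^{- 1} = \frac{1}{3} \begin{bmatrix} 3.167 & - 0.833 \\ - 0.833 & 1.167 \end{bmatrix} = \begin{bmatrix} 1.056 & - 0.278 \\ - 0.278 & 0.389 \end{bmatrix}$

Therefore: $p_{1} = - B_{1}^{- 1} \nabla f (x_{1}) = - \begin{bmatrix} 1.056 & - 0.278 \\ - 0.278 & 0.389 \end{bmatrix} \begin{bmatrix} 4 \\ 12 \end{bmatrix}$

$p_{1} = - \begin{bmatrix} 4.222 - 3.333 \\ - 1.111 + 4.667 \end{bmatrix} = - \begin{bmatrix} 0.889 \\ 3.556 \end{bmatrix} = \begin{bmatrix} - 0.889 \\ - 3.556 \end{bmatrix}$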

This is the second search direction according to the BFGS method.
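The BFGS update and the resulting direction can be checked with a few lines of plain Python. This is a sketch for the 2x2 case only; the helper names (`matvec`, `outer`, `dot`, `bfgs_update`, `solve_2x2`) are invented for the example.

```python
def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]

def outer(u, v):
    return [[u[0] * v[0], u[0] * v[1]], [u[1] * v[0], u[1] * v[1]]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def bfgs_update(B, s, y):
    # B_new = B - (B s s^T B) / (s^T B s) + (y y^T) / (y^T s)
    Bs = matvec(B, s)
    term1 = outer(Bs, Bs)        # equals B s s^T B because B is symmetric
    term2 = outer(y, y)
    sBs, ys = dot(s, Bs), dot(y, s)
    return [[B[i][j] - term1[i][j] / sBs + term2[i][j] / ys
             for j in range(2)] for i in range(2)]

def solve_2x2(A, b):
    # Cramer's rule for A x = b.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

B0 = [[1.0, 0.0], [0.0, 1.0]]
s0 = [4.0, 4.0]                  # x1 - x0
y0 = [8.0, 16.0]                 # grad f(x1) - grad f(x0)
g1 = [4.0, 12.0]                 # grad f(x1)

B1 = bfgs_update(B0, s0, y0)
p1 = solve_2x2(B1, [-g1[0], -g1[1]])   # solve B1 p1 = -grad f(x1)
print(B1, p1)
```

Solving the linear system directly avoids forming $B_{1}^{- 1}$ and the rounding that comes with three-decimal intermediate values.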
