7  Optimizers

As we have seen in the previous section, the task of regression usually results in an optimization problem. It is worth investigating this further by looking closely at the problem \[ A x = b \tag{7.1}\] for different dimensions of \(A\).

We investigate the impact of constraining our solution not just by Equation 7.1 but additionally by penalties in the \(\ell_1\) and \(\ell_2\) norm imposed on the solution \(x\). As we have seen in Figure 5.1, the choice of norm has an impact on the result, and the same is true here.

Important

This section is mainly based on Brunton and Kutz (2022), Section 4.3.

7.1 Over-Determined Systems

We speak of an over-determined system if we have more rows than columns, i.e. \(A\) is tall and skinny. In general there is no exact solution to Equation 7.1; instead we minimize the error with respect to a norm, see Section 5.1. If we further impose a restriction on \(x\) we can select a more specific solution.
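Before we add such restrictions, recall that the plain minimization of \(\|Ax - b\|_2\) can be solved directly with NumPy. The following is a minimal sketch for an arbitrary random system (the dimensions are chosen for illustration only):

import numpy as np

np.random.seed(6020)

m, n = 500, 100              # more constraints than unknowns: A is tall and skinny
A = np.random.rand(m, n)
b = np.random.rand(m)

# Least-squares solution minimizing ||A x - b||_2,
# equivalent to np.linalg.pinv(A) @ b
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(x.shape, residuals)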

The generalized form is \[ x = \underset{v}{\operatorname{argmin}} \|Av - b\|_2 + \lambda_1 \|v\|_1 + \lambda_2\|v\|_2, \tag{7.2}\] where the parameters \(\lambda_1\) and \(\lambda_2\) are called the penalization coefficients of the respective norms. Selecting these coefficients is the first step towards model selection.

Let us have a look at this in action by solving a random system for different values of the parameters \(\lambda_1\) and \(\lambda_2\).

7.1.1 LASSO

The least absolute shrinkage and selection operator (LASSO) solves Equation 7.2 with \(\lambda_1 > 0\) and \(\lambda_2=0\), i.e. only the \(\ell_1\) norm is used for the penalization. The theory tells us that for increasing \(\lambda_1\) we should get more and more zeros in our solution \(x\), i.e. a sparser solution.

Show the code for the figure
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import minimize
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

m = 500
n = 100
A = np.random.rand(m, n)
b = np.random.rand(m)
x0 = np.linalg.pinv(A) @ b  # least-squares solution as starting point

# Objective of Equation 7.2 with a single penalty term:
# residual measured in norm[0], penalty on x measured in norm[1]
optimize = lambda x, A, b, lam, norm: np.linalg.norm(A @ x - b, ord=norm[0]) +\
    lam * np.linalg.norm(x, ord=norm[1])

fig = plt.figure()
axs = []
axs.append(fig.add_subplot(4, 1, 1))
axs.append(fig.add_subplot(4, 3, 10))
axs.append(fig.add_subplot(4, 1, 2))
axs.append(fig.add_subplot(4, 3, 11))
axs.append(fig.add_subplot(4, 1, 3))
axs.append(fig.add_subplot(4, 3, 12))

for i, lam in enumerate([0, 0.1, 0.5]):
    # norm=[2, 1]: l2 norm for the residual, l1 norm for the penalty (LASSO)
    res = minimize(optimize, args=(A, b, lam, [2, 1]), x0=x0)
    axs[i * 2].bar(range(n), res.x)
    axs[i * 2].text(5, 0.05, rf"$\lambda_1={lam}$")
    axs[i * 2].set_xlim(0, 100)
    axs[i * 2].set_ylim(-0.1, 0.1)
    axs[i * 2 + 1].hist(res.x, 20)
    axs[i * 2 + 1].text(-0.08, 50, rf"$\lambda_1={lam}$")
    axs[i * 2 + 1].set_xlim(-0.1, 0.1)
    axs[i * 2 + 1].set_ylim(0, 70)
axs[0].set_xticks([])
axs[2].set_xticks([])
axs[3].set_yticks([])
axs[5].set_yticks([])


plt.subplots_adjust(top = 0.99, bottom=0.1, hspace=0.35, wspace=0.2)
plt.show()
Figure 7.1: LASSO regression coefficients of an over-determined system with 500 constraints and 100 unknowns. The top three panels show the values of the solution and the bottom three a histogram of the corresponding solution; the \(\lambda_1\) label maps one to the other.

The last row of Figure 7.1 confirms this quite impressively; interestingly enough, the solution also becomes positive.
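This sparsity can also be cross-checked with a dedicated solver. Below is a minimal sketch using scikit-learn's Lasso (an assumption, the package is not used elsewhere in this text); note that scikit-learn squares the residual and scales it by the number of samples, so its alpha is not directly comparable to our \(\lambda_1\).

import numpy as np
from sklearn.linear_model import Lasso

np.random.seed(6020)
A = np.random.rand(500, 100)
b = np.random.rand(500)

for alpha in [0.001, 0.01, 0.1]:
    # Lasso minimizes (1 / (2 * n_samples)) * ||b - A w||_2^2 + alpha * ||w||_1
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000)
    model.fit(A, b)
    n_zero = np.sum(np.isclose(model.coef_, 0.0))
    print(f"alpha={alpha}: {n_zero} of {model.coef_.size} coefficients are zero")

As expected, the number of zero coefficients grows with the strength of the \(\ell_1\) penalization.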

7.1.2 Ridge

Ridge regression solves Equation 7.2 with \(\lambda_1 = 0\) and \(\lambda_2 > 0\), i.e. only the \(\ell_2\) norm is used for the penalization. The theory tells us that for increasing \(\lambda_2\) the entries of our solution \(x\) shrink towards zero, but in contrast to LASSO they do not become exactly zero, i.e. no sparsity is introduced.

Show the code for the figure
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import minimize
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

m = 500
n = 100
A = np.random.rand(m, n)
b = np.random.rand(m)
x0 = np.linalg.pinv(A) @ b  # least-squares solution as starting point

# Objective of Equation 7.2 with a single penalty term:
# residual measured in norm[0], penalty on x measured in norm[1]
optimize = lambda x, A, b, lam, norm: np.linalg.norm(A @ x - b, ord=norm[0]) +\
    lam * np.linalg.norm(x, ord=norm[1])

fig = plt.figure()
axs = []
axs.append(fig.add_subplot(4, 1, 1))
axs.append(fig.add_subplot(4, 3, 10))
axs.append(fig.add_subplot(4, 1, 2))
axs.append(fig.add_subplot(4, 3, 11))
axs.append(fig.add_subplot(4, 1, 3))
axs.append(fig.add_subplot(4, 3, 12))

for i, lam in enumerate([0, 0.1, 0.5]):
    # norm=[2, 2]: l2 norm for the residual, l2 norm for the penalty (ridge)
    res = minimize(optimize, args=(A, b, lam, [2, 2]), x0=x0)
    axs[i * 2].bar(range(n), res.x)
    axs[i * 2].text(5, 0.05, rf"$\lambda_2={lam}$")
    axs[i * 2].set_xlim(0, 100)
    axs[i * 2].set_ylim(-0.1, 0.1)
    axs[i * 2 + 1].hist(res.x, 20)
    axs[i * 2 + 1].text(-0.08, 15, rf"$\lambda_2={lam}$")
    axs[i * 2 + 1].set_xlim(-0.1, 0.1)
    axs[i * 2 + 1].set_ylim(0, 20)
axs[0].set_xticks([])
axs[2].set_xticks([])
axs[3].set_yticks([])
axs[5].set_yticks([])


plt.subplots_adjust(top = 0.99, bottom=0.1, hspace=0.35, wspace=0.2)
plt.show()
Figure 7.2: Ridge regression coefficients of an over-determined system with 500 constraints and 100 unknowns. The top three panels show the values of the solution and the bottom three a histogram of the corresponding solution; the \(\lambda_2\) label maps one to the other.
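If both the residual and the penalty are squared (a common variant of Equation 7.2), ridge regression even has the closed-form solution \(x = (A^T A + \lambda_2 I)^{-1} A^T b\). A minimal sketch of this variant for an arbitrary random system; the \(\ell_2\) norm of the solution decreases monotonically with \(\lambda_2\), while in general no entries become exactly zero:

import numpy as np

np.random.seed(6020)
m, n = 500, 100
A = np.random.rand(m, n)
b = np.random.rand(m)

for lam in [0, 0.1, 0.5]:
    # Closed-form solution of min_x ||A x - b||_2^2 + lam * ||x||_2^2
    x = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
    print(f"lambda_2={lam}: ||x||_2 = {np.linalg.norm(x):.6f}, "
          f"zeros: {np.sum(np.isclose(x, 0.0))}")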

Exercise 7.1 (Implement the above optimization yourself.) Fill in the missing parts:

Code fragment for implementation.
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import minimize
np.random.seed(6020)

m = 500
n = 100
A = # Random matrix with m rows and n columns
b = np.random.rand(m)
x0 = # Optimal 2 norm solution without penalization as initial start

optimize = # function to optimize

fig = plt.figure()
axs = []
axs.append(fig.add_subplot(4, 1, 1))
axs.append(fig.add_subplot(4, 3, 10))
axs.append(fig.add_subplot(4, 1, 2))
axs.append(fig.add_subplot(4, 3, 11))
axs.append(fig.add_subplot(4, 1, 3))
axs.append(fig.add_subplot(4, 3, 12))

for i, lam in enumerate([0, 0.1, 0.5]):
    res = minimize(optimize, args=, x0=x0)    # use your correct arguments here
    axs[i * 2].bar(range(n), res.x)
    axs[i * 2].text(5, 0.05, rf"$\lambda={lam}$")
    axs[i * 2].set_xlim(0, 100)
    axs[i * 2].set_ylim(-0.1, 0.1)
    axs[i * 2 + 1].hist(res.x, 20)
    axs[i * 2 + 1].text(-0.08, 15, rf"$\lambda={lam}$")
    axs[i * 2 + 1].set_xlim(-0.1, 0.1)
    axs[i * 2 + 1].set_ylim(0, 20)
axs[0].set_xticks([])
axs[2].set_xticks([])
axs[3].set_yticks([])
axs[5].set_yticks([])

plt.subplots_adjust(top = 0.99, bottom=0.1, hspace=0.35, wspace=0.2)
plt.show()

7.2 Model Selection/Identification and Over-/Underfitting

Let us use the results we have obtained so far for a discussion of model selection.

So far, we have mostly explicitly proposed a model that we think will fit our data, and we have seen that even in this case we can still choose multiple parameters to fine-tune our selection.

Now consider the other possibility: we have data but the model is unknown. For example, in Example 5.1 we stopped at degree 2 for our polynomial because we know about Newton's principles; if we did not, we might extend the model to a higher degree.

One of the leading assumptions to use in such a case is:

Among competing hypotheses, the one with the fewest assumptions should be selected, or when you have two competing theories that make exactly the same predictions, the simpler one is the more likely. - Occam’s razor

This principle plays an intimate role in the over- and underfitting of models. To illustrate this, we recall Example 5.2 with Figure 5.2, shown below once more.

Figure 7.3: Fitting for different degrees of polynomial \(m\)

For \(m=1\), a straight line, we have an underfitted model. We cannot adequately capture the underlying model, at least not in the entire region.

If we move to \(m=16\) and the extreme \(m=300\) we see an overfitted system. The \(m=16\) curve follows clusters of points too closely, e.g. in the region around \(x=-2\). This is even more pronounced for \(m=300\), where the curve quite often follows our observations closely but clearly overshoots between them.

In this sense we can also say that an overfitted system follows the training set too closely and will not generalize well to another testing/evaluation set.
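A minimal sketch of this effect on synthetic data (the quadratic ground truth, the noise level, the withheld points, and the degrees, which are smaller than in the figure, are assumptions made purely for illustration): the training error keeps decreasing with the degree \(m\), while the error on the withheld points typically stagnates or grows again.

import numpy as np

np.random.seed(6020)

# Noisy samples of an assumed quadratic ground truth
x = np.linspace(-4, 4, 40)
y = 0.5 * x**2 + np.random.normal(scale=1.0, size=x.size)

# Withhold every fourth point as a testing/evaluation set
test = np.arange(x.size) % 4 == 0
train = ~test

for m in [1, 2, 6, 12]:
    coeffs = np.polyfit(x[train], y[train], deg=m)
    err_train = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    err_test = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"m={m:2d}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")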

As a consequence, model selection should always be followed by cross-validation, meaning we need to check whether our model is actually any good.

A classic method is k-fold cross-validation:

Take random portions of your data and build a model. Do this \(k\) times and average the parameter scores (regression loadings) to produce a cross-validated model. Test the model predictions against withheld (extrapolation) data and evaluate whether the model is actually any good. - (see Brunton and Kutz 2022, 159)
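A minimal NumPy sketch of this procedure for the polynomial fit above (the ground truth, fold count, degree, and withheld region are assumptions chosen for illustration):

import numpy as np

np.random.seed(6020)

# Same assumed quadratic ground truth as above;
# withhold the right edge as extrapolation data
x = np.linspace(-4, 4, 50)
y = 0.5 * x**2 + np.random.normal(scale=1.0, size=x.size)
held_out = x > 3
x_train, y_train = x[~held_out], y[~held_out]

k, m = 5, 2  # number of folds and polynomial degree
folds = np.array_split(np.random.permutation(x_train.size), k)

# Build k models on random portions of the data and average their loadings
coeffs = np.mean(
    [np.polyfit(x_train[idx], y_train[idx], deg=m) for idx in folds], axis=0
)

# Test the averaged model against the withheld (extrapolation) data
mse = np.mean((np.polyval(coeffs, x[held_out]) - y[held_out]) ** 2)
print(f"cross-validated degree-{m} model, extrapolation MSE: {mse:.3f}")

Averaging the loadings across the folds follows the recipe quoted above; another common variant keeps each fold's model and averages the prediction errors on the left-out folds instead.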

As we can see, there are many further paths to investigate, but for now this concludes our excursion into regression.