This section is mainly based on Brunton and Kutz (2022), Section 4.2.
We extend our theory of curve fitting to a non-linear function \(f(X, c)\) with coefficients \(c\in\mathbb{R}^m\) and our \(n\) data points \(X_{k-}\). We assume that \(m < n\).
If we define our error depending on \(c\) as the sum of squared residuals \[
E_2(c) = \sum_{k=1}^n\left(f(X_{k-}, c) - y_k\right)^2,
\] we can minimize it with respect to each \(c_j\), resulting in the \(m \times m\) system \[
\sum_k (f(X_{k-}, c) - y_k)\frac{\partial f}{\partial c_j} = 0\quad\text{for}\quad j=1, 2, \ldots, m.
\]
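To make this concrete, consider as an illustration (not an example from the source) the exponential model \(f(X_{k-}, c) = c_1 e^{c_2 X_{k-}}\) with \(m = 2\). The two conditions read \[
\begin{aligned}
\sum_{k=1}^n \left(c_1 e^{c_2 X_{k-}} - y_k\right) e^{c_2 X_{k-}} &= 0, \\
\sum_{k=1}^n \left(c_1 e^{c_2 X_{k-}} - y_k\right) c_1 X_{k-}\, e^{c_2 X_{k-}} &= 0,
\end{aligned}
\] a \(2 \times 2\) system that is non-linear in \(c_2\) and therefore has no closed-form solution in general.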
Depending on the properties of the function at hand, finding an extremum may or may not be guaranteed. For example, convex functions come with convergence guarantees, while non-convex functions can have challenging features (such as multiple local minima) that make them hard to handle for optimization algorithms.
To solve such a system we employ iterative solvers that start from an initial guess. Let us look at the most common one, the gradient descent method.
6.1 Gradient Descent
For a higher-dimensional function \(f\) the gradient must vanish, \[
\nabla f(x) = 0
\] for a point to be an extremum. Since saddle points also satisfy this condition, it is not a sufficient criterion but a necessary one. Gradient descent, as the name suggests, uses the gradient as the direction in an iterative algorithm to find a minimum.
The idea is simple: if you are lost on a mountain in the fog and cannot see the path, a fast and reliable way down that only uses local information is to follow the steepest slope.
Warning
A function does not necessarily experience gravity in the same way as we do, so please do not try this in real life; cliffs tend to be hard to walk down.
We express this algorithm in terms of the iterates \(x^{(k)}\), our successive guesses for the minimum, with the update \[
x^{(k+1)} = x^{(k)} - \delta\, \nabla f(x^{(k)})
\] where the parameter \(\delta\) determines how far along the gradient descent curve we move. The update has the structure of a Newton-type iteration in which the gradient takes the role of the update function. This leaves us with the problem of finding an algorithm to determine \(\delta\).
Again, we can view this as an optimization problem for a new function \[
F(\delta) = f(x^{(k+1)}(\delta))
\] and, applying the chain rule to \(F\) with the update formula for \(x^{(k+1)}\), \[
\partial_\delta F = -\nabla f(x^{(k+1)}) \cdot \nabla f(x^{(k)}) = 0.
\tag{6.1}\]
The interpretation of Equation 6.1 is that the gradient of the next step should be orthogonal to the gradient of the current step.
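For a quadratic function (an illustration, not an example from the source) this condition can even be solved in closed form: with \(f(x) = \tfrac12 x^\top A x - b^\top x\), \(A\) symmetric positive definite, and \(r = \nabla f(x^{(k)}) = A x^{(k)} - b\), Equation 6.1 gives \[
\delta = \frac{r^\top r}{r^\top A r}.
\]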
For a general non-linear function, however, \(\delta\) is tricky to compute, so we do not update it in every step but prescribe a fixed value. This will not give us the optimal convergence (if the method converges at all), but with a well-chosen value we still obtain convergence.
Note
In machine learning the parameter \(\delta\) is often called the learning rate.
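To illustrate the method, here is a minimal sketch of a line fit with gradient descent in Python. The data, the model \(f(x, c) = c_1 + c_2 x\), and the iteration count are assumptions for illustration and not the code behind Figure 6.2; for this model the gradient of \(E_2\) is \(2 X^\top (X c - y)\).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 10 noisy samples of the line y = 0.5 + 1.5 x (illustrative only).
x = np.linspace(0, 5, 10)
y = 0.5 + 1.5 * x + rng.normal(scale=0.5, size=x.size)

# Design matrix with a constant column for the offset.
X = np.column_stack([np.ones_like(x), x])

def gradient(c, X, y):
    # Gradient of E_2(c) = sum_k (f(X_k, c) - y_k)^2 for the linear model X c.
    return 2.0 * X.T @ (X @ c - y)

c = np.zeros(2)        # initial guess
delta = 2e-3           # prescribed step size (learning rate)
for _ in range(1000):  # fixed number of iterations
    c = c - delta * gradient(c, X, y)

print(c)  # roughly recovers the offset 0.5 and slope 1.5
```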
Figure 6.2: Line fit with gradient descent for different numbers of iterations and a learning rate of 2e-3.
The above algorithm uses the entire set \(X\) for the computation. For a large enough set \(X\) this is quite costly, even if it is still cheaper than computing the pseudo-inverse \(X^\dagger\).
6.2 Stochastic Gradient Descent
To reduce the cost we can randomly select some points of our training set and train only with those. Obviously the computation of the gradient becomes much faster. We call this method Stochastic Gradient Descent (SGD).
In Figure 6.3 we see the convergence for randomly selecting 1, 3, and 6 indices of our possible 10.
The downside of SGD is that the iterates take a long time to settle down and keep jumping around. On the other hand, the method does not get stuck in local minima as often.
One possibility to get the strengths of both is to use SGD to obtain a good initial guess and plain gradient descent (GD) for the fine tuning.
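Continuing the sketch from Section 6.1 (it reuses `gradient`, `X`, `y`, and `rng` from there; the batch size and iteration counts are illustrative), the stochastic variant and the combined strategy could look like this:

```python
def sgd(c, X, y, batch_size, iterations, delta):
    # Stochastic gradient descent: each update only sees a random
    # subset of the data, so the gradient is cheap but noisy.
    c = c.copy()
    for _ in range(iterations):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        c = c - delta * gradient(c, X[idx], y[idx])
    return c

# Rough initial guess with SGD, then fine tuning with full gradient descent.
c = sgd(np.zeros(2), X, y, batch_size=3, iterations=200, delta=2e-3)
for _ in range(100):
    c = c - 2e-3 * gradient(c, X, y)
```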
Figure 6.3: Line fit with stochastic gradient descent with 1 or 3 samples and 200 iterations, as well as the 3-sample version as initial guess for GD with 100 iterations.
6.3 Categorical Variables
Even with our excursion into non-linear regression we still had rather regular data to work with. This is not always the case. Sometimes there are trends in the data, for example per month or per day. The inclusion of categorical variables can help to control for such trends.
We can integrate such variables into the regression by adding a column to the matrix \(X\) for each of the categories. Note that these columns take over the role of the offset (the constant \(1\)), so the offset column can be omitted and each category gets its own offset.
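As a sketch of how such a design matrix could be assembled, assume monthly observations with a simple linear trend (the variable names and the trend column are illustrative assumptions, not the exact preprocessing behind Figure 6.4):

```python
import numpy as np

# Hypothetical monthly data: a time index t and the month (1-12) of each observation.
t = np.arange(96)              # e.g. 8 years of monthly observations
month = (t % 12) + 1

# One column per month category (one-hot encoding). These columns take over
# the role of the offset, so the constant column of ones is omitted.
dummies = (month[:, None] == np.arange(1, 13)[None, :]).astype(float)

# Design matrix: linear trend plus twelve categorical columns.
X = np.column_stack([t, dummies])
# Given observations y, the coefficients follow from least squares,
# e.g. np.linalg.lstsq(X, y, rcond=None).
```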
We can see this in action in the following example, where we investigate unemployment data from Austria. There is a strong seasonality in the data, see Figure 6.4 (b). This is largely because the Austrian job market has a large touristic sector with its seasons and because the construction industry employs fewer people during the winter months.
In the regression in Figure 6.4 (a) we can see that this captures the seasonal change quite well.
(a) Regression with categorical variables per month.
(b) Seasonality of the unemployment average over the years.
Figure 6.4: Unemployment data from Austria for the years 2010 to 2017.
Note
There is also quite a difference between men and women that could be captured by a separate categorical variable.
We wrap up this section about regression by talking more abstractly about the regression of linear systems and by sharing some general thoughts on model selection and its consequences.
Brunton, Steven L., and J. Nathan Kutz. 2022. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. 2nd ed. Cambridge: Cambridge University Press.