Neural Networks and Deep Learning

Neural Networks (NNs) are inspired by the work of Hubel and Wiesel (1962) on the visual cortex of cats, which showed how hierarchical layers of cells process visual stimuli. This insight formed the basis for an early mathematical model of a NN (Fukushima 1980) that in modern terminology would best be described as a deep convolutional neural network (DCNN): it combines a multi-layer structure, convolution, max pooling, and nonlinear dynamical nodes.

The current high interest in NNs was arguably kickstarted by Krizhevsky, Sutskever, and Hinton (2012) and their deep network trained on ImageNet, a dataset of roughly 15 million labelled high-resolution images in over 22 000 categories. This transformed the field of computer vision for classification and identification tasks. All of this was made possible by the large labelled dataset and the available compute power1.

At its core, a neural network consists of multiple layers of interconnected nodes (or neurons); some layers have specific names, such as the input, hidden, and output layers.

Figure 1: Illustration of a generic Neural Network. On the left in green we have the input, on the right in purple the output. In between we have multiple layers of various sizes. The connections (mappings) between the neurons can be chosen freely; the layers do not have to be fully interconnected. We highlight some neurons and their connections simply for reference.

Similar to our view of the brain, the neurons process information and make decisions based on this information and their state (weights, biases, activation functions, etc.). Given enough data, computing power, the correct architecture, and hyperparameters, they are universal function approximators. They can learn complex relationships between the input and the output without requiring any explicit rules or programming, which makes them ideal for pattern recognition, data classification, and regression analysis.

Following the discussion in (Brunton and Kutz 2022, chap. 6), mathematically speaking the main task in NNs lies in optimization. Specifically, training a NN amounts to solving the composite optimization problem \[ \underset{A_j}{\operatorname{arg min}}\, \big( f_m(A_m, f_{m-1}(A_{m-1}, \cdots, f_2(A_2, f_1(A_1, x)))) + \lambda g(A_j) \big) \]

where \(x\) denotes the input, the matrix \(A_k\) denotes the weights connecting layer \(k\) with layer \(k+1\), \(\lambda\, g(A_j)\) is a regularization term (with regularization parameter \(\lambda\) and penalty function \(g\)), and \(f_k\) is the activation function of layer \(k\).

In the context of deep learning models this is often written in the compact and generic form \[ \underset{\Theta}{\operatorname{arg min}}\, f_\Theta(x) \] where \(\Theta\) denotes the weights of the network and \(f\) encodes the characteristics of the network (layers, structure, activation functions, etc.).
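To make the composite structure concrete, here is a minimal sketch of such a forward evaluation in NumPy; the layer sizes, the random weights, and the choice of ReLU as the activation functions \(f_k\) are illustrative assumptions, not part of the formula above.

import numpy as np

rng = np.random.default_rng(6020)


def relu(z):
    # One possible choice for the activation functions f_k.
    return np.maximum(z, 0)


# Weight matrices A_1, ..., A_m for layer sizes 1024 -> 64 -> 16 -> 1
# (scaled random initialization).
sizes = [1024, 64, 16, 1]
A = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
     for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def f_theta(x, A):
    # Nested evaluation f_m(A_m, ..., f_2(A_2, f_1(A_1, x))).
    for A_k in A:
        x = relu(A_k @ x)
    return x


x = rng.standard_normal(1024)  # one input sample
print(f_theta(x, A))           # the network output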

Training the network uses the labelled data to find the weights such that the labels are best recovered from the observations (input), i.e. such that each \(x_j\) is mapped to its \(y_j\). This is typically done via backpropagation combined with stochastic gradient descent.
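To illustrate the idea of stochastic gradient descent in isolation, the following hand-rolled sketch fits a plain linear model by updating the weights one randomly chosen sample at a time; the data, learning rate, and number of steps are arbitrary choices, and the gradient is written out directly instead of being obtained by backpropagation.

import numpy as np

rng = np.random.default_rng(6020)
X = rng.standard_normal((200, 5))        # 200 observations x_j with 5 features
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true                           # labels y_j (noiseless linear target)

w = np.zeros(5)                          # initial weights
lr = 0.01                                # learning rate (step size)
for step in range(2000):
    j = rng.integers(len(y))             # stochastic: one random sample per update
    grad = 2 * (X[j] @ w - y[j]) * X[j]  # gradient of the squared error for sample j
    w -= lr * grad

print(np.round(w, 2))                    # should be close to w_true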

Let us try to demystify the mathematics behind NNs.

The main building block is the so-called perceptron (a single neuron, see Figure 2), which takes its set of inputs, multiplies each by a weight, sums them up, adds a bias term, and applies an activation function to produce an output.

Figure 2: Basic building block perceptron with its input, weights, bias, and activation function.
Note

A perceptron is the equivalent of the scikit-learn classifier SGDClassifier with loss="perceptron".
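A minimal usage sketch of this equivalence on synthetic, linearly separable data (the point clouds and all parameters here are made up for illustration; the cats-and-dogs images are only loaded further below):

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(6020)
# Two linearly separable point clouds as stand-in data.
X = np.vstack((rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))))
y = np.repeat([-1, 1], 50)

clf = linear_model.SGDClassifier(loss="perceptron", random_state=6020).fit(X, y)
print(clf.coef_, clf.intercept_)  # the learned weights w and bias b of Figure 2
print(clf.score(X, y))            # training accuracy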

For multiple perceptrons in one layer we get a weight vector for each of them, and these are combined into the matrix \(A\), see Figure 3.

Figure 3: First weight matrix for a generic NN; where no connection is present, a 0 is inserted.

If we extend again to a multilayer NN with multiple neurons in each layer, we can simplify matters by restricting ourselves for a moment to a linear NN without a bias. In this case the functions \(f_k\) are all linear maps (in fact, as we will see shortly, the identity) and the composition reduces to the matrix product \[ y = A_m A_{m-1} \cdots A_2 A_1 x. \]
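A quick numerical check of this collapse, with three random weight matrices whose sizes are chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(6020)
# Three linear layers: 1024 -> 32 -> 8 -> 1.
A1 = rng.standard_normal((32, 1024))
A2 = rng.standard_normal((8, 32))
A3 = rng.standard_normal((1, 8))
x = rng.standard_normal(1024)

y_layerwise = A3 @ (A2 @ (A1 @ x))            # evaluate layer by layer
y_collapsed = (A3 @ A2 @ A1) @ x              # collapse into a single matrix first
print(np.allclose(y_layerwise, y_collapsed))  # True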

The matrices might be sparse or dense, depending on the connections they describe. The resulting system is most likely under-determined and requires some constraints to select a unique solution, see Kandolf (2025).
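The under-determinacy is easy to demonstrate numerically; in the following sketch the sizes mimic the example below (80 labelled samples but 1024 unknown weights) while the data itself is random and purely illustrative:

import numpy as np

rng = np.random.default_rng(6020)
X = rng.standard_normal((80, 1024))        # 80 samples (rows), 1024 features
y = rng.choice([-1.0, 1.0], size=80)       # random labels

A, *_ = np.linalg.lstsq(X, y, rcond=None)  # one particular (minimum-norm) solution
print(np.linalg.norm(X @ A - y))           # ~0: the labels are reproduced exactly

# Any vector from the null space of X can be added without changing the fit.
null_vec = np.linalg.svd(X)[2][-1]         # right-singular vector with zero singular value
print(np.linalg.norm(X @ (A + null_vec) - y))  # still ~0: the solution is not unique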

To make this more hands-on, let us consider a one-layer NN for our cats and dogs example with the structure illustrated in Figure 4.

Figure 4: A single layer structure for the cats vs. dogs classification.

Mathematically, this becomes \[ y = A x, \] which connects back to the perceptron via \(A = w^\mathsf{T}\).

One possible solution is via the Moore-Penrose pseudo inverse2, \(A = Y X^\dagger\), where all our training data is combined into \(X\) and \(Y\) holds the corresponding labels. This, in turn, tells us that a single-layer linear NN can be used to build a least-squares fit.
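On a tiny synthetic system (sizes and data chosen only for illustration; the full cats-and-dogs version follows in the code below), this reads:

import numpy as np

rng = np.random.default_rng(6020)
X = rng.standard_normal((3, 10))   # 10 training samples with 3 features each (as columns)
Y = rng.standard_normal((1, 10))   # one label per sample

A = Y @ np.linalg.pinv(X)          # least-squares fit via the pseudo inverse
print(A.shape)                     # (1, 3): one weight per feature
print(np.linalg.norm(Y - A @ X))   # residual of the least-squares fit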

Of course, there are other ways to solve this system (as discussed in Kandolf (2025)); e.g. the LASSO or RIDGE methods can be used to promote certain properties of the matrix \(A\).

Show the code for the figure
import numpy as np
import scipy.io
import requests
import io
from sklearn import linear_model
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]


def myplot(y):
    # Bar plot of the predicted labels: +1 = dog, -1 = cat.
    plt.figure()
    n = y.shape[0]
    plt.bar(range(n), y)
    # Zero line and the boundary between the dog and cat test samples.
    plt.plot([-0.5, n - 0.5], [0, 0], "k", linewidth=1.0)
    plt.plot([n // 2 - 0.5, n // 2 - 0.5], [-1.1, 1.1], "r-.", linewidth=3)
    plt.yticks([-0.5, 0.5], ["cats", "dogs"], rotation=90, va="center")
    plt.text(n // 4, 1.05, "dogs")
    plt.text(n // 4 * 3, 1.05, "cats")
    plt.gca().set_aspect(n / (2 * 3))


# Download the wavelet-transformed cat and dog images (one image per column).
response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/catData_w.mat")
cats_w = scipy.io.loadmat(io.BytesIO(response.content))["cat_wave"]

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/dogData_w.mat")
dogs_w = scipy.io.loadmat(io.BytesIO(response.content))["dog_wave"]

# Use the first s images of each class for training, the rest for testing.
s = 40
X_train = np.concatenate((dogs_w[:, :s], cats_w[:, :s]), axis=1).T
y_train = np.repeat(np.array([1, -1]), s)  # labels: +1 = dog, -1 = cat
X_test = np.concatenate((dogs_w[:, s:], cats_w[:, s:]), axis=1).T
y_test = np.repeat(np.array([1, -1]), 80 - s)

# Least-squares fit via the Moore-Penrose pseudo inverse: A = Y X^+.
A = y_train @ np.linalg.pinv(X_train.T)
y_test_pinv = np.sign(A @ X_test.T)
myplot(y_test_pinv)

# Same result as the above computation via the scikit-learn functions.
lsq = linear_model.LinearRegression(fit_intercept=False).fit(X_train, y_train)
y_test_lsq = np.sign(lsq.predict(X_test))
np.testing.assert_allclose(A, lsq.coef_)
#myplot(y_test_lsq)

# RIDGE solution via scikit-learn for comparison.
ridge = linear_model.Ridge(random_state=6020).fit(X_train, y_train)
y_test_ridge = np.sign(ridge.predict(X_test))
myplot(y_test_ridge)
plt.show()
(a) Pseudo inverse implemented by hand.
(b) RIDGE solution via scikit-learn.
Figure 5: Linear single layer neural network for different solutions of the resulting system.

Note that the matrix \(A\) (in fact just a row vector) compresses our input of 1024 pixels into a single value. Depending on the method, we can promote sparsity (not all inputs are used for the output) or other properties.
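To see the sparsity promotion in action, a LASSO fit can be swapped in for the pseudo inverse or RIDGE; the sketch below reuses X_train, y_train, X_test, and myplot from the code above, and the regularization strength alpha (as well as max_iter) is an illustrative choice, not a tuned value.

# LASSO drives many entries of A to exactly zero, so only a few of the
# 1024 pixels contribute to the decision.
lasso = linear_model.Lasso(alpha=0.01, max_iter=10_000).fit(X_train, y_train)
y_test_lasso = np.sign(lasso.predict(X_test))
print(np.count_nonzero(lasso.coef_), "of", lasso.coef_.size, "weights are nonzero")
myplot(y_test_lasso)
plt.show()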

Nevertheless, with linear functions we can only cover a limited range of models, so the extension to arbitrary (nonlinear) activation functions is a natural progression.
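As a first taste of that extension, the following sketch trains a small nonlinear network on the same data via scikit-learn; it again reuses X_train, y_train, X_test, and y_test from the code above, and the hidden layer size, the tanh activation, and the iteration limit are illustrative choices.

from sklearn.neural_network import MLPClassifier

# A small NN with one hidden layer and a nonlinear (tanh) activation,
# trained by backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                    max_iter=2000, random_state=6020)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))  # classification accuracy on the test images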


  1. As a reference: The IBM Personal Computer (PC) was introduced in 1981, one year after Fukushima's network.↩︎

  2. See (Kandolf 2025, Definition 4.3) or access it directly via the link.↩︎