We use compression extensively in everyday life. Nevertheless, it almost always requires us to first acquire the high resolution data or measurement and then compress it down. Here we discuss the reverse, where we collect relatively few compressed or random measurements and then infer a sparse representation.
Until recent developments this reconstruction was an NP-hard problem, but now there are convex algorithms that recover the full signal with high probability.
Definition 13.1 (Compressed Sensing) For a signal \(x \in \mathbb{R}^n\) that is \(K\)-sparse in a basis \(\mathcal{B}\) with matrix \(B\), the classical approach takes all \(n\) measurements and then compresses them down to \(K\) coefficients.
In compressed sensing we instead collect only \(p\) randomly chosen or compressed measurements and solve for the \(K\) non-zero elements of \(s\), with \(K < p \ll n\).
We can collect \(p\) linear measurements of \(x\) with a measurement matrix \(C\in\mathbb{R}^{p\times n}\) and thus \[
y = C x.
\]
Relating this back to \(s\) and \(\mathcal{B}\) we get \[
y = C x = C B s = D s,
\] as an under-determined system. To get the optimal solution, i.e. the sparsest \(s\) (smallest \(K\)), we solve \[
\check{s} = \operatorname{argmin}_s \| s \|_0, \quad \text{subject to} \quad y = C B s.
\] Here \(\|\cdot\|_0\) is the \(\ell_0\) pseudo-norm that counts the number of non-zero elements.
As we will see, the selection of the matrix \(C\) is important: it must allow an optimization that does not require a brute-force search for \(\check{s}\). Such a brute-force search is combinatorial in \(n\) and \(K\) for a fixed (known) \(K\), and significantly larger for unknown \(K\).
Under certain conditions on \(C\) we can relax the problem to a convex \(\ell_1\)-minimization \[
\check{s} = \operatorname{argmin}_s \| s \|_1, \quad \text{subject to} \quad y = C B s.
\tag{13.1}\]
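To make Equation 13.1 concrete, here is a minimal sketch that solves the \(\ell_1\) problem as a linear program with scipy; the sizes \(n\), \(p\), \(K\) and the Gaussian matrix standing in for \(CB\) are made up for the illustration.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6020)

# Illustrative sizes (assumptions, not from the text): n unknowns, p measurements.
n, p, K = 200, 40, 3
D = rng.standard_normal((p, n))  # plays the role of C B
s_true = np.zeros(n)
s_true[rng.choice(n, size=K, replace=False)] = rng.standard_normal(K)
y = D @ s_true

# l1 minimization as a linear program: write s = u - v with u, v >= 0,
# so that ||s||_1 = sum(u + v) and y = D s becomes y = D u - D v.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([D, -D]), b_eq=y, bounds=(0, None))
s_check = res.x[:n] - res.x[n:]

print("non-zero entries recovered:", np.sum(np.abs(s_check) > 1e-6))
print("max error:", np.max(np.abs(s_check - s_true)))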
In order for Equation 13.1 to converge with high probability we need the following assumptions to be met (a precise description will follow):
The measurement matrix \(C\) must be incoherent with respect to the sparsifying basis \(\mathcal{B}\), i.e. the rows of \(C\) do not correlate with the columns of \(B\).
The number of measurements \(p\) must be sufficiently large, on the order of \[
p \approx \mathcal{O}\left(K\log\left(\frac{n}{K}\right)\right) \approx k_1\, K \log\left(\frac{n}{K}\right),
\] where \(k_1\) depends on how incoherent \(C\) and \(B\) are.
Note
In Figure 12.1 we illustrated compressed sensing with a \(4016 \times 6016\) image. With our 5% of Fourier coefficients this puts the sparsity at \(K = 0.05 \times 4016 \times 6016 \approx 1\,208\,013\), and with \(k_1=3\) we would end up with roughly \(p \approx 3.6\) million measurements, about 15% of the original pixels.
The tricky part is how to select them, which is why compressed sensing is not used much in imaging, except for some cases like magnetic resonance imaging (MRI) for patients who cannot stay still or for whom sedation is risky.
The idea of the two conditions is to make \(CB\) act more or less like a unitary transformation on \(K\)-sparse vectors \(s\), preserving relative distances between vectors and allowing for the convex \(\ell_1\) optimization. We will see this in terms of the restricted isometry property (RIP).
We know of the Shannon-Nyquist sampling theorem:
If a function \(f(t)\) contains no frequencies higher than \(b\) hertz, then it can be completely determined from its ordinates at a sequence of points spaced less than \(\frac{1}{2b}\) seconds apart. - see Wikipedia
telling us that we need a sampling rate of at least twice the highest frequency to recover the signal exactly. However, this rate is only required for signals with truly broadband frequency content, which is hardly ever the case in practice. Since we expect that an uncompressed signal can be expressed as a sparse signal in the correct basis, we can relax the Shannon-Nyquist requirement. Nevertheless, we need precise timing for our measurements, and the recovery is not guaranteed; it is only possible with high probability.
Note
There are alternative formulations based on so-called greedy algorithms that determine the sparsity through an iterative matching pursuit, e.g. the compressive sampling matching pursuit algorithm, or CoSaMP for short.
A related convex formulation we have already seen and can be applied here is \[
\check{s} = \operatorname{argmin}_s \| CBs - y\|_2 + \lambda_1 \|s\|_1
\] compare Chapter 7, where \(\lambda_1\) is a weight for the importance of sparsity, see Figure 7.1.
Chapter 7 actually serves as our first example and shows neatly how the \(\ell_1\) norm promotes sparsity whereas the \(\ell_2\) solution stays dense. We can also use this example to give an insight into what with high probability means. In that example we created a random matrix with 5 times more columns than rows; unless we are very unlucky and the rows turn out to be linearly dependent, the under-determined system has infinitely many solutions, and this is exactly what happens with high probability.
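As a small illustration, the following sketch compares the minimum \(\ell_2\)-norm solution of a random under-determined system with an \(\ell_1\)-regularized fit via sklearn's Lasso. The sizes and the value of \(\lambda_1\) (called alpha in sklearn) are arbitrary choices for the illustration, not the ones from Chapter 7.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6020)

# Under-determined system: far more columns than rows (illustrative sizes).
p, n = 40, 200
D = rng.standard_normal((p, n))  # plays the role of C B
s_true = np.zeros(n)
s_true[[10, 120]] = [1.0, -2.0]
y = D @ s_true

# Minimum l2-norm solution via the pseudo-inverse: dense.
s_l2 = np.linalg.pinv(D) @ y

# l1-regularized least squares: sparse for a suitable lambda_1 (here a guess).
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(D, y)

print("non-zeros (l2 solution):", np.sum(np.abs(s_l2) > 1e-6))
print("non-zeros (l1 solution):", np.sum(np.abs(lasso.coef_) > 1e-6))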
Example 13.1 (Recovering an Audio Signal from Sparse Measurements) Let us consider a two-tone audio signal \[
f(t) = \cos(97 t\, 2\pi) + \cos(777 t\, 2\pi)
\] which is sparse in the frequency domain (transformation via FFT). The highest frequency is \(777 \mathrm{Hz}\) and therefore the Shannon-Nyquist theorem tells us we should sample with at least \(1554 \mathrm{Hz}\).
The sparsity of the signal allows us to reconstruct it by sampling randomly with an average sampling rate of \(128 \mathrm{Hz}\), well below the Shannon-Nyquist threshold.
The code relies on a CoSaMP (Needell and Tropp 2009) implementation provided on GitHub, included below so we can work with it.
import numpy as np


def cosamp(phi, u, s, epsilon=1e-10, max_iter=1000):
    """
    Return an `s`-sparse approximation of the target signal.

    Input:
        - phi, sampling matrix
        - u, noisy sample vector
        - s, sparsity
    """
    a = np.zeros(phi.shape[1])
    v = u
    it = 0  # count
    halt = False
    while not halt:
        it += 1

        y = np.dot(np.transpose(phi), v)
        # large components
        omega = np.argsort(y)[-(2 * s):]
        omega = np.union1d(omega, a.nonzero()[0])
        phiT = phi[:, omega]
        b = np.zeros(phi.shape[1])
        # Solve Least Square
        b[omega], _, _, _ = np.linalg.lstsq(phiT, u)
        # Get new estimate
        b[np.argsort(b)[:-s]] = 0
        a = b

        # Halt criterion
        v_old = v
        v = u - np.dot(phi, a)
        halt = (np.linalg.norm(v - v_old) < epsilon) or \
            np.linalg.norm(v) < epsilon or \
            it > max_iter
    return a
Show the code for the figure
import matplotlib.pyplot as plt
import numpy as np
from scipy.fftpack import dct, idct

%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

N = 4096
t = np.linspace(0, 1, N)
fun = lambda t: np.cos(97 * t * 2 * np.pi) + np.cos(777 * t * 2 * np.pi)
f = fun(t)
f_hat = np.fft.fft(f)
PSD = np.abs(f_hat)**2 / N

## Randomly sample signal
# num. random samples
p = N // 32
perm = np.floor(np.random.rand(p) * N).astype(int)
y = f[perm]

## Solve compressed sensing problem
# Build Psi
Psi = dct(np.identity(N))
# Measure rows of Psi
Theta = Psi[perm, :]
# CS via matching pursuit
s = cosamp(Theta, y, s=10, epsilon=1.e-10, max_iter=10)
frec = idct(s)

# computes the (fast) discrete fourier transform
frec_hat = np.fft.fft(frec, N)
# Power spectrum (how much power in each freq)
PSDrec = np.abs(frec_hat)**2 / N

time_window = np.array([512, 768]) / N
freq = np.arange(N)
L = int(np.floor(N / 2))

plt.figure()
plt.plot(t, f, "k")
plt.plot(perm / N, y, "rx", ms=12, mew=4)
plt.xlabel("Time [s]")
plt.xlim(time_window[0], time_window[1])
plt.ylim(-2, 2)
plt.gca().set_aspect(5e-3)

plt.figure()
plt.plot(t, frec, "r")
plt.xlim(time_window[0], time_window[1])
plt.xlabel("Time [s]")
plt.ylim(-2, 2)
plt.gca().set_aspect(5e-3)

plt.figure()
plt.plot(freq[:L], PSD[:L], "k", label="Original")
plt.plot(freq[:L], PSDrec[:L], "r", label="Reconstructed")
plt.xlim(0, N / 4)
plt.ylim(0, N / 4)
plt.legend()
plt.gca().set_aspect(1 / 3)
plt.show()
(a) Time window of the signal, the x marks sampling points in the domain.
(b) Time window of the reconstructed signal from the above measurements.
(c) Power spectral density of the original and reconstructed signal.
Figure 13.1: Compressed sensing reconstruction of a two frequency audio signal, entire time from [0, 1].
The CoSaMP algorithm works in the discrete cosine transform basis, which can be derived from the FFT. For this algorithm we sample \(p=128\) points in time from the \(N = 4096\) high-resolution samples, so we know the exact timing of these samples. If we sample uniformly in time instead, the reconstruction fails, see Figure 13.2. This happens because of aliasing in the high-frequency range, resulting in erroneous frequency peaks. Comparing the samples in Figure 13.1 (a) and Figure 13.2 (a), it is hard to see the difference.
(a) Time window of the signal, the x marks sampling points in the domain.
(b) Time window of the reconstructed signal from the above measurements.
(c) Power spectral density of the original and reconstructed signal.
Figure 13.2: Compressed sensing reconstruction of a two frequency audio signal, entire time from [0, 1] - sampling uniform in time.
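The uniform sampling used for Figure 13.2 can be obtained from the listing above by replacing the random sample positions with equally spaced ones, for instance as in this sketch (reusing N, p and f from the code above):
# Uniform sampling in time instead of random sampling:
# p equally spaced indices out of the N grid points.
perm = np.arange(0, N, N // p)[:p]
y = f[perm]
# the remaining steps (Theta, cosamp, idct) stay unchanged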
In the CoSaMP algorithm the desired level of sparsity can be selected. For the above plots we used \(s=10\) to simulate an unknown sparsity. Convergence to the sparsest solution relies on \(p\), which in turn is related to the sparsity \(K\).
Now that we have seen compressed sensing in action we need to introduce the theoretical foundations behind it.
The key is to look at the geometry of sparse vectors and how these vectors are transformed by random measurements. More precisely, for a large enough number of measurements \(p\), our matrix \(D = CB\) introduced above preserves the distance and inner product structure of sparse vectors. In turn this means we need to find a matrix \(C\) such that \(D\) is a near-isometry on sparse vectors.
Note
An isometry is a map that is distance preserving, i.e. the distance between two points is not changed under the map.
\[
\| a - b \| = \| f(a) - f(b) \|
\]
A unitary map is a map that preserves distance and angle between vectors, i.e. \[
\langle a, b \rangle = \langle f(a), f(b) \rangle
\] For a linear map \(f(x) = Ux\) this results in \(U^\mathrm{H}U = UU^\mathrm{H} = I\).
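A quick numerical check of these two properties, using a random orthogonal matrix obtained from a QR decomposition (purely illustrative):
import numpy as np

rng = np.random.default_rng(6020)

# Random orthogonal (real unitary) matrix via a QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))

a = rng.standard_normal(5)
b = rng.standard_normal(5)

# Distances and inner products are preserved, and U^T U = I.
print(np.allclose(np.linalg.norm(a - b), np.linalg.norm(U @ a - U @ b)))
print(np.allclose(a @ b, (U @ a) @ (U @ b)))
print(np.allclose(U.T @ U, np.eye(5)))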
If \(D\) behaves as a near isometry, it is possible to solve \[
y = D s
\] for the sparsest vector \(s\) using convex \(\ell_1\) minimization.
Furthermore, as a general rule, the more incoherent the measurements are, the smaller we can choose \(p\).
Definition 13.2 (Restricted Isometry Property (RIP)) For \(p\) incoherent measurements the matrix \(D=CB\) satisfies a restricted isometry property (RIP) for sparse vectors \(s\)
\[
(1-\delta_K)\|s\|_2^2 \leq \|C B s\|_2^2 \leq (1+\delta_K)\|s\|_2^2,
\] with the restricted isometry constant \(\delta_K\). For a small enough \(\delta_K\), the matrix \(CB\) acts as a near-isometry on \(K\)-sparse vectors.
In particular, for \(\delta_K < 1\), and therefore \((1-\delta_K) > 0\) and \((1+\delta_K) > 0\), the norm induced by \(CB\) is equivalent to the \(\ell_2\) norm on \(K\)-sparse vectors.
It is difficult to compute \(\delta_K\) in the RIP, and as \(C\) may be selected at random, there is more information in the statistical properties of \(\delta_K\) for a family of matrices \(C\). In general, increasing \(p\) decreases \(\delta_K\), as does using more incoherent measurement vectors, and both improve the properties of \(CB\) by bringing it closer to an isometry.
Luckily, there exist generic sampling matrices \(C\) that are sufficiently incoherent with respect to nearly all transform bases. In particular, Gaussian and Bernoulli random measurement matrices satisfy Definition 13.2 for a generic \(B\) (with high probability).
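Such statements can be probed empirically. The following sketch draws a Gaussian measurement matrix (here with \(B = I\), so \(D = C\)) and checks how far \(\|Cs\|_2^2/\|s\|_2^2\) deviates from one over many random \(K\)-sparse vectors; all sizes are made up, and the result is only an indication, not a proof of the RIP.
import numpy as np

rng = np.random.default_rng(6020)

# Illustrative sizes: n-dimensional signals, p measurements, sparsity K.
n, p, K = 1024, 200, 10
# Gaussian measurement matrix, scaled so that E||Cs||^2 = ||s||^2.
C = rng.standard_normal((p, n)) / np.sqrt(p)

ratios = []
for _ in range(2000):
    s = np.zeros(n)
    s[rng.choice(n, size=K, replace=False)] = rng.standard_normal(K)
    ratios.append(np.sum((C @ s)**2) / np.sum(s**2))

ratios = np.array(ratios)
# Moderate deviations hint at a small restricted isometry constant
# for these sampled K-sparse vectors (no claim about all of them).
print("max deviation from 1:", np.max(np.abs(ratios - 1)))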
13.2 Sparse Regression
Let us return to regression for a moment, and in particular to the LASSO method introduced in Section 7.1.1. We have seen that the \(\ell_1\) norm \(\|\cdot\|_1\) promotes sparsity, and we can use this to create a more robust method that rejects outliers.
We split our points into a training and a test set (the classic 80:20 split). We vary the parameter \(\lambda_1\) through a range of values, fit on the training set, and evaluate against the test set.
Example 13.2 (LASSO fit for a regression problem) For our \[
A x = b
problem, we prescribe the dimensions as \(200\) observations with \(10\) candidate predictors. This results in a matrix \(A\in\mathbb{R}^{200 \times 10}\), and we construct the vector \(b\in\mathbb{R}^{200}\) as the linear combination of exactly two of these \(10\) candidates. As a result, the vector \(x\) is \(2\)-sparse by construction, and the aim is to recover it. In order to give the algorithm something to work on we add noise to \(b\), so that no element of \(b\) is zero.
To perform a \(10\)-fold cross-validation1 for the LASSO method we use the features of sklearn2.
Show the code for the figure
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import model_selection
import pandas as pd

%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

m = 200
n = 10
A = np.random.randn(m, n)
x = np.array([0, 0, 1, 0, 0, 0, -1, 0, 0, 0])
b = A @ x + 2 * np.random.randn(m)

# k-cross validation for the Lasso method
cv = 10
lassoCV = linear_model.LassoCV(cv=cv, random_state=6020, tol=1e-8)
lassoCV.fit(A, b)

# Recompute the lasso method for the optimal lambda selected
lasso_best = linear_model.Lasso(alpha=lassoCV.alpha_, random_state=6020)
lasso_best.fit(A, b)

# Plotting
plt.figure()
Lmean = lassoCV.mse_path_.mean(axis=-1)
error = [np.min(lassoCV.mse_path_, axis=-1),
         np.max(lassoCV.mse_path_, axis=-1)] / np.sqrt(cv)
plt.errorbar(lassoCV.alphas_, Lmean, yerr=error, ecolor="lightgray")
plt.plot(lassoCV.alpha_, Lmean[lassoCV.alphas_ == lassoCV.alpha_],
         "go", mfc="none")
plt.xscale("log")
plt.ylabel("Mean Square Error")
plt.xlabel(r"$\lambda_1$")
plt.gca().invert_xaxis()
plt.show()
Figure 13.3: Cross-validation mean square error of the LASSO fit; the green circle marks the optimal choice, the blue line the mean error of the k folds, and the error bars the maximal and minimal error from the k folds.
The resulting coefficients for the best fit with LASSO can be read directly from the fitted model.
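For example, reusing lasso_best from the listing above (the exact values depend on the random seed and the noise realization):
# Coefficients of the LASSO fit for the cross-validated lambda_1;
# ideally only the third and seventh entries are clearly non-zero,
# matching the construction of x above.
print(lasso_best.coef_)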
It is also possible to build a dictionary and look for a sparse representation in this dictionary.
An example of how this can be done with the Eigenfaces example can be found in (Brunton and Kutz 2022, 117).
13.3 Robust Principal Component Analysis (RPCA)
We discussed the principal component analysis in Section 4.2 as an application of the SVD. Similar to regression, PCA is sensitive to outliers and corrupted data. In Candès et al. (2011) an algorithm was developed to make it more robust.
The main idea of the paper is to decompose a matrix \(X\) into a structured low-rank matrix \(L\) and a sparse matrix \(S\) containing the outliers and corrupt data \[
X = L + S.
\] If we can recover the principal components of \(L\) from \(X\) we have a robust method as the perturbation of \(S\) has little influence.
It might not be immediately obvious, but applications of this split include video surveillance, where the background is represented by \(L\) and the foreground objects by \(S\), and face recognition, where the eigenfaces are in \(L\) and shadows and occlusions (like glasses or masks) are in \(S\).
The task boils down to an optimization problem of the form \[
\min_{L, S} \operatorname{rank}(L) + \| S \|_0 \quad\text{subject to}\quad L+S = X,
\tag{13.2}\] where unfortunately neither the rank nor the \(\ell_0\) pseudo-norm is convex, but we can recover the solution with high probability using the convex relaxation \[
\min_{L, S} \|L\|_{\star} + \lambda \| S \|_1 \quad\text{subject to}\quad L+S = X.
\tag{13.3}\] In Equation 13.3, \(\|\cdot\|_{\star}\) is called the nuclear norm, the sum of all singular values, and we use it as our proxy for the \(\operatorname{rank}\). This convex relaxation is known as principal component pursuit (PCP) and recovers \(L\) and \(S\) with high probability, provided that
\(\lambda = 1/\sqrt{\max(n,m)}\) for \(X \in \mathbb{R}^{m \times n}\),
\(L\) is not sparse,
\(S\) is not low-rank, where we assume that the entries do not span a low-dimensional column space.
We can solve the PCP with an augmented Lagrange multiplier algorithm such as \[
\mathcal{L}(L, S, Y) = \|L\|_{\star} + \lambda \| S \|_1 + \langle Y, X - L - S \rangle + \frac{\mu}{2} \|X-L-S\|_F^2.
\]
This is an iterative method where we solve for \(L_k\) and \(S_k\), update \(Y_{k+1} = Y_k + \mu (X-L_k - S_k)\), and check for convergence. For this specific problem the alternating directions method (ADM) provides a simple procedure to solve for \(L\) and \(S\).
Example 13.3 (Robust principal component analysis with Yale B dataset) The following implementation of RPCA can be found in (Brunton and Kutz 2022, 121) with the same application to the eigenfaces dataset.
Robust principal component analysis and helper functions
def shrink(X, tau):
    Y = np.abs(X) - tau
    return np.sign(X) * np.maximum(Y, np.zeros_like(Y))


def SVT(X, tau):
    U, S, VT = np.linalg.svd(X, full_matrices=False)
    out = U @ np.diag(shrink(S, tau)) @ VT
    return out


def RPCA(X):
    n1, n2 = X.shape
    mu = n1 * n2 / (4 * np.sum(np.abs(X.reshape(-1))))
    lambd = 1 / np.sqrt(np.maximum(n1, n2))
    thresh = 10**(-7) * np.linalg.norm(X)

    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    L = np.zeros_like(X)
    count = 0
    while (np.linalg.norm(X - L - S) > thresh) and (count < 1000):
        L = SVT(X - S + (1 / mu) * Y, 1 / mu)
        S = shrink(X - L + (1 / mu) * Y, lambd / mu)
        Y = Y + mu * (X - L - S)
        count += 1
    return L, S
Figure 13.4: RPCA decomposition for the Yale B dataset.
In Figure 13.4 we can see that RPCA effectively filters occluded regions and shadows out of \(X\), so that they do not appear in \(L\). The missing parts are filled in with the most consistent low-rank features from the provided dataset. In our case we include all faces of the first person plus the third image with a fake moustache.
We need to stress that we cannot use this setup to reconstruct a different face obscured by a moustache.
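As a quick sanity check of the RPCA routine above, independent of the image data, we can apply it to a synthetic low-rank matrix with sparse corruption; the sizes, rank, and corruption level below are arbitrary choices for the illustration.
import numpy as np

rng = np.random.default_rng(6020)

# Synthetic test: low-rank part plus sparse corruption.
m, n, r = 100, 80, 3
L_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S_true = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05            # about 5% of entries corrupted
S_true[mask] = 10 * rng.standard_normal(np.sum(mask))

X = L_true + S_true
L_rec, S_rec = RPCA(X)

print("relative error in L:",
      np.linalg.norm(L_rec - L_true) / np.linalg.norm(L_true))
# Ideally only about r of these singular values are significantly non-zero.
print("largest singular values of L_rec:",
      np.round(np.linalg.svd(L_rec, compute_uv=False)[:5], 2))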
13.4 Sparse Sensor Placement
So far we have looked at how to reconstruct a signal from random measurements in a generic basis. But what about placing sensors at the right points to reconstruct the signal with high probability? This can dramatically reduce the amount of data to measure.
We can place tailored sensors for a particular library (our basis) instead of random sensors in a generic library.
Amongst other things, this can be used to reconstruct faces, or classify signals, see (Brunton and Kutz 2022, chap. 3.8) and references within.
To see this in action we use the Python package PySensors and the example Sea Surface Temperature (SST) sensors from their documentation, see the docs accessed on the 28th of November 2024.
Example 13.4 (Sea Surface Temperature (SST) sensors) For a given dataset of sea surface temperatures as training data, we would like to optimally place (sparse) sensors that allow us to reconstruct the temperature at any other location. This is achieved with the SSPOR algorithm from Manohar et al. (2018).
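The idea behind SSPOR can be sketched in plain NumPy: learn a tailored basis from the training snapshots, choose sensor locations with a column-pivoted QR decomposition of that basis, and reconstruct new snapshots from these few point measurements by solving a small linear system. The sketch below uses random data as a stand-in for the SST snapshots and does not reproduce the PySensors API.
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(6020)

# Random stand-in for the training snapshots (columns = snapshots), rank r.
n_space, n_snapshots, r = 500, 200, 10
X_train = rng.standard_normal((n_space, r)) @ rng.standard_normal((r, n_snapshots))

# Tailored basis: the first r left singular vectors of the training data.
U, _, _ = np.linalg.svd(X_train, full_matrices=False)
Psi_r = U[:, :r]

# Sensor locations from a column-pivoted QR decomposition of Psi_r^T
# (the QR-pivot idea described in Manohar et al. 2018).
_, _, pivots = qr(Psi_r.T, pivoting=True)
sensors = pivots[:r]

# New snapshot from the same low-dimensional process, measured only at the sensors.
x_new = X_train @ rng.standard_normal(n_snapshots)
a = np.linalg.solve(Psi_r[sensors, :], x_new[sensors])
x_rec = Psi_r @ a

print("relative reconstruction error:",
      np.linalg.norm(x_rec - x_new) / np.linalg.norm(x_new))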
This concludes our excursion into compressed sensing.
Brunton, Steven L., and J. Nathan Kutz. 2022. Data-Driven Science and Engineering - Machine Learning, Dynamical Systems, and Control. 2nd ed. Cambridge: Cambridge University Press.
Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. 2011. “Robust Principal Component Analysis?” J. ACM 58 (3). https://doi.org/10.1145/1970392.1970395.
Manohar, Krithika, Bingni W. Brunton, J. Nathan Kutz, and Steven L. Brunton. 2018. “Data-Driven Sparse Sensor Placement for Reconstruction: Demonstrating the Benefits of Exploiting Known Patterns.” IEEE Control Systems Magazine 38 (3): 63–86. https://doi.org/10.1109/MCS.2018.2810460.
Needell, D., and J. A. Tropp. 2009. “CoSaMP: Iterative Signal Recovery from Incomplete and Inaccurate Samples.” Applied and Computational Harmonic Analysis 26 (3): 301–21. https://doi.org/10.1016/j.acha.2008.07.002.
We split our data into \(10\) sets, use 9 for training and 1 for testing, and average over the results.↩︎