13 Compressed Sensing

We use compression in everyday life extensively. Nevertheless, it almost always requires us to have the high resolution data/measurement and than compress it down. Here we discuss the reverse, where we collect relatively few compressed or random measurements and then infer a sparse representation.

Until recent developments the computation was non-polynomial (NP) hard problem but now there are algorithms to reconstruct the full signal with high probability using convex algorithms.

Definition 13.1 (Compressed Sensing) For a signal \(x\) that is \(K\)-sparse in \(\mathcal{B}\) with matrix \(B\) we need \(n\) measurements and can compress them to \(K\).

In compressed sensing we collect \(p\) randomly chosen or compressed measurements and solve for the \(K\) elements of \(s\) that are not zero, with \(K < p \ll n\).

We can represent \(p\) linear measurements of \(x\) as the matrix \(C\in\mathbb{R}^{p\times n}\) and thus \[ y = C x. \]

Relating this back to \(s\) and \(\mathcal{B}\) we get \[ y = C x = C B s = D s, \] as an under-determined system. To get the optimal solution, i.e. the smallest \(K\), we optimize \[ \check{s} = \operatorname{argmin}_s \| s \|_0, \quad \text{subject to} \quad y = C B s. \] Here \(\|\cdot\|_0\) is \(\ell_0\) pseudo norm that measures the number of non-zero elements.

As we will see, the selection of the matrix \(C\) is important to allow for the optimization to be such that we do not need brute-force to look for \(\check{s}\). This brute-force search is combinatorial in \(n\) and \(K\), for a fixed (known) \(K\) but significantly larger for unknown \(K\).

Under certain condition on \(C\) we can relax the conditions to a convex \(\ell_1\)-minimization \[ \check{s} = \operatorname{argmin}_s \| s \|_1, \quad \text{subject to} \quad y = C B s. \tag{13.1}\]

In order to have Equation 13.1 converge with high properbility we need to following assumptions to be met (precise description will follow):

The measurement matrix \(C\) must be incoherent with respect to the sparsifying basis \(\mathcal{B}\), i.e. the rows of \(C\) do not correlate with the columns of \(B\).
The number of measurements \(p\) must be sufficiently large, on the order of \[ p \approx \mathcal{O}\left(K\log\left(\frac{n}{K}\right)\right) \approx k_1\, K \log\left(\frac{n}{K}\right), \] where \(k_1\) depends on how incoherent \(C\) and \(B\) are.

Note

In Figure 12.1 we illustrated compressed sensing with a \(4016 \times 6016\) image. With our 5% of Fourier coefficients this puts the sparsity \(K = 0.05 \times 4016 \times 6016 \approx 1 208 013\) and with \(k_1=3\) we would stay with \(p\) 3.6 million measurements and about 15% of the original pixels.

The trick part is how to select them, which is why compressed sensing is not used much in imaging, except for some cases like magnetic resonance imaging (MRI) for patience that can not stay still or sedation is risky.

The idea of the two conditions is to make \(CB\) act more or less like a unitary transformation on \(K\)-sparse vectors \(s\), preserving relative distance between vectors and allowing for the \(\ell_1\) convex optimization. We will see this in therms of the restricted isometry property (RIP).

We know of the Shannon-Nyquist sampling theorem:

If a function \(f(t)\) contains no frequencies higher than \(b\) hertz, then it can be completely determined from its ordinates at a sequence of points spaced less than \(\frac{1}{2b}\) seconds apart. - see Wikipedia

telling us that we need a sampling rate of at least double the highest frequency to recover the signal properly. However, this is only true for signals with broadband frequency content and this is hardly ever the case for uncompressed signals. As we expect an uncompressed signal can be expressed as a sparse signal in the correct basis we can relax the Shannon-Nyquist theorem. Nevertheless, we need precise timing for our measurements and the recovery is not guaranteed, it is possible with high probability.

Note

There are alternative formulations based on so called greedy algorithms that determine the sparsity through an iterative matching pursuit problem, e.g. the compressed sensing matching pursuit algorithm or CoSaMP for short.

A related convex formulation we have already seen and can be applied here is \[ \check{s} = \operatorname{argmin}_s \| CBs - y\|_2 + \lambda_1 \|s\|_1 \] compare Chapter 7 where \(\lambda\) is a weight for the importance of sparsity, see Figure 7.1.

Chapter 7 actually serve as our first example and shows neatly how the \(\ell_1\) norm promotes sparsity whereas the \(\ell_2\) solution stays dense. We can also use this example to give an insight to what with high probability means. We created a random matrix in this example with 5 times more rows than columns, unless we are very unlucky and we have a lot of linear dependency between the rows we will have infinitely many solution with high probability.

Example 13.1 (Recovering an Audio Signal from Sparse Measurements) Let us consider a audio signal consisting of a two-tone audio signal \[ f(t) = \cos(97 t\, 2\pi) + \cos(777 t\, 2\pi) \] which is sparse in the frequency domain (transformation via FFT). The highest frequency is \(777 \mathrm{Hz}\) and therefore the Shannon-Nyquist theorem tells us we should sample with at least \(1554 \mathrm{Hz}\).

The sparsity of the sample allows us to reconstruct the signal by sampling randomly with an average sampling rate of \(128 \mathrm{Hz}\), well below the Shannon-Nyquist threshold.

The code relies on the CoSaMP implementation Needell and Tropp (2009) provided on gitHub and included below so we can work with it.

The code is adapted from (Brunton and Kutz 2022, Code 3.3)

CoSaMP algorithm form gitHub, now print.

import numpy as np

def cosamp(phi, u, s, epsilon=1e-10, max_iter=1000):
    """
    Return an `s`-sparse approximation of the target signal
    Input:
        - phi, sampling matrix
        - u, noisy sample vector
        - s, sparsity
    """
    a = np.zeros(phi.shape[1])
    v = u
    it = 0 # count
    halt = False
    while not halt:
        it += 1
        
        y = np.dot(np.transpose(phi), v)
        # large components
        omega = np.argsort(y)[-(2*s):]
        omega = np.union1d(omega, a.nonzero()[0])
        phiT = phi[:, omega]
        b = np.zeros(phi.shape[1])
        # Solve Least Square
        b[omega], _, _, _ = np.linalg.lstsq(phiT, u)
        
        # Get new estimate
        b[np.argsort(b)[:-s]] = 0
        a = b
        
        # Halt criterion
        v_old = v
        v = u - np.dot(phi, a)

        halt = (np.linalg.norm(v - v_old) < epsilon) or \
            np.linalg.norm(v) < epsilon or \
            it > max_iter
        
    return a

Show the code for the figure

import matplotlib.pyplot as plt
import numpy as np
from scipy.fftpack import dct, idct
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

N = 4096
t = np.linspace(0, 1, N)
fun = lambda t: np.cos(97 * t * 2 * np.pi) + np.cos(777 * t * 2 * np.pi)
f = fun(t)
f_hat = np.fft.fft(f)
PSD = np.abs(f_hat)**2 / N

## Randomly sample signal
# num. random samples
p = N // 32
perm = np.floor(np.random.rand(p) * N).astype(int)
y = f[perm]

## Solve compressed sensing problem
# Build Psi
Psi = dct(np.identity(N))
# Measure rows of Psi
Theta = Psi[perm, :]

# CS via matching pursuit
s = cosamp(Theta, y, s=10, epsilon=1.e-10, max_iter=10)
frec = idct(s)
# computes the (fast) discrete fourier transform
frec_hat = np.fft.fft(frec, N)
# Power spectrum (how much power in each freq)
PSDrec = np.abs(frec_hat)**2 / N

time_window = np.array([512, 768]) / N
freq = np.arange(N)
L = int(np.floor(N / 2))

plt.figure()
plt.plot(t, f, "k")
plt.plot(perm/N, y, "rx", ms=12, mew=4)
plt.xlabel("Time [s]")
plt.xlim(time_window[0], time_window[1])
plt.ylim(-2, 2)
plt.gca().set_aspect(5e-3)

plt.figure()
plt.plot(t, frec, "r")
plt.xlim(time_window[0], time_window[1])
plt.xlabel("Time [s]")
plt.ylim(-2, 2)
plt.gca().set_aspect(5e-3)

plt.figure()
plt.plot(freq[:L], PSD[:L], "k", label="Original")
plt.plot(freq[:L], PSDrec[:L], "r", label="Reconstructed")
plt.xlim(0, N/4)
plt.ylim(0, N/4)
plt.legend()
plt.gca().set_aspect(1/3)

plt.show()

(a) Time window of the signal, the x marks sampling points in the domain.

The CoSaMP algorithm works in the discrete cosine transform basis, which can be derived from the FFT. For this algorithm we sample \(p=128\) points in time from the \(N = 4096\) high resolution sampling. We therefore know exactly the timing of these samples. If we would sample uniformly in time we fail, see Figure 13.2. This happens because we see an aliasing effect for the high-frequency domain resulting in erroneous frequency peaks. If we compare the samples in Figure 13.1 (a) and Figure 13.2 (a) it is hard to see the difference.

Show the code for the figure

ttt = np.random.uniform(0, 1, p)
y = fun(ttt)

## Solve compressed sensing problem
Psi = dct(np.identity(N))
Theta = Psi[perm, :]

# CS via matching pursuit
s = cosamp(Theta, y, s=10, epsilon=1.e-10, max_iter=10) 
frec = idct(s)
frec_hat = np.fft.fft(frec, N)
PSDrec = np.abs(frec_hat)**2 / N

plt.figure()
plt.plot(t, f, "k")
plt.plot(perm/N, y, "rx", ms=12, mew=4)
plt.xlabel("Time [s]")
plt.xlim(time_window[0], time_window[1])
plt.ylim(-2, 2)
plt.gca().set_aspect(5e-3)

plt.figure()
plt.plot(t, frec, "r")
plt.xlim(time_window[0], time_window[1])
plt.xlabel("Time [s]")
plt.ylim(-2, 2)
plt.gca().set_aspect(5e-3)

plt.figure()
plt.plot(freq[:L], PSD[:L], "k", label="Original")
plt.plot(freq[:L], PSDrec[:L], "r", label="Reconstructed")
plt.xlim(0, N/4)
plt.ylim(0, N/4)
plt.legend()
plt.gca().set_aspect(1/3)

In the algorithm CoSaMP the desired level of sparsity can be selected. For the above plots we used \(s=10\) to simulate an unknown factor. Convergence to the sparsest solution relies on \(p\), which in return is related to the sparsity \(K\).

(Compare Brunton and Kutz 2022, 107)

13.1 The theoretic basics of compressed sensing

Now that we have seen compressed sensing in action we need to bring in the theoretical basics that form this theory.

The key feature is to look into the geometry of sparse vectors and how these vectors are transformed via random measurements. More precisely, for large enough \(p\) (amount of measurements) our matrix \(D = CB\) of Definition 13.1 preserves the distance and inner product structure of sparse vectors. In turn this means, we need to find a matrix \(C\) such that \(D\) is a near-isometry map on sparse vectors.

Note

An isometry map is a map that is distance preserving, i.e. the distance between to points is not changed under this map.

\[ \| a - b \| = \| f(a) - f(b) \| \]

A unitary map is a map that preserves distance and angle between vectors, i.e. \[ \langle a, b \rangle = \langle f(a), f(b) \rangle \] For a linear map \(f(x) = Ux\) this results in \(U^\mathrm{H}U = UU^\mathrm{H} = I\).

If \(D\) behaves as a near isometry, it is possible to solve \[ y = D s \] for the sparsest vector \(s\) using convex \(\ell_1\) minimization.

Furthermore, a general rule is, the more incoherent the measurements are the smaller we can choose \(p\).

Definition 13.2 (Restricted Isometry Property (RIP)) For \(p\) incoherent measurements the matrix \(D=CB\) satisfies a restricted isometry property (RIP) for sparse vectors \(s\)

\[ (1-\delta_K)\|s\|_2^2 \leq \|C B s\|_2^2 \leq (1+\delta_K)\|s\|_2^2, \] with the restricted isometry constant \(\delta_K\). For a small enough \(\delta_K\) \(CB\) acts as a near-isometry on \(K\)-sparse vectors.

In particular, for \(\delta_K < 1\) and therefore \((1-\delta_K) > 0\) and \((1+\delta_K) > 0\) this means the norm induced by \(CB\) is equivalent to the two norm on the \(K\)-sparse vectors.

It is difficult to compute the \(\delta_K\) in the RIP and as \(C\) may be selected at random there is more information included in the statistical properties of \(\delta_K\) for a family of matrices \(C\). In general, increasing \(p\) will decrease \(\delta_K\), same as having incoherence vectors and both improve the properties of \(CB\) by bringing it closer to an isometry.

Luckily, there exist generic sampling matrices \(C\) that are sufficiently incoherent with respect to nearly all transform bases. In particular, Gaussian and Bernoulli random measurement matrices satisfy Definition 13.2 for a generic \(B\) (with high probability).

13.2 Sparse Regression

Let us return to regression for a moment an in particular to the LASSO method introduced in Section 7.1.1. We have seen that \(\ell_1\), i.e. the \(\|\cdot\|_1\), promotes sparsity and we can use this to create a more robust method that rejects outliers.

We split up our points into a training and test set (the classic 80:20 split). We vary the parameter \(\lambda_1\) through a range of values to create a fit with the training set and test against the test set.

Example 13.2 (LASSO fit for a regression problem) For our \[ A x = b \] problem, we prescribe the dimensions as \(200\) observations with \(10\) candidate predictions. This results in a matrix \(A\in\mathbb{R}^{200 \times 10}\) and we select the vector \(b\in\mathbb{R}^{200}\) as the linear combination of exactly two of theses \(10\) candidates. As a result, the vector \(x\) is \(2\)-sparse by construction, and the aim is to recover this. In order to give the algorithm something to work on we add noise to \(b\) resulting in no zero element in \(b\).

To perform a \(10\) fold cross validation ¹ for the LASSO method we use the features of sklearn ².

Show the code for the figure

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import model_selection
import pandas as pd
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

m = 200
n = 10

A = np.random.randn(m, n)
x = np.array([0, 0, 1, 0, 0, 0, -1, 0, 0, 0])
b = A @ x + 2 * np.random.randn(m)

# k-cross validation for the Lasso method
cv = 10
lassoCV = linear_model.LassoCV(cv=cv, random_state=6020, tol=1e-8)
lassoCV.fit(A, b)
# Recompute the lasso method for the optimal lambda selected
lasso_best = linear_model.Lasso(alpha=lassoCV.alpha_, random_state=6020)
lasso_best.fit(A, b)

# Plotting
plt.figure()
Lmean = lassoCV.mse_path_.mean(axis=-1)
error = [np.min(lassoCV.mse_path_, axis=-1), 
         np.max(lassoCV.mse_path_, axis=-1)] / np.sqrt(cv)
plt.errorbar(lassoCV.alphas_, Lmean, yerr=error, ecolor="lightgray")
plt.plot(lassoCV.alpha_,
         Lmean[lassoCV.alphas_==lassoCV.alpha_],
         "go", mfc='none')
plt.xscale("log")
plt.ylabel("Means Square Error")
plt.xlabel(r"$\lambda_1$")
plt.gca().invert_xaxis()
plt.show()

Figure 13.3: Cross-validation mean square error of Lasso fit, the green circle marks the optimal choice, the blue line the mean error of the k-folds and the error bar the maximal and minimal error form the k-folds.

The resulting coefficients for the best fit with lasso are

array([ 0.        ,  0.        ,  1.0978465 ,  0.        , -0.        ,
       -0.        , -1.06495486,  0.        ,  0.        ,  0.        ])

and a appropriate fit for our toy example.

(Compare Brunton and Kutz 2022, 115)

Note

It is also possible to build a dictionary and and look for a sparse representation in this dictionary.

An example how this can be done with the Eigenfaces example can be found in (Brunton and Kutz 2022, 117)

13.3 Robust Principal Component Analysis (RPCA)

We discussed the principal component analysis in Section 4.2 as an application for the SVD. Similar as regression, also the PCA is sensitive to outliers and corrupt data. In Candès et al. (2011) an algorithm to make it more robust was developed.

The main idea of the paper is to decompose a matrix \(X\) into a structured low-rank matrix \(L\) and a sparse matrix \(S\) containing the outliers and corrupt data \[ X = L + S. \] If we can recover the principal components of \(L\) from \(X\) we have a robust method as the perturbation of \(S\) has little influence.

It might not be immediately obvious but applications of this split are video surveillance where the background is represented by \(L\) and the foreground objects in \(L\), face recognition with the eigenfaces in \(L\) and shadows, occlusions (like glasses or masks) are in \(S\).

The task boils down to an optimization problem of the form \[ \min_{L, S} \operatorname{rank}(L) + \| S \|_0 \quad\text{subject to}\quad L+S = X, \tag{13.2}\] where unfortunately both parts are not convex but we can search for it with high probability using \[ \min_{L, S} \|L\|_{\star} + \lambda \| S \|_1 \quad\text{subject to}\quad L+S = X. \tag{13.3}\] In Equation 13.3 \(\|\cdot\|_{\star}\) is called the nuclear norm consisting of the sum of all singular values and we use it as our proxy for the \(\operatorname{rank}\).

Equation 13.3 is called principal component pursuit (PCP) and in Candès et al. (2011) the authors show that Equation 13.3 converges to Equation 13.2 for

\(\lambda = 1/\sqrt{\max(n,m)}\) for \(X^{m \times n}\),
\(L\) is not sparse,
\(S\) is not low-rank, where we assume that the entries do not span a low-dimensional column space.

We can solve the PCP with an augmented Lagrange multiplier algorithm such as \[ \mathcal{L}(L, S, Y) = \|L\|_{\star} + \lambda \| S \|_1 + \langle Y, X - L - S \rangle + \frac{\mu}{2} \|X-L-S\|_F^2. \]

This is an iterative method where we solve for \(L_k\) and \(S_k\), update \(Y_{k+1} = Y_k + \mu (X-L_k - S_k)\) and check for convergence. For this specific problem the alternations method (ADM) provides a simple procedure to solve for \(L\) and \(S\).

Example 13.3 (Robust principal component analysis with Yale B dataset) The following implementation of RPCA can be found in (Brunton and Kutz 2022, 121) with the same application to the eigenfaces dataset.

Robust principal component analysis and helper functions

def shrink(X, tau):
    Y = np.abs(X) - tau
    return np.sign(X) * np.maximum(Y, np.zeros_like(Y))


def SVT(X, tau):
    U, S, VT = np.linalg.svd(X, full_matrices=False)
    out = U @ np.diag(shrink(S, tau)) @ VT
    return out


def RPCA(X):
    n1, n2 = X.shape
    mu = n1 * n2 / (4 * np.sum(np.abs(X.reshape(-1))))
    lambd = 1 / np.sqrt(np.maximum(n1, n2))
    thresh = 10**(-7) * np.linalg.norm(X)
    
    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    L = np.zeros_like(X)
    count = 0
    while (np.linalg.norm(X - L - S) > thresh) and (count < 1000):
        L = SVT(X - S + (1 / mu) * Y, 1 / mu)
        S = shrink(X - L + (1 / mu) * Y, lambd / mu)
        Y = Y + mu*(X - L - S)
        count += 1
    return L, S

Show the code for the figure

import numpy as np
import numpy.linalg as LA
import scipy
import requests
import io
import imageio.v3 as iio
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']

def rgb2gray(rgb):
    return np.dot(rgb[...,:3], [0.2989, 0.5870, 0.1140])

response = requests.get(
    "https://github.com/frankhuettner/Data_Driven_Science_Julia_Demos"
    "/raw/refs/heads/main/DATA/allFaces.mat")

data = scipy.io.loadmat(io.BytesIO(response.content))
faces = data["faces"]
m = int(data["m"][0,0])
n = int(data["n"][0,0])
nfaces = np.ndarray.flatten(data['nfaces'])

XX = faces[:,:nfaces[0]]
im = np.asarray(iio.imread(
    "https://raw.githubusercontent.com/dynamicslab/databook_python"
    "/refs/heads/master/DATA/mustache.jpg"))
A = np.round(rgb2gray(im)/255).astype("uint8")
X = np.append(XX, (XX[:, 2] * A.T.flatten()).reshape((-1, 1)), axis=1)

L, S = RPCA(X)

for index in [2, 3, -1]:
    plt.figure()
    plt.imshow(np.reshape(X[:, index], (m, n)).T, cmap="gray")
    plt.gca().axis("off")
    plt.figure()
    plt.imshow(np.reshape(L[:, index], (m, n)).T, cmap="gray")
    plt.gca().axis("off")
    plt.figure()
    plt.imshow(np.reshape(S[:, index], (m, n)).T, cmap="gray")
    plt.gca().axis("off")

In Figure 13.4 we can see that RPCA effectively filters out occluded regions and shadows form \(X\) in \(L\). The missing part is filled in with the most consistent low-rank feature from the provided dataset. In our case we include all faces from the first person plus the third image with a fake moustache.

We need to stress, that we can not use this setup to reconstruct a different face obscured by a moustache.

13.4 Sparse Sensor Placement

So far we have looked how to reconstruct a signal with random measurements in a generic basis. But how about placing sensors at the correct points to reconstruct the signal with high probability. This can dramatically reduce the amount of data to measure.

We can set tailored sensors for a particular library (our basis) instead of random sensors in a generic library.

Amongst other things, this can be used to reconstruct faces, or classify signals, see (Brunton and Kutz 2022, chap. 3.8) and references within.

To see this in action we use the Python package PySensors and the example Sea Surface Temperature (SST) sensors from their documentation, see the docs accessed on the 28th of November 2024.

Example 13.4 (Sea Surface Temperature (SST) sensors) For a given dataset of sea surface temperature as training data we would like to place (sparse) sensors optimal that allow us to reconstruct the temperature at any other location. This is achieved with the SSPOR algorithm from Manohar et al. (2018).

Show the code for the figure

from ftplib import FTP
import numpy as np
import matplotlib.pyplot as plt
import netCDF4
import pysensors as ps
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

# Import and save data locally
ftp = FTP('ftp.cdc.noaa.gov')
ftp.login()
ftp.cwd('/Datasets/noaa.oisst.v2/')

filenames = ['sst.wkmean.1990-present.nc', 'lsmask.nc']

for filename in filenames:
    localfile = open(filename, 'wb')
    ftp.retrbinary('RETR ' + filename, localfile.write, 1024)
    localfile.close()

ftp.quit()

f = netCDF4.Dataset('sst.wkmean.1990-present.nc')
lat,lon = f.variables['lat'], f.variables['lon']
SST = f.variables['sst']
sst = SST[:]

f = netCDF4.Dataset('lsmask.nc')
mask = f.variables['mask']

masks = np.bool_(np.squeeze(mask))
snapshot = float("nan")*np.ones((180,360))
snapshot[masks] = sst[0,masks]

plt.figure()
plt.imshow(snapshot, cmap=plt.cm.coolwarm)
plt.xticks([])
plt.yticks([])
X = sst[:,masks]
X = np.reshape(X.compressed(), X.shape)

# Compute optimal sensor placement
model = ps.SSPOR(
    basis=ps.basis.SVD(n_basis_modes=25),
    n_sensors=25
)
model.fit(X)
sensors = model.get_selected_sensors()

# Plot sensor locations
temp = np.transpose(0 * X[1,:])
temp[sensors] = 1
img = 0 * snapshot
img[masks] = temp
plt.figure()
plt.imshow(snapshot, cmap=plt.cm.coolwarm)
indx = np.where(img==1)
plt.scatter(indx[1], indx[0], 8, color='black')
plt.xticks([])
plt.yticks([])

(Compare docs of PySensors accessed on the 28th of November 2024)

This concludes our excursion into compressed sensing.

We split our data into \(10\) sets, use 9 for the training and 1 for test and average over the results.↩︎
Install the package via pdm add scikit-learn.↩︎