2  Data sets

We continue our little introduction by looking at data sets in the sense of a list of values that we want to describe closer.

We use the mieten3.asc from Open Data LMU. The data set contains information about rents in Munich from the year 2003. The columns have the following meaning, see DETAILS:

Variablenbeschreibung:

nm          Nettomiete in EUR
nmqm        Nettomiete pro m² in EUR
wfl         Wohnfläche in m²
rooms       Anzahl der Zimmer in der Wohnung
bj          Baujahr der Wohnung
bez         Stadtbezirk
wohngut     Gute Wohnlage? (J=1,N=0)
wohnbest    Beste Wohnlage? (J=1,N=0)
ww0         Warmwasserversorgung vorhanden? (J=0,N=1)
zh0         Zentralheizung vorhanden? (J=0,N=1)
badkach0    Gekacheltes Badezimmer? (J=0,N=1)
badextra    Besondere Zusatzausstattung im Bad? (J=1,N=0)
kueche      Gehobene Küche? (J=1,N=0)

For now, we’ll just show the code without much explanation because we want to jump right in and do not want to delve into how it works. We use a structured array of numpy for it.

import numpy as np
import requests
import io

response = requests.get("https://data.ub.uni-muenchen.de/2/1/miete03.asc")

# Transform the content of the file into a numpy.ndarray
data = np.genfromtxt(io.BytesIO(response.content), names=True)
# Access the data types and names
print(f"{data.dtype = }")
# Access the second element (row)
print(f"{data[1] = }")
# Access a name
print(f'{data["rooms"] = }')
data.dtype = dtype([('nm', '<f8'), ('nmqm', '<f8'), ('wfl', '<f8'), ('rooms', '<f8'), ('bj', '<f8'), ('bez', '<f8'), ('wohngut', '<f8'), ('wohnbest', '<f8'), ('ww0', '<f8'), ('zh0', '<f8'), ('badkach0', '<f8'), ('badextra', '<f8'), ('kueche', '<f8')])
data[1] = np.void((715.82, 11.01, 65.0, 2.0, 1995.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), dtype=[('nm', '<f8'), ('nmqm', '<f8'), ('wfl', '<f8'), ('rooms', '<f8'), ('bj', '<f8'), ('bez', '<f8'), ('wohngut', '<f8'), ('wohnbest', '<f8'), ('ww0', '<f8'), ('zh0', '<f8'), ('badkach0', '<f8'), ('badextra', '<f8'), ('kueche', '<f8')])
data["rooms"] = array([2., 2., 3., ..., 3., 1., 3.])

Now that we have some data we can look at it more closely, for this we interpret a row as a vector.

2.1 Basic properties of a data set

First we are looking at the total net rent, i.e. the row nm.

For a vector \(v \in \mathbb{R}^n\) we have:

  • the maximal value, i.e. the maximum \[ v^{max} = \max_i v_i, \]
  • the minimal value, i.e. the minimum \[ v^{min} = \min_i v_i, \]
  • the mean of all values (often called the arithmetic mean) \[ \overline{v} = \frac1n \sum_{i=1}^n v_i = \frac{v_1 + v_2 + \cdots + v_n}{n}, \]
  • the median, i.e. the value where half of all the other values are bigger and the other half is smaller, for a sorted \(v\) this is \[ \widetilde{v} = \begin{cases} v_{(n+1)/2}& n\quad \text{odd}\\ \frac{v_{n/2} + v_{n/2+1}}{2}& n\quad \text{even} \end{cases}, \]
  • more general, we have quantiles. For a sorted \(v\) and \(p\in(0,1)\) \[ \overline{v}_p = \begin{cases} \frac12\left(v_{np} + v_{np+1}\right) & pn \in \mathbb{N}\\ v_{\lfloor np+1\rfloor} & pn \not\in \mathbb{N} \end{cases}. \] Some quantiles have special names, like the median for \(p=0.5\), the lower and upper quartile for \(p=0.25\) and \(p=0.75\) (or first, second (median) and third quartile), respectively.
nm_max = np.max(data["nm"])
print(f"{nm_max = }")

nm_min = np.min(data["nm"])
print(f"{nm_min = }")

nm_mean = np.mean(data["nm"])
# round to 2 digits
nm_mean_r = np.around(nm_mean, 2)
print(f"{nm_mean_r = }")

nm_median = np.median(data["nm"])
print(f"{nm_median = }")

nm_quartiles = np.quantile(data["nm"], [1 / 4, 1 / 2, 3 / 4])
print(f"{nm_quartiles = }")
nm_max = np.float64(1789.55)
nm_min = np.float64(77.31)
nm_mean_r = np.float64(570.09)
nm_median = np.float64(534.3)
nm_quartiles = array([389.95, 534.3 , 700.48])

From this Python snippet we know that for tenants the rent varied between 77.31 and 1789.55, with an average of 570.09 and a median of 534.3. Of course there are tricky questions that require us to dig a bit deeper into these functions, e.g. how many rooms does the most expensive flat have? The surprising answer is 3 and it was built in 1994, but how do we obtain these results?

We can use numpy.argwhere or a function which returns the index directly like numpy.argmax.

max_index = np.argmax(data["nm"])
rooms = int(data["rooms"][max_index])
year = int(data["bj"][max_index])
print(f"{rooms = }, {year = }")
rooms = 3, year = 1994

2.1.1 Visualization

Tip

There are various ways of visualizing data in Python. Two widely used packages are matplotlib and plotly.

It often helps to visualize the values to see differences and get an idea of their use.

Show the code for the figure
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

nm_sort = np.sort(data["nm"])
x = np.linspace(
    0,
    1,
    len(nm_sort),
    endpoint=True,
)

plt.plot(x, nm_sort, label="net rent")
plt.axis((0, 1, np.round(nm_min / 100) * 100, np.round(nm_max / 100) * 100))
plt.xlabel("Scaled index")
plt.ylabel("Net rent - nm")

plt.plot(
    [0, 0.25, 0.25], [nm_quartiles[0], nm_quartiles[0], nm_min], label="1st quartile"
)
plt.plot(
    [0, 0.5, 0.5], [nm_quartiles[1], nm_quartiles[1], nm_min], label="2st quartile"
)
plt.plot(
    [0, 0.75, 0.75], [nm_quartiles[2], nm_quartiles[2], nm_min], label="3st quartile"
)
plt.plot([0, 1], [nm_mean, nm_mean], label="mean")
plt.legend()
plt.show()
Figure 2.1: Visualization of the different measurements.

What is shown in Figure 2.1 is often combined into a single boxplot (see Figure 2.2) that provides way more information at once.

Show the code for the figure
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Box(y=data["nm"], name="Standard"))
fig.add_trace(go.Box(y=data["nm"], name="With points", boxpoints="all"))
fig.show()
Figure 2.2: Boxplot done in plotly with whiskers following 3/2 IQR.

The plot contains the box which is defined by the 1st quantile \(Q_1\) and the 3rd quantile \(Q_3\), with the median as line in between these two. Furthermore, we can see the whiskers which help us identify so called outliers. By default they are defined as \(\pm 1.5(Q_3 - Q_1)\), where (\(Q_3 - Q_1\)) is often called the interquartile range (IQR).

Note

Figure 2.2 is an interactive plot in the html version.

2.2 Spread

The spread (or dispersion, variability, scatter) are measures used in statistics to classify how data is distributed. Common examples are variance, standard deviation, and the interquartile range that we have already seen above.

Definition 2.1 (Variance) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the variance is defined as \[ \operatorname{Var}(v) = \frac1n \sum_{i=1}^n (v_i - \mu)^2, \quad \mu = \overline{v} \quad\text{(the mean)} \] or directly \[ \operatorname{Var}(v) = \frac{1}{n^2} \sum_{i=1}^n\sum_{j>i} (v_i - v_j)^2. \]

Definition 2.2 (Standard deviation) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the standard deviation is defined as \[ \sigma = \sqrt{\frac1n \sum_{i=1}^n (v_i - \mu)^2}, \quad \mu = \overline{v} \quad\text{(the mean)}. \] If we interpret \(v\) as a sample this is often also called uncorrected sample standard deviation.

Definition 2.3 (Interquartile range (IQR)) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the interquartile range is defined as the difference of the first and third quartile, i.e. \[ IQR = \overline{v}_{0.75} - \overline{v}_{0.25}. \]

With numpy they are computed as follows

nm_var = np.var(data["nm"])
print(f"{nm_var = }")

nm_std = np.std(data["nm"])
print(f"{nm_std = }")

nm_IQR = nm_quartiles[2] - nm_quartiles[0]
print(f"{nm_IQR = }")
nm_var = np.float64(60208.75551600402)
nm_std = np.float64(245.37472468859548)
nm_IQR = np.float64(310.53000000000003)

2.3 Histogram

When exploring data it is also quite useful to draw histograms. For the net rent this makes not much sense but for rooms this is useful.

Show the code for the figure
plt.hist(data["rooms"])
plt.xlabel("rooms")
plt.ylabel("# of rooms")
plt.show()
Figure 2.3: Histogram of the number of rooms in our dataset.

What we see in Figure 2.3 is simply the amount of occurrences of \(1\) to \(6\) in our dataset. Already we can see something rather interesting, there are flats with \(5.5\) rooms in our dataset.

Another helpful histogram is Figure 2.4 showing the amount of buildings built per year.

Show the code for the figure
plt.hist(data["bj"])
plt.xlabel("year of building")
plt.ylabel("# of buildings")
plt.show()
Figure 2.4: Histogram of buildings built per year.

2.4 Correlation

In statistics, the terms correlation or dependence describe any statistical relationship between bivariate data (data that is paired) or random variables.

For our dataset we can, for example, check:

  1. the living area in \(m^2\) - wfl vs. the net rent - nm
  2. the year of construction - bj vs. if central heating - zh0 is available
  3. the year of construction - bj vs. the city district - bez
Show the code for the figure
from plotly.subplots import make_subplots

fig = make_subplots(rows=3, cols=1)

fig.add_trace(go.Scatter(x=data["wfl"], y=data["nm"], mode="markers"), row=1, col=1)
fig.update_xaxes(title_text="living area in m^2", row=1, col=1)
fig.update_yaxes(title_text="net rent", row=1, col=1)

fig.add_trace(go.Scatter(x=data["bj"], y=data["zh0"], mode="markers"), row=2, col=1)
fig.update_xaxes(title_text="year of construction", row=2, col=1)
fig.update_yaxes(title_text="central heating", row=2, col=1)

fig.add_trace(go.Scatter(x=data["bj"], y=data["bez"], mode="markers"), row=3, col=1)
fig.update_xaxes(title_text="year of construction", row=3, col=1)
fig.update_yaxes(title_text="city district", row=3, col=1)

fig.show()
Figure 2.5: Scatterplot to investigate correlations in the data set.

In the first plot of Figure 2.5 we see that the rent tends to go up with the size of the flat but there are for sure some rather cheap options in terms of space.

The second plot of Figure 2.5 tells us that central heating became a constant around \(1966\). Of course we can also guess that the older buildings with central heating were renovated, but we have no data to support this claim.

The third plot of Figure 2.5 does not yield an immediate correlation.

More formally, we can describe possible correlations using the covariance. The covariance is a measure of the joint variability of two random variables.

Definition 2.4 (Covariance) For two finite sets represented by vectors \(v, w\in\mathbb{R}^n\) the covariance is defined as \[ \operatorname{cov}(v, w) = \frac1n \langle v -\overline{v}, w - \overline{w}\rangle. \]

The covariance is tricky to interpret, e.g. the unities of the two must not make sense. In the example below, we have rent per square meter, which makes some sense.

From the covariance we can compute the correlation.

Definition 2.5 (Correlation) For two finite sets represented by vectors \(v, w\in\mathbb{R}^n\) the correlation is defined as \[ \rho_{v,w} = \operatorname{corr}(v, w) = \frac{\operatorname{cov}(v, w)}{\sigma_v \sigma_w}, \] where \(\sigma_v\) and \(\sigma_w\) are the standard deviation of these vectors, see Definition 2.2.

In numpy the function numpy.cov computes a matrix where the diagonal is the variance of the values and the off-diagonals are the covariances of the \(i\) and \(j\) samples. Consequently, numpy.corrcoef is a matrix as well.

cov_nm_wfl = np.cov(data["nm"], data["wfl"])
print(f"{cov_nm_wfl[0, 1] = }")

corr_nm_wfl = np.corrcoef(data["nm"], data["wfl"])
print(f"{corr_nm_wfl[0, 1] = }")
cov_nm_wfl[0, 1] = np.float64(4369.1195844122)
corr_nm_wfl[0, 1] = np.float64(0.7074626685750687)

The above results, particularly \(\rho_{\text{nm},\text{wfl}}=0.707\) suggest that the higher the rent, the more space you get.

Tip

Correlation and causation are not the same thing!

Tip

We showed some basic tests for correlation, there are more elaborate methods but they are subject to a later chapter.