2  Data sets

We continue our short introduction by looking at data sets, that is, lists of values that we want to describe more closely.

We use the file miete03.asc from Open Data LMU. The data set contains information about rents in Munich in the year 2003. The columns have the following meaning, see DETAILS:

Variable description:

nm          Net rent in EUR
nmqm        Net rent per m² in EUR
wfl         Living area in m²
rooms       Number of rooms in the flat
bj          Year of construction of the flat
bez         City district
wohngut     Good residential location? (Y=1,N=0)
wohnbest    Best residential location? (Y=1,N=0)
ww0         Hot water supply available? (Y=0,N=1)
zh0         Central heating available? (Y=0,N=1)
badkach0    Tiled bathroom? (Y=0,N=1)
badextra    Special additional bathroom fixtures? (Y=1,N=0)
kueche      Upscale kitchen? (Y=1,N=0)

For now, we just show the code without much explanation, because we want to jump right in rather than delve into how it works. We use a NumPy structured array for this.

import numpy as np
import requests
import io
response = requests.get("https://data.ub.uni-muenchen.de/2/1/miete03.asc")

# Transform the content of the file into a numpy.ndarray
data = np.genfromtxt(io.BytesIO(response.content), names=True)
# Access the data types and names
print(f"{data.dtype=}")
# Access the second element (row)
print(f"{data[1]=}")
# Access a name
print(f"{data['rooms']=}")
data.dtype=dtype([('nm', '<f8'), ('nmqm', '<f8'), ('wfl', '<f8'), ('rooms', '<f8'), ('bj', '<f8'), ('bez', '<f8'), ('wohngut', '<f8'), ('wohnbest', '<f8'), ('ww0', '<f8'), ('zh0', '<f8'), ('badkach0', '<f8'), ('badextra', '<f8'), ('kueche', '<f8')])
data[1]=np.void((715.82, 11.01, 65.0, 2.0, 1995.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), dtype=[('nm', '<f8'), ('nmqm', '<f8'), ('wfl', '<f8'), ('rooms', '<f8'), ('bj', '<f8'), ('bez', '<f8'), ('wohngut', '<f8'), ('wohnbest', '<f8'), ('ww0', '<f8'), ('zh0', '<f8'), ('badkach0', '<f8'), ('badextra', '<f8'), ('kueche', '<f8')])
data['rooms']=array([2., 2., 3., ..., 3., 1., 3.])

Now that we have some data, we can look at it more closely. For this we interpret a column as a vector.
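
To make the structured array a little less opaque, the following minimal sketch parses a small text with np.genfromtxt in the same way as above, but from an in-memory string instead of a download. The first data row reuses values printed for data[1] above; the second row is made up purely for illustration.

```python
import io
import numpy as np

# Tiny stand-in for the real file: one header line plus two rows.
# Row 1 reuses values printed for data[1] above; row 2 is invented.
text = """nm nmqm wfl rooms
715.82 11.01 65 2
400.00 10.00 40 1
"""

sample = np.genfromtxt(io.StringIO(text), names=True)

print(sample.dtype.names)   # field (column) names taken from the header
print(sample["nm"])         # one column as a plain float array
print(sample[0])            # one row as a record
print(sample[0]["rooms"])   # a single entry: row 0, field 'rooms'
```

Indexing with a field name returns a whole column, indexing with an integer returns a whole row, and the two can be combined to reach a single entry.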

2.1 Basic properties of a data set

First we look at the total net rent, i.e. the column nm.

For a vector \(v \in \mathbb{R}^n\) we have:

  • the maximal value, i.e. the maximum \[ v^{max} = \max_i v_i, \]
  • the minimal value, i.e. the minimum \[ v^{min} = \min_i v_i, \]
  • the mean of all values (often called the arithmetic mean) \[ \overline{v} = \frac1n \sum_{i=1}^n v_i = \frac{v_1 + v_2 + \cdots + v_n}{n}, \]
  • the median, i.e. the value for which half of the other values are larger and the other half are smaller; for a sorted \(v\) this is \[ \widetilde{v} = \begin{cases} v_{(n+1)/2}& n\quad \text{odd}\\ \frac{v_{n/2} + v_{n/2+1}}{2}& n\quad \text{even} \end{cases}, \]
  • more generally, the quantiles: for a sorted \(v\) and \(p\in(0,1)\) \[ \overline{v}_p = \begin{cases} \frac12\left(v_{np} + v_{np+1}\right) & np \in \mathbb{N}\\ v_{\lfloor np+1\rfloor} & np \not\in \mathbb{N} \end{cases}. \] Some quantiles have special names: the median for \(p=0.5\), and the lower and upper quartile for \(p=0.25\) and \(p=0.75\), respectively (also called the first and third quartile, with the median as the second quartile).
nm_max = np.max(data['nm'])
print(f"{nm_max=}")

nm_min = np.min(data['nm'])
print(f"{nm_min=}")

nm_mean = np.mean(data['nm'])
# round to 2 digits
nm_mean_r = np.around(nm_mean, 2)
print(f"{nm_mean_r=}")

nm_median = np.median(data['nm'])
print(f"{nm_median=}")

nm_quartiles = np.quantile(data['nm'], [1/4, 1/2, 3/4])
print(f"{nm_quartiles=}")
nm_max=np.float64(1789.55)
nm_min=np.float64(77.31)
nm_mean_r=np.float64(570.09)
nm_median=np.float64(534.3)
nm_quartiles=array([389.95, 534.3 , 700.48])
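
One caveat: by default np.quantile interpolates linearly between neighbouring values, which is not the same rule as the case distinction in the quantile definition above. The following sketch implements that definition directly and compares both on a tiny made-up vector. The helper quantile_by_definition is a hypothetical name introduced only for this illustration.

```python
import math
import numpy as np

def quantile_by_definition(values, p):
    """Quantile following the case distinction given in the text.

    Hypothetical helper for illustration, not part of numpy.
    """
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    k = n * p
    if math.isclose(k, round(k)):
        k = int(round(k))
        # n*p is an integer: average v_k and v_{k+1} (1-based),
        # i.e. v[k-1] and v[k] with 0-based indexing.
        return 0.5 * (v[k - 1] + v[k])
    # Otherwise take v_{floor(n*p + 1)} (1-based) = v[floor(n*p)] (0-based).
    return v[math.floor(k)]

toy = [1, 2, 3, 4]                       # made-up values
q_def = quantile_by_definition(toy, 0.25)
q_np = np.quantile(toy, 0.25)            # numpy default: linear interpolation
print(q_def, q_np)
```

For this toy vector the definition gives 1.5, while the numpy default returns 1.75. For a data set as large as ours the two rules usually lie close together, and np.quantile offers other rules through its method argument.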

From this Python snippet we know that the net rent varied between 77.31 and 1789.55 EUR, with a mean of 570.09 EUR and a median of 534.3 EUR. Of course there are trickier questions that require us to dig a bit deeper into these functions, e.g. how many rooms does the most expensive flat have? The surprising answer is 3, and it was built in 1994, but how do we obtain these results?

We can use numpy.argwhere or a function which returns the index directly like numpy.argmax.

max_index = np.argmax(data['nm'])
rooms = int(data['rooms'][max_index])
year = int(data['bj'][max_index])
print(f"{rooms=}, {year=}")
rooms=3, year=1994
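
The two functions mentioned above behave differently when the maximum occurs more than once: numpy.argmax returns only the first index, while numpy.argwhere returns every index that satisfies a condition. A minimal sketch on made-up values with a tie:

```python
import numpy as np

rent = np.array([500.0, 900.0, 650.0, 900.0])   # made-up values with a tie

first_max = np.argmax(rent)                     # index of the first maximum
all_max = np.argwhere(rent == rent.max())       # every index where the maximum occurs

print(first_max)
print(all_max.ravel())   # argwhere returns one row per hit, so flatten it
```

If only one flat attains the maximum, both approaches point to the same row.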

2.1.1 Visualization

Tip

There are various ways of visualizing data in Python. Two widely used packages are matplotlib and plotly.

It often helps to visualize the values to see differences and get an idea of their use.

Show the code for the figure
import matplotlib.pyplot as plt
nm_sort = np.sort(data["nm"])
x = np.linspace(0, 1, len(nm_sort), endpoint=True,)

plt.plot(x, nm_sort, label="net rent")
plt.axis((0, 1, np.round(nm_min/100)*100, np.round(nm_max/100)*100))
plt.xlabel('Scaled index')
plt.ylabel('Net rent - nm')

plt.plot([0, 0.25, 0.25], [nm_quartiles[0], nm_quartiles[0], nm_min], 
         label='1st quartile')
plt.plot([0, 0.5, 0.5], [nm_quartiles[1], nm_quartiles[1], nm_min],
         label='2nd quartile')
plt.plot([0, 0.75, 0.75], [nm_quartiles[2], nm_quartiles[2], nm_min],
         label='3rd quartile')
plt.plot([0, 1], [nm_mean, nm_mean],
         label='mean')
plt.legend()
plt.show()
Figure 2.1: Visualization of the different measurements.

What is shown in Figure 2.1 is often combined into a single boxplot (see Figure 2.2) that provides more information at once.

Show the code for the figure
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Box(y=data["nm"], name="Standard"))
fig.add_trace(go.Box(y=data["nm"], name="With points", boxpoints="all"))
fig.show()
Figure 2.2: Boxplot done in plotly with whiskers following 3/2 IQR.

The plot contains the box, which is bounded by the first quartile \(Q_1\) and the third quartile \(Q_3\), with the median drawn as a line in between. Furthermore, we can see the whiskers, which help us identify so-called outliers. By default they reach at most \(1.5(Q_3 - Q_1)\) below \(Q_1\) and above \(Q_3\), where \(Q_3 - Q_1\) is often called the interquartile range (IQR).
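
The whisker rule can be reproduced by hand. The sketch below uses made-up values and np.quantile with its default settings; plotly may compute the quartiles with a slightly different rule, so numbers read off the figure need not match exactly.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)  # made-up data

q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme data points inside the fences.
print(f"{q1=}, {q3=}, {iqr=}")
print(f"whiskers from {inside.min()} to {inside.max()}")
print(f"{outliers=}")
```

Here the fences are at -3.5 and 14.5, so the whiskers end at 1 and 9, and the value 30 is flagged as an outlier.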

Note

Figure 2.2 is an interactive plot in the html version.

2.2 Spread

The spread (or dispersion, variability, scatter) describes how widely the values of a data set are distributed. Common measures of spread are the variance, the standard deviation, and the interquartile range that we have already seen above.

Definition 2.1 (Variance) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the variance is defined as \[ \operatorname{Var}(v) = \frac1n \sum_{i=1}^n (v_i - \mu)^2, \quad \mu = \overline{v} \quad\text{(the mean)} \] or directly \[ \operatorname{Var}(v) = \frac{1}{n^2} \sum_{i=1}^n\sum_{j>i} (v_i - v_j)^2. \]

Definition 2.2 (Standard deviation) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the standard deviation is defined as \[ \sigma = \sqrt{\frac1n \sum_{i=1}^n (v_i - \mu)^2}, \quad \mu = \overline{v} \quad\text{(the mean)}. \] If we interpret \(v\) as a sample this is often also called uncorrected sample standard deviation.

Definition 2.3 (Interquartile range (IQR)) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the interquartile range is defined as the difference between the third and the first quartile, i.e. \[ IQR = \overline{v}_{0.75} - \overline{v}_{0.25}. \]

With numpy they are computed as follows:

nm_var = np.var(data["nm"])
print(f"{nm_var=}")

nm_std = np.std(data["nm"])
print(f"{nm_std=}")

nm_IQR = nm_quartiles[2] - nm_quartiles[0]
print(f"{nm_IQR=}")
nm_var=np.float64(60208.75551600402)
nm_std=np.float64(245.37472468859548)
nm_IQR=np.float64(310.53000000000003)
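
As a sanity check, the two forms of the variance in Definition 2.1 can be compared numerically. The sketch below uses a small made-up vector and also shows the ddof argument of np.std, which switches from the uncorrected (divide by \(n\)) to the corrected (divide by \(n-1\)) sample standard deviation.

```python
import numpy as np

v = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values
n = len(v)

# Definition 2.1, first form: mean squared deviation from the mean.
var_mean_form = np.sum((v - v.mean()) ** 2) / n

# Definition 2.1, second form: squared differences over all pairs i < j.
diff = v[:, None] - v[None, :]
var_pair_form = np.sum(np.triu(diff, k=1) ** 2) / n**2

print(var_mean_form, var_pair_form, np.var(v))

# np.std divides by n by default (uncorrected); ddof=1 divides by n - 1.
print(np.std(v), np.std(v, ddof=1))
```

All three variance values agree (4.0 for this vector), and the corrected standard deviation is slightly larger than the uncorrected one.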

2.3 Histogram

When exploring data it is also quite useful to draw histograms. For the net rent this does not make much sense, but for the number of rooms it is useful.

Show the code for the figure
plt.hist(data['rooms'])
plt.xlabel('rooms')
plt.ylabel('# of flats')
plt.show()
Figure 2.3: Histogram of the number of rooms in our dataset.

What we see in Figure 2.3 is simply the number of occurrences of each room count from \(1\) to \(6\) in our dataset. Already we can see something rather interesting: there are flats with \(5.5\) rooms in our dataset.
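
Because plt.hist groups the values into 10 equally wide bins by default, the exact counts per room number are easier to read off with np.unique. A minimal sketch on made-up room counts (for the real data one would pass data['rooms'] instead):

```python
import numpy as np

rooms = np.array([1, 2, 2, 3, 3, 3, 4, 5.5])   # made-up room counts

# Distinct values in sorted order together with how often each occurs.
room_values, counts = np.unique(rooms, return_counts=True)
for value, count in zip(room_values, counts):
    print(f"{value} rooms: {count} flat(s)")
```

Half rooms such as 5.5 show up as their own entry here instead of being merged into a neighbouring bin.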

Another helpful histogram is Figure 2.4, showing the number of buildings built per year.

Show the code for the figure
plt.hist(data['bj'])
plt.xlabel('year of building')
plt.ylabel('# of buildings')
plt.show()
Figure 2.4: Histogram of buildings built per year.

2.4 Correlation

In statistics, the terms correlation or dependence describe any statistical relationship between bivariate data (data that is paired) or random variables.

For our dataset we can, for example, check:

  1. the living area in \(m^2\) - wfl vs. the net rent - nm
  2. the year of construction - bj vs. if central heating - zh0 is available
  3. the year of construction - bj vs. the city district - bez
Show the code for the figure
from plotly.subplots import make_subplots

fig = make_subplots(rows=3, cols=1)

fig.add_trace(go.Scatter(x=data["wfl"], y=data["nm"], mode="markers"),
                row=1, col=1)
fig.update_xaxes(title_text="living area in m^2", row=1, col=1)
fig.update_yaxes(title_text="net rent", row=1, col=1)

fig.add_trace(go.Scatter(x=data["bj"], y=data["zh0"], mode="markers"),
                row=2, col=1)
fig.update_xaxes(title_text="year of construction", row=2, col=1)
fig.update_yaxes(title_text="central heating", row=2, col=1)

fig.add_trace(go.Scatter(x=data["bj"], y=data["bez"], mode="markers"),
                row=3, col=1)
fig.update_xaxes(title_text="year of construction", row=3, col=1)
fig.update_yaxes(title_text="city district", row=3, col=1)

fig.show()
Figure 2.5: Scatterplot to investigate correlations in the data set.
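
A scatterplot gives a visual impression only. One common way to put a number on a linear relationship is the Pearson correlation coefficient, which is not introduced in the text above and is used here as an assumption about what one might compute next. The sketch uses made-up values for living area and net rent; for the real data one would pass data['wfl'] and data['nm'].

```python
import numpy as np

area = np.array([30.0, 45.0, 60.0, 75.0, 90.0])       # made-up living areas
rent = np.array([350.0, 480.0, 600.0, 690.0, 850.0])  # made-up net rents

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient between the two vectors.
r = np.corrcoef(area, rent)[0, 1]

# Same quantity written out: covariance divided by both standard deviations.
r_manual = np.mean((area - area.mean()) * (rent - rent.mean())) / (
    np.std(area) * np.std(rent)
)
print(r, r_manual)
```

Values close to 1 or -1 indicate a strong linear relationship, values close to 0 a weak one. For a binary variable such as zh0 or a categorical one such as bez the coefficient is much harder to interpret.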