We continue our little introduction by looking at data sets, in the sense of a list of values that we want to describe more closely.
We use miete03.asc from Open Data LMU. The data set contains information about rents in Munich from the year 2003. The columns have the following meaning (see DETAILS):
Variable description:
nm: net rent in EUR
nmqm: net rent per m² in EUR
wfl: living area in m²
rooms: number of rooms in the flat
bj: year of construction of the flat
bez: city district
wohngut: good residential location? (Y=1, N=0)
wohnbest: best residential location? (Y=1, N=0)
ww0: hot water supply available? (Y=0, N=1)
zh0: central heating available? (Y=0, N=1)
badkach0: tiled bathroom? (Y=0, N=1)
badextra: special additional fittings in the bathroom? (Y=1, N=0)
kueche: upscale kitchen? (Y=1, N=0)
For now, we just show the code without much explanation, because we want to jump right in rather than delve into how it works. We use a NumPy structured array to hold the data.
```python
import numpy as np
import requests
import io

response = requests.get("https://data.ub.uni-muenchen.de/2/1/miete03.asc")
# Transform the content of the file into a numpy.ndarray
data = np.genfromtxt(io.BytesIO(response.content), names=True)
# Access the data types and names
print(f"{data.dtype=}")
# Access the second element (row)
print(f"{data[1]=}")
# Access a name
print(f"{data['rooms']=}")
```
Now that we have some data, we can look at it more closely. For this we interpret a column (a named field such as nm) as a vector.
2.1 Basic properties of a data set
First we look at the total net rent, i.e. the column nm.
For a vector \(v \in \mathbb{R}^n\) we have:
the maximal value, i.e. the maximum \[
v^{max} = \max_i v_i,
\]
the minimal value, i.e. the minimum \[
v^{min} = \min_i v_i,
\]
the mean of all values (often called the arithmetic mean) \[
\overline{v} = \frac1n \sum_{i=1}^n v_i = \frac{v_1 + v_2 + \cdots + v_n}{n},
\]
the median, i.e. the value for which half of the other values are larger and the other half are smaller; for a sorted \(v\) this is \[
\widetilde{v} = \begin{cases}
v_{(n+1)/2}& n\quad \text{odd}\\
\frac{v_{n/2} + v_{n/2+1}}{2}& n\quad \text{even}
\end{cases},
\]
more generally, we have quantiles. For a sorted \(v\) and \(p\in(0,1)\) \[
\overline{v}_p = \begin{cases}
\frac12\left(v_{np} + v_{np+1}\right) & pn \in \mathbb{N}\\
v_{\lfloor np+1\rfloor} & pn \not\in \mathbb{N}
\end{cases}.
\] Some quantiles have special names: the median for \(p=0.5\), and the lower and upper quartile for \(p=0.25\) and \(p=0.75\), respectively (also called the first and third quartile, with the median as the second quartile).
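As a minimal sketch of the quantities above, we can evaluate them with NumPy on a small made-up vector. The values in `v` are hypothetical and not taken from the rent data, and `quantile_sorted` is an illustrative helper that follows the case distinction in the quantile formula; it is not a NumPy function.

```python
import math
import numpy as np

# Hypothetical sample values, chosen only for illustration.
v = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.5, 6.0, 5.0])


def quantile_sorted(values, p):
    """Quantile following the case distinction in the text (1-based indices)."""
    s = np.sort(values)
    k = len(s) * p
    if math.isclose(k, round(k)):
        # n*p is an integer: average v_{np} and v_{np+1}
        k = int(round(k))
        return 0.5 * (s[k - 1] + s[k])
    # n*p is not an integer: take v_{floor(np + 1)}
    return s[math.floor(k + 1) - 1]


v_max = np.max(v)
v_min = np.min(v)
v_mean = np.mean(v)
v_median = np.median(v)
q1 = quantile_sorted(v, 0.25)
q3 = quantile_sorted(v, 0.75)

print(f"{v_max=}, {v_min=}, {v_mean=}, {v_median=}, {q1=}, {q3=}")
```

For the rent data, `v` would simply be replaced by `data["nm"]`.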
Evaluating these quantities for nm with NumPy, we find that for tenants the rent varied between 77.31 and 1789.55 EUR, with a mean of 570.09 and a median of 534.3. Of course, there are trickier questions that require us to dig a bit deeper into these functions, e.g. how many rooms does the most expensive flat have? The surprising answer is 3, and it was built in 1994. But how do we obtain these results?
We can use numpy.argwhere or a function which returns the index directly like numpy.argmax.
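The following sketch shows both approaches on a tiny structured array. The array `flats` and its values are hypothetical; only the field names nm, rooms and bj are borrowed from the rent data.

```python
import numpy as np

# Hypothetical mini data set with field names borrowed from the rent data.
flats = np.array(
    [(450.0, 2.0, 1965.0), (1200.0, 5.0, 2001.0), (700.0, 4.0, 1980.0)],
    dtype=[("nm", "f8"), ("rooms", "f8"), ("bj", "f8")],
)

# numpy.argmax returns the index of the (first) maximum directly.
i_max = np.argmax(flats["nm"])

# numpy.argwhere returns all indices where the condition holds,
# as an array with one row per match.
idx = np.argwhere(flats["nm"] == flats["nm"].max())

# Use the index to read the other fields of the same flat.
print(f"rooms: {flats['rooms'][i_max]}, built: {flats['bj'][i_max]}")
```

With the full data set, the same indexing on `data` should answer the question about the most expensive flat.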
Figure 2.1: Visualization of the different measurements.
What is shown in Figure 2.1 is often combined into a single boxplot (see Figure 2.2), which provides much more information at once.
```python
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Box(y=data["nm"], name="Standard"))
fig.add_trace(go.Box(y=data["nm"], name="With points", boxpoints="all"))
fig.show()
```
Figure 2.2: Boxplot done in plotly with whiskers following 3/2 IQR.
The plot contains the box, which is bounded by the first quartile \(Q_1\) and the third quartile \(Q_3\), with the median as a line in between. Furthermore, we can see the whiskers, which help us identify so-called outliers. By default, the whiskers reach at most \(1.5\,(Q_3 - Q_1)\) below \(Q_1\) and above \(Q_3\); points beyond them are treated as outliers. The difference \(Q_3 - Q_1\) is often called the interquartile range (IQR).
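To make the whisker rule concrete, here is a small sketch that computes the bounds and the resulting outliers by hand. The values in `v` are hypothetical, and the quartiles come from `np.percentile` with its default interpolation, which need not match exactly what the plotting library uses.

```python
import numpy as np

# Hypothetical values with one obvious outlier.
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 30.0])

q1, q3 = np.percentile(v, [25, 75])
iqr = q3 - q1

# Bounds at 1.5 IQR below Q1 and above Q3.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside the bounds are the outliers.
outliers = v[(v < lower_bound) | (v > upper_bound)]

# The whiskers end at the most extreme points that are still inside the bounds.
whisker_low = v[v >= lower_bound].min()
whisker_high = v[v <= upper_bound].max()

print(f"{q1=}, {q3=}, {iqr=}, {outliers=}")
```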
Note
Figure 2.2 is an interactive plot in the html version.
2.2 Spread
Measures of spread (or dispersion, variability, scatter) are used in statistics to describe how widely data is distributed. Common examples are the variance, the standard deviation, and the interquartile range that we have already seen above.
Definition 2.1 (Variance) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the variance is defined as \[
\operatorname{Var}(v) = \frac1n \sum_{i=1}^n (v_i - \mu)^2, \quad \mu = \overline{v} \quad\text{(the mean)}
\] or directly \[
\operatorname{Var}(v) = \frac{1}{n^2} \sum_{i=1}^n\sum_{j>i} (v_i - v_j)^2.
\]
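As a quick numerical check, under the assumption of a small made-up vector `v`, both forms of the variance can be evaluated and compared with `np.var`, which divides by \(n\) by default:

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
v = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(v)

# Form 1: mean squared deviation from the mean.
mu = v.mean()
var_mean_form = np.sum((v - mu) ** 2) / n

# Form 2: squared differences over all pairs i < j, divided by n^2.
diff = v[:, None] - v[None, :]
var_pair_form = np.sum(np.triu(diff, k=1) ** 2) / n**2

# NumPy's variance (ddof=0 by default, i.e. division by n).
var_numpy = np.var(v)

print(f"{var_mean_form=}, {var_pair_form=}, {var_numpy=}")
```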
Definition 2.2 (Standard deviation) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the standard deviation is defined as \[
\sigma = \sqrt{\frac1n \sum_{i=1}^n (v_i - \mu)^2}, \quad \mu = \overline{v} \quad\text{(the mean)}.
\] If we interpret \(v\) as a sample, this is often also called the uncorrected sample standard deviation.
Definition 2.3 (Interquartile range (IQR)) For a finite set represented by a vector \(v\in\mathbb{R}^n\) the interquartile range is defined as the difference between the third and the first quartile, i.e. \[
IQR = \overline{v}_{0.75} - \overline{v}_{0.25}.
\]
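The two remaining measures can be sketched in the same way. Again the vector `v` is hypothetical, and `quantile_sorted` is an illustrative helper that follows the quantile definition from Section 2.1 rather than a library routine.

```python
import math
import numpy as np

# Hypothetical values, chosen only for illustration.
v = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])


def quantile_sorted(values, p):
    """Quantile following the definition in Section 2.1 (1-based indices)."""
    s = np.sort(values)
    k = len(s) * p
    if math.isclose(k, round(k)):
        k = int(round(k))
        return 0.5 * (s[k - 1] + s[k])
    return s[math.floor(k + 1) - 1]


# Uncorrected standard deviation (division by n), NumPy's default.
sigma = np.std(v)

# For comparison: the corrected sample standard deviation divides by n - 1.
sigma_corrected = np.std(v, ddof=1)

# Interquartile range: third quartile minus first quartile.
iqr = quantile_sorted(v, 0.75) - quantile_sorted(v, 0.25)

print(f"{sigma=}, {sigma_corrected=}, {iqr=}")
```

Note that `np.percentile` interpolates between values by default, so it can return a slightly different IQR than the definition used here.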
When exploring data, it is also quite useful to draw histograms. For the net rent this does not make much sense, but for the number of rooms it is useful.
```python
import matplotlib.pyplot as plt

# Histogram of the number of rooms per flat
plt.hist(data['rooms'])
plt.xlabel('rooms')
plt.ylabel('# of flats')
plt.show()
```
Figure 2.3: Histogram of the number of rooms in our dataset.
What we see in Figure 2.3 is simply the number of occurrences of each room count from \(1\) to \(6\) in our dataset. Already we can see something rather interesting: there are flats with \(5.5\) rooms in our dataset.
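If we want the counts as numbers rather than bars, one option is `np.unique` with `return_counts=True`. The array `rooms` below is a hypothetical stand-in for `data['rooms']`.

```python
import numpy as np

# Hypothetical room counts, including one half room as in the real data.
rooms = np.array([1, 2, 2, 3, 3, 3, 4, 5.5, 2, 3])

# Distinct values and how often each of them occurs.
values, counts = np.unique(rooms, return_counts=True)

for value, count in zip(values, counts):
    print(f"{value:g} rooms: {count} flats")
```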
Another helpful histogram is Figure 2.4, which shows the number of buildings built per year.
```python
import matplotlib.pyplot as plt

# Histogram of the year of construction
plt.hist(data['bj'])
plt.xlabel('year of building')
plt.ylabel('# of buildings')
plt.show()
```
Figure 2.4: Histogram of buildings built per year.
2.4 Correlation
In statistics, the terms correlation or dependence describe any statistical relationship between bivariate data (data that is paired) or random variables.
For our dataset we can, for example, check:
the living area in m² (wfl) vs. the net rent (nm)
the year of construction (bj) vs. whether central heating (zh0) is available
the year of construction (bj) vs. the city district (bez)
```python
from plotly.subplots import make_subplots

fig = make_subplots(rows=3, cols=1)

# Living area vs. net rent
fig.add_trace(go.Scatter(x=data["wfl"], y=data["nm"], mode="markers"), row=1, col=1)
fig.update_xaxes(title_text="living area in m^2", row=1, col=1)
fig.update_yaxes(title_text="net rent", row=1, col=1)

# Year of construction vs. central heating
fig.add_trace(go.Scatter(x=data["bj"], y=data["zh0"], mode="markers"), row=2, col=1)
fig.update_xaxes(title_text="year of construction", row=2, col=1)
fig.update_yaxes(title_text="central heating", row=2, col=1)

# Year of construction vs. city district
fig.add_trace(go.Scatter(x=data["bj"], y=data["bez"], mode="markers"), row=3, col=1)
fig.update_xaxes(title_text="year of construction", row=3, col=1)
fig.update_yaxes(title_text="city district", row=3, col=1)

fig.show()
```
Figure 2.5: Scatterplot to investigate correlations in the data set.
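Besides looking at scatterplots, a relationship can also be summarized in a single number. One common choice, which is not defined in the text above and is used here only as an illustration, is the Pearson correlation coefficient, available via `np.corrcoef`. The arrays `area` and `rent` are hypothetical stand-ins for `data["wfl"]` and `data["nm"]`.

```python
import numpy as np

# Hypothetical paired values: living area in m^2 and net rent in EUR.
area = np.array([30.0, 50.0, 70.0, 90.0])
rent = np.array([300.0, 480.0, 650.0, 900.0])

# np.corrcoef returns the correlation matrix; the off-diagonal entry
# is the Pearson correlation coefficient between the two vectors.
r = np.corrcoef(area, rent)[0, 1]

# The same value by hand: covariance divided by the product of the
# standard deviations (both with division by n).
cov = np.mean((area - area.mean()) * (rent - rent.mean()))
r_manual = cov / (np.std(area) * np.std(rent))

print(f"{r=}, {r_manual=}")
```

A value close to \(1\) or \(-1\) indicates a strong linear relationship, a value close to \(0\) indicates none; it says nothing about nonlinear dependence.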