Data Management and Data Engineering

In the previous sections we discussed classification with the help of various examples. We did not talk about how we provide the observations/data for training, validation, and testing. They were just there or, more precisely, we just loaded them. We also did not talk about how to persist the model we generated: after the program finished, the model was gone and had to be retrained before it could be used again. Nevertheless, data management and data engineering are essential aspects of the topics at hand. As with other topics discussed in these notes, we cannot hope to cover the entire field comprehensively, but we can provide an introduction and highlight the basic concepts that help us get started. Therefore, we will focus on a couple of aspects and use only one specific software solution for the implementation. On the one hand, this limits us and the notes might be outdated sooner; on the other hand, in the spirit of a practical introduction, it allows us to work on the topics directly. The intent is to see it in action and to help understand the practical aspects of the challenges within these topics.

There are several trivial but important key aspects to managing data:

- without data we cannot hope to train a model
- we need to keep track of the state of our data and of the changes made to it, both to allow reproducibility and to detect drifts¹
- training times are often long, so we need to store the resulting model together with the code and the data we used for training

and so much more.
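As a minimal sketch of the second and third points, the following snippet records a fingerprint of the training data next to a persisted model, so that we can later check whether the data has changed since training. It uses only the Python standard library. The file names, the `code_version` label, and the dictionary standing in for a trained model are assumptions made for illustration; this is not the software solution used later in this chapter.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_model_with_metadata(model, data_path: Path, out_dir: Path,
                             code_version: str) -> Path:
    """Persist a model together with a JSON record of the data and code used."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "model.pkl").open("wb") as handle:
        pickle.dump(model, handle)
    metadata = {
        "data_file": data_path.name,
        "data_sha256": file_sha256(data_path),
        # Hypothetical label, for example a git commit hash of the training code.
        "code_version": code_version,
    }
    meta_path = out_dir / "model.json"
    meta_path.write_text(json.dumps(metadata, indent=2))
    return meta_path


def data_unchanged(data_path: Path, meta_path: Path) -> bool:
    """Check whether the data file still matches the recorded fingerprint."""
    recorded = json.loads(meta_path.read_text())
    return recorded["data_sha256"] == file_sha256(data_path)


def demo() -> tuple:
    """Run the round trip in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        data_path = tmp_dir / "train.csv"
        data_path.write_text("x,y\n1,0\n2,1\n")
        model = {"weights": [0.5, -0.25]}  # placeholder for a trained model
        meta_path = save_model_with_metadata(
            model, data_path, tmp_dir / "run", "abc123"
        )
        before = data_unchanged(data_path, meta_path)
        data_path.write_text("x,y\n1,0\n2,0\n")  # a single label is flipped
        after = data_unchanged(data_path, meta_path)
        return before, after


if __name__ == "__main__":
    print(demo())
```

A hash only tells us that the data changed, not how it changed. Detecting a drift would additionally require comparing the content of the two data states.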

The last step in 3 Semi-Supervised learning is a useful illustration of the importance of having correct data. We could see that with wrong labels we cannot hope to generate correct results. Consequently, if we cannot say for sure whether the labels used for training three months ago were correct, we cannot validate the results.

In addition, if we move to images, we could also see how changing the basis (raw to wavelet) changed our results; see the cats and dogs example in the introduction to clustering and classification. As each pixel of an image can be considered a feature for our training, we can imagine that changing the pixels by only a small amount might change our model and its performance. Now consider that most formats for storing images include some kind of image compression (some are discussed in Kandolf (2025)). This can have a major influence on the training and the resulting model. Therefore, we also need to keep track of how these features (our images) are generated and stored if we want to make sure we do not get unexpected behaviour in our results.
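To get a feeling for how storage can alter the features, the following sketch uses a crude quantization of pixel values as a stand-in for lossy compression. This is an assumption made purely for illustration: real image formats use far more elaborate schemes, and the tiny image below is hypothetical.

```python
def quantize(pixels, step=16):
    """Round each 8-bit pixel value down to a multiple of `step`.

    A crude stand-in for lossy compression, not a real image codec.
    """
    return [(value // step) * step for value in pixels]


def changed_features(original, stored):
    """Count the features (pixels) whose value differs after storage."""
    return sum(1 for a, b in zip(original, stored) if a != b)


def max_abs_difference(original, stored):
    """Return the largest absolute change of a single feature."""
    return max(abs(a - b) for a, b in zip(original, stored))


# A tiny hypothetical 2x4 grayscale image, flattened into a feature vector.
image = [0, 17, 34, 51, 128, 130, 200, 255]
stored = quantize(image)

print(stored)                             # [0, 16, 32, 48, 128, 128, 192, 240]
print(changed_features(image, stored))    # 6
print(max_abs_difference(image, stored))  # 15
```

Six of the eight features change even though the stored image would look almost identical to the eye. Whether such changes matter for a given model depends on the model and the data, which is exactly why the storage step should be recorded.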

The field of data management and data engineering is not new, but it has received a lot of attention in the machine learning age. It has also spawned several (research) fields, which are often captured under the umbrella of data science.

A nice deep dive into the topic that focuses on concepts rather than technologies is Reis (2022). It builds on the so-called data engineering lifecycle illustrated in Figure 1.

Figure 1: Illustration of a common data engineering lifecycle in data processing. The *related topics* are often referred to as *undercurrents*.

In this chapter we will see some aspects of the lifecycle in action. We will also integrate our code for model and data generation into this framework. Several aspects of the related topics are very important but cannot be fully covered here; we will highlight the sections where they matter.

So let us start and dive into the shallows of data management.

¹ We talk about a drift as an evolution of data that invalidates the data model, see Wikipedia, accessed on 21.03.2025.