11 Data Preparation
Now that we know how to design, train, and work with neural networks, we need to talk about one of the most important topics: data. We have seen throughout these notes that the algorithms need data to learn from. This includes labelled and unlabelled data.
There are several topics that need to be addressed when a project does not simply amount to training on readily available datasets. We briefly touch on several important topics (some of which we have already encountered) before discussing a selection in more detail. Quite often a data pipeline consists of the steps listed below (the order may differ).
We list here some general techniques with a focus on image processing. Depending on the domain in question there are often domain-specific challenges to overcome. This includes the domain of processing (computer vision, NLP, time series) but also the domain of your data. The challenges in manufacturing industries are different from those in astrophysics, medicine or metrology.
Cleaning and preprocessing: The raw data often needs some cleaning and preprocessing. For images this starts with an appropriate image format - remember that most formats include some form of compression, which can lead to artifacts. We might also need to handle missing, blurry, out-of-focus, fuzzy, hazy, or grainy images. It might also be useful to detect the region of interest and crop the image to the appropriate dimensions to reduce the size of our problem already at this stage. The main idea of this step is to have data that can be processed in a standardized way.
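As an illustration, a minimal preprocessing sketch using torchvision's `transforms.v2` could look as follows; the target sizes and the RGB conversion are assumptions chosen for illustration, not a prescription.

```python
# A minimal preprocessing sketch: unify format, size and value range.
from pathlib import Path

import torch
from PIL import Image
from torchvision.transforms import v2

preprocess = v2.Compose([
    v2.ToImage(),                           # PIL image -> tensor image
    v2.Resize(256),                         # unify the shorter side
    v2.CenterCrop(224),                     # crude region-of-interest crop
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float [0, 1]
])

def load_standardized(path: Path) -> torch.Tensor:
    img = Image.open(path).convert("RGB")   # drop alpha channel, unify colour mode
    return preprocess(img)
```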
Dataset splitting: The split of the available data is important. Already at this stage it is essential that, for example, the test dataset is never touched during training or validation.
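A minimal sketch of such a split with PyTorch; the placeholder dataset, the 80/10/10 ratio and the seed are arbitrary choices for illustration.

```python
# A minimal splitting sketch with a fixed seed for reproducibility.
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset: 1000 random "images" with random binary labels.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randint(0, 2, (1000,)))

generator = torch.Generator().manual_seed(42)      # reproducible split
train_set, val_set, test_set = random_split(
    dataset, [0.8, 0.1, 0.1], generator=generator  # fractional lengths need a recent PyTorch
)
# test_set is set aside here and only used for the final evaluation.
```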
Class imbalance: A real-world dataset is often not balanced, or not as balanced as it should be for good learning behaviour in NNs. The training techniques need to be adapted, or the selection of samples needs to be restricted. We discussed some of these problems in Section 2.2. Some of the data augmentation techniques we discuss below in Section 11.1 can help to mitigate these problems.
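One common way to adapt the training is to oversample rare classes with a weighted sampler; a minimal sketch, assuming `train_set` (e.g. from the split above) yields `(image, label)` pairs with integer class labels.

```python
# Oversample rare classes by drawing samples with inverse-frequency weights.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor([int(label) for _, label in train_set])
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()  # rare classes get larger weights

sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(sample_weights), replacement=True
)
train_loader = DataLoader(train_set, batch_size=32, sampler=sampler)
```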
Feature engineering: While NNs can be used to detect the important features, this does not mean that we can skip feature engineering for every application. We might want to create new features from existing ones (e.g. transforming to a wavelet basis to highlight vertical and horizontal features) or reduce the data to the most important features, like only the edges.
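As a small example of reducing images to edge features, a Sobel filter can be applied via a convolution; this sketch assumes greyscale tensors of shape `(1, H, W)`.

```python
# Reduce an image to its edges with a Sobel filter applied as a convolution.
import torch
import torch.nn.functional as F

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
sobel_y = sobel_x.transpose(2, 3).contiguous()

def edges(img: torch.Tensor) -> torch.Tensor:
    x = img.unsqueeze(0)                          # add a batch dimension
    gx = F.conv2d(x, sobel_x, padding=1)          # horizontal gradients
    gy = F.conv2d(x, sobel_y, padding=1)          # vertical gradients
    return torch.sqrt(gx**2 + gy**2).squeeze(0)   # gradient magnitude = edge map
```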
Labeling: The process of labeling data can be tedious and time consuming. Arguably it is the most important part, as all supervised learning relies on the quality of the labels. We discuss this in more detail below in Section 11.2.
Augmentation: By augmenting the data we have, we can also greatly influence the training of our NNs. In short, we can use geometric transformations, colour adjustments, artificial noise, and random erasing to augment our data and create new data from the NN's point of view. We discuss this step in more detail in Section 11.1.
Besides these steps, which are often part of a data preparation pipeline or directly included in the training, there are other important aspects that greatly influence our outcome.
Validation and quality assurance: It is important to have processes in place that ensure the quality of the data, especially if data is automatically generated and labelled. Making sure the labels are valid and the data quality is as desired has a major impact on the achievable performance. It is also important to keep track of how balanced our dataset is.
Scalability and efficiency: Quite likely our dataset grows over time, and this can cause problems in terms of scalability and efficiency. You might need to apply batch processing and parallelization techniques to make sure the data can be processed at all. Furthermore, the way the data is stored and provided to the NN can often become the bottleneck of training.
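On the loading side, PyTorch's `DataLoader` already offers parallel workers and prefetching that often remove this bottleneck; the worker count, batch size and prefetching below are placeholders that need tuning per machine.

```python
# Parallel data loading so that preprocessing does not starve the GPU.
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,      # load and preprocess batches in parallel worker processes
    pin_memory=True,    # speeds up host-to-GPU transfers
    prefetch_factor=2,  # batches each worker prepares ahead of time
)
```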
Ethical considerations: A bias in the dataset can create a bias in the model we train. This is especially important if we are working with sensitive information and data. Furthermore, ethical and GDPR considerations should never be ignored during data collection and use.
11.1 Data Augmentation
In the previous sections we used our image data either in the wavelet basis, raw, or, in the case of transfer learning, with some augmentations to make sure it fits the input requirements of the network.
Quite often it is advisable to (randomly) augment the data for training and sometimes also for inference. This has been shown to be an important technique to improve several aspects of neural networks, such as robustness, generalization, and resistance to overfitting.
With simple augmentations like:
- Geometric transformations: rotation, flipping, scaling, cropping, translation, etc.
- Colour adjustments: brightness, contrast, saturation/hue, monochrome images, etc.
- Artificial noise: random noise to simulate real-world imperfections like dust, glare, focus issues, etc.
- Random erasing: masking out random parts of an image can improve robustness and simulates occlusions.
- Combination of images: to create new images for training it is sometimes possible to combine existing images.
we can influence the training and the model drastically. For PyTorch we can find most of these in torchvision.transforms.v2, see the docs for some insights.
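A minimal sketch of such a random training-time pipeline; the chosen transforms and parameter values are purely illustrative, and `v2.GaussianNoise` requires a recent torchvision release.

```python
# A random training-time augmentation pipeline with torchvision.transforms.v2.
import torch
from torchvision.transforms import v2

train_augment = v2.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.RandomHorizontalFlip(p=0.5),                # geometric transformation
    v2.RandomRotation(degrees=10),                 # geometric transformation
    v2.ColorJitter(brightness=0.2, contrast=0.2),  # colour adjustment
    v2.GaussianNoise(sigma=0.05),                  # artificial noise (recent torchvision)
    v2.RandomErasing(p=0.25),                      # random erasing / simulated occlusion
])
```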
The most important aspects influenced by these augmentations are:
Dataset: With data augmentation we can artificially increase the size of our dataset and therefore expose our network to more training data without the often tedious task of labeling. If we apply it in a random fashion we need to make sure that the seed is set, so we can reproduce the changes in case we need a deeper understanding of the mechanics behind it. Overall, by always applying some augmentation, overfitting can be drastically reduced. If, for some reason, your dataset is not balanced and one class is less frequent than it should be, these techniques can also produce more instances of that class. This kind of oversampling is closely related to the Synthetic Minority Oversampling Technique (SMOTE).
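A minimal sketch of fixing the relevant random seeds so that random augmentations stay reproducible; the seed value itself is arbitrary and would typically be logged together with the experiment.

```python
# Fix the random seeds so that random augmentations can be reproduced.
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)        # Python's built-in RNG (used by some libraries)
    np.random.seed(seed)     # NumPy RNG
    torch.manual_seed(seed)  # PyTorch RNG, which also drives transforms.v2
```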
Generalization: With data augmentation we can make sure the model learns features that are invariant to geometric transformations or to, e.g., different zoom levels. Overall, this leads to a better generalization of the network.
Overfitting: Especially for small datasets overfitting is a serious problem, and the augmentation techniques are a reliable tool to mitigate the risk of overfitting to your training set.
Performance: Quite often introducing data augmentations during training also helps to achieve better results on several performance metrics like accuracy.
Simulates real-world variability: When taking an image there are usually variations involved. These might be orientation, occlusions, lighting conditions and so forth. We can simulate these effects via data augmentation.
For instance, if a model is trained to classify images of cats and dogs, data augmentation might include flipping images horizontally, rotating them slightly, or adjusting brightness. This helps the model learn to recognize cats and dogs regardless of orientation or lighting conditions.
11.2 Data Labeling
Labeling data is a critical step in machine learning projects, as the label quality directly influences the performance of the model. For image data the process of labeling involves assigning meaningful annotations to images (what is in the image, or which classes it belongs to), bounding boxes, or masks for objects visible in the image. Furthermore, this information should be easy to process by a machine, and keeping a change log is also recommended.
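One simple machine-readable option is a JSON record per image; all field names below are hypothetical and would follow the conventions of your project.

```python
# A machine-readable annotation record stored as JSON (hypothetical schema).
import json

annotation = {
    "image": "images/cat_0001.png",
    "label": "cat",
    "boxes": [{"x": 34, "y": 20, "width": 120, "height": 96, "class": "cat"}],
    "annotator": "alice",
    "revised": "2024-05-01",  # entry for the change log
}

with open("cat_0001.json", "w") as f:
    json.dump(annotation, f, indent=2)
```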
Depending on the task we set out to achieve, the requirements on labelling can be quite different. The following list covers some tasks and also highlights where specific labels can be problematic.
Image classification: Each image has a single label, e.g., cat or dog. This can cause problems, especially if we decide to move from superclasses like animal, four-legged, or furry to subclasses or specific species of cats.
Object detection: Annotating bounding boxes around objects within an image. This can be tricky when objects overlap each other or for a sequence of images where parts get occluded.
Semantic segmentation: Assigning a class label to each pixel in an image. This can cause problems with class imbalance if e.g. the background is the dominant class; one common countermeasure is to weight the loss by class frequency, see the sketch after this list. Another problem is capturing precise object boundaries, together with the challenges that down- and upsampling introduce here. Furthermore, generalization can be a problem in these tasks if e.g. the background is always black.
Instance segmentation: This builds on semantic segmentation but distinguishes between different individuals of the same class, e.g. two different cats in one image. One obvious challenge here is that even for a human labeller it might not always be clear whether a pixel belongs to one instance or the other if they overlap.
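A minimal sketch of the class weighting mentioned above for semantic segmentation; `masks` is assumed to be an iterable of `(H, W)` integer tensors of class ids, and inverse-frequency weighting is one common choice among several.

```python
# Derive per-class loss weights from pixel frequencies in segmentation masks.
import torch

def class_weights(masks, num_classes: int) -> torch.Tensor:
    counts = torch.zeros(num_classes, dtype=torch.long)
    for mask in masks:
        counts += torch.bincount(mask.flatten(), minlength=num_classes)
    freq = counts.float() / counts.sum()
    weights = 1.0 / (freq + 1e-8)    # rare classes get larger weights
    return weights / weights.sum()   # normalized, e.g. for nn.CrossEntropyLoss(weight=...)
```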
Above we discussed some specific challenges depending on the task, but there are also more general challenges in the context of labelling image data.
Subjectivity: Images allow for some interpretation of what exactly is part of the image or of an instance. This might be where one object begins and another ends, e.g. in satellite data where we need to distinguish between a roof covered with grass and an actual field, not to mention different plants in the background of an image. Furthermore, if we try to label emotions this might be hard due to cultural differences. The same is true for abstract concepts.
Scalability: The larger the dataset, the more time consuming and challenging the labelling process becomes. This is e.g. why old CAPTCHA algorithms employed users to read text to help improve OCR software¹. The task is even more challenging for the pixel-level labels required for segmentation.
Imbalance: This can be more subtle than we often think. E.g. there are not many images of snow leopards, and those that exist might show the same individual, leading to problems we already discussed.
Errors: It is possible that labels are wrong or inconsistent. Such problems can propagate and influence the whole workflow. While we can simply correct an error, it might also be beneficial to keep track of the wrong labels to make sure we capture model drift and also to work against systematic errors.
Cost: The cost of high-quality labelled data was made implicitly clear in the above points, but we need to mention it separately as it can have serious implications for a project.
In light of the above challenges it is a good idea to keep the following checks in mind: start small to establish a process, have labelling guidelines in place to allow for reproducible results, allow for multiple annotators if possible, implement quality control, and leverage pre-labelled data where available.
11.2.1 Tools for Labeling
There are several tools available to help in the labelling process.
- Label Studio: Open-source tool for general data labelling (free, commercial with extended security features available)
- CVAT: Open-source tool specific for computer vision labelling tasks (commercial product available)
- SuperAnnotate: General data labelling (commercial)
- VGG Image Annotator (VIA): Open-source tool for image, audio and video annotation (free)
- Amazon SageMaker Ground Truth: An annotation tool from AWS (commercial)
- Labelbox: General data labelling (commercial)
As Label Studio is one of the most versatile free tools available, we highlight some of its main features as a general guide to how these tools work and what kind of workflows they provide.
Label Studio supports a wide range of labelling/annotation tasks, from image classification via object detection to segmentation tasks. This is handled via multiple types of labels like bounding boxes or polygons, and collaborative processes are supported. We can use tools and assistant systems to facilitate semi-automated labelling, where the human performs fine tuning and checks the results. Furthermore, we can define workflows and have quality control features (consensus scoring) available.
¹ Mix in some known letters or words to check whether the user answers correctly, and collect the answers for the unknown part to build up a statistically significant classification.