11  Data Preparation

Now that we know how to design, train and work with neural networks, we need to talk about one of the most important topics: data. As we have seen throughout these notes, the algorithms need data to learn. This includes labelled and unlabelled data.

There are several topics that need to be addressed when a project does not simply involve training on readily available datasets. We briefly discuss several important topics (some of which we have already encountered) before a more extended discussion of a selection. Quite often a data pipeline consists of the steps listed below (the order may differ).

Note

We list here some general techniques with a focus on image processing. Depending on the domain in question there are often domain-specific challenges to overcome. This includes the domain of processing (computer vision, NLP, time series) but also the domain of your data. The challenges in manufacturing industries are different from those in astrophysics, medicine or metrology.

Besides these steps, which are often part of a data preparation pipeline or directly included in the training, there are other important aspects that greatly influence our outcome.

11.1 Data Augmentation

In the previous sections we used our image data either raw, in the wavelet basis, or, in the case of transfer learning, with some augmentations to make sure it fits the input requirements of the network.

Quite often it is advisable to (randomly) augment the data for training and sometimes also for inference. This has proven to be an important technique for improving several aspects of neural networks, such as robustness, generalization, and resistance to overfitting.

With simple augmentations like:

  • Geometric transformations: rotation, flipping, scaling, cropping, translation, etc.
  • Colour adjustments: brightness, contrast, saturation/hue, monochrome images, etc.
  • Artificial noise: random noise to simulate real-world imperfections like dust, glare, focus, etc.
  • Random erasing: masking out random parts of an image can improve robustness and simulates occlusions.
  • Combination of images: to create new images for training, it is sometimes possible to combine existing images.

we can influence the training and the model drastically. For PyTorch we can find most of these in torchvision.transforms.v2, see the docs for some insights.
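
A minimal sketch of such an augmentation pipeline using torchvision.transforms.v2 (the parameter values are illustrative, not tuned recommendations):

```python
import torch
from torchvision.transforms import v2

# Training-time augmentations: each transform is applied randomly,
# so every epoch sees slightly different versions of the same images.
train_transforms = v2.Compose([
    v2.ToImage(),                                  # PIL image / array -> tensor image
    v2.RandomHorizontalFlip(p=0.5),                # geometric: flipping
    v2.RandomRotation(degrees=15),                 # geometric: small rotations
    v2.ColorJitter(brightness=0.2, contrast=0.2),  # colour adjustments
    v2.ToDtype(torch.float32, scale=True),         # float tensor in [0, 1]
    v2.RandomErasing(p=0.25),                      # mask out a random patch (occlusion)
])
```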

The most important aspects influenced by these augmentations are:

  1. Dataset: With data augmentation we can artificially increase the size of our dataset and therefore expose our network to more training data without the often tedious task of labelling. If we apply augmentations in a random fashion we need to make sure that the seeds are set, so that we can reproduce and track the changes in case we need a deeper understanding of the mechanics behind a result (see the sketch after this list). Overall, by always applying some augmentation, overfitting can be drastically reduced. If, for some reason, your dataset is not balanced and one class is less frequent than it should be, these techniques can also produce more instances of the minority class; this idea is closely related to the synthetic minority oversampling technique (SMOTE).

  2. Generalization: With data augmentation we can make sure the model learns features that are invariant to geometric transformations or to different zoom levels. Overall, this will lead to better generalization of the network.

  3. Overfitting: Especially for small datasets overfitting is a serious problem, and the augmentation techniques are a reliable tool to mitigate the risk of overfitting to your training set.

  4. Performance: Quite often it also helps to introduce data augmentations during training to achieve better results on several performance metrics like accuracy.

  5. Simulates real-world variability: When taking an image there are usually variations involved. These might be orientation, occlusions, lighting conditions and so forth. We can simulate these imperfections via data augmentation.
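
As mentioned in point 1, random augmentations should be reproducible. A minimal sketch of fixing the relevant random number generators (the seed value is arbitrary):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all relevant RNGs so random augmentations can be reproduced."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy (used by some transforms)
    torch.manual_seed(seed)  # PyTorch RNGs (CPU and CUDA)

set_seed(42)  # call once before building datasets and loaders
```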

For instance, if a model is trained to classify images of cats and dogs, data augmentation might include flipping images horizontally, rotating them slightly, or adjusting brightness. This ensures the model learns to recognize cats and dogs regardless of orientation or lighting conditions.

Exercise 11.1 (Use data augmentation to improve our models) Similar to the pipeline used in Chapter 10, create a pipeline that includes some of the data augmentations described above and retrain the models we created, with this pipeline in place for the training set. Can you see any of the benefits described above?
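
As a starting point, here is a sketch of how such augmentations could be wired into the training set only; the directory layout is a made-up assumption, and the evaluation set receives only the deterministic conversion:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import v2

# Random augmentations for training only (as in the sketch above).
train_transforms = v2.Compose([
    v2.ToImage(),
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomRotation(degrees=15),
    v2.ToDtype(torch.float32, scale=True),
])

# Deterministic transforms for evaluation: no randomness.
eval_transforms = v2.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
])

# Hypothetical layout: data/train/<class>/*.png, data/test/<class>/*.png
train_set = datasets.ImageFolder("data/train", transform=train_transforms)
test_set = datasets.ImageFolder("data/test", transform=eval_transforms)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)
```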

11.2 Data Labeling

Labelling data is a critical step in machine learning projects, as the quality of the labels directly influences the performance of the model. For image data the process of labelling involves assigning meaningful annotations to images (what is in the image, or which classes it belongs to), or bounding boxes or masks for objects visible in the image. Furthermore, this information should be easy to process by a machine, and keeping a change log is also recommended.

Depending on the task we set out to achieve, the requirements for labelling can be quite different. The following list covers some tasks and highlights where specific labels can be problematic.

  • Image classification: Each image has a single label, e.g., cat or dog. This can cause problems, especially if we decide to move from superclasses like animal, four-legged, or furry to subclasses or specific species of cats.

  • Object detection: Annotating bounding boxes around objects within an image. This can be tricky when objects overlap each other or for a sequence of images where parts become occluded.

  • Semantic segmentation: Assigning a class label to each pixel in an image. This can cause problems with class imbalance if, e.g., background is the dominant class. Another problem occurs when capturing precise object boundaries, with the challenges of down- and upsampling involved here. Furthermore, generalization can be a problem in these tasks if, e.g., the background is always black.

  • Instance segmentation: This builds on semantic segmentation but distinguishes between different individuals of the same class, e.g. two different cats in one image. One obvious challenge here is that even for a human labeller it might not always be clear if a pixel belongs to one instance or the other when they overlap.
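
To make these label types concrete, here is a sketch of how such annotations are often stored in machine-readable form; the structure loosely follows the COCO convention, and all file names and coordinates are made up:

```python
# One image carrying labels for all four task types.
annotation = {
    "image": "cat_0001.png",   # hypothetical file name
    "class_label": "cat",      # image classification: one label per image
    "objects": [
        {
            "category": "cat",
            "bbox": [34, 50, 120, 180],        # detection: x, y, width, height
            "segmentation": [                  # segmentation: polygon vertices
                [34, 50, 154, 50, 154, 230, 34, 230]
            ],
            "instance_id": 1,                  # instance segmentation: which cat
        }
    ],
}
```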

Above we discussed some specific challenges depending on the task, but there are also more general challenges in the context of labelling image data.

  • Subjectivity: Images allow for some interpretation of what exactly is part of the image or of an instance. This might be where one object begins and another ends, e.g. in satellite data where we need to distinguish between a roof covered with grass and an actual field, not to mention different plants in the background of an image. Furthermore, if we try to label emotions this might be hard due to cultural differences. The same is true for abstract concepts.

  • Scalability: The larger the dataset, the more time-consuming and challenging the labelling process becomes. This is, e.g., why old CAPTCHA algorithms employed users to read text to help improve OCR software1. The task is even more challenging for the pixel-level labels required for segmentation.

  • Imbalance: This can be more subtle than we often think. For example, there are not a lot of images of snow leopards, and those that exist might show the same individual, leading to problems we already discussed. One training-time countermeasure is sketched after this list.

  • Errors: It is possible that labels are wrong or inconsistent. Such problems can propagate and influence the entire workflow. While we can simply correct an error, it might also be beneficial to keep track of the wrong labels to make sure we capture model drift and to work against systematic errors.

  • Cost: The cost of high-quality labelled data was made implicitly clear in the points above, but we need to mention it separately as it can have serious implications for a project.
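
Regarding the imbalance point above: when collecting or labelling more minority-class data is not an option, one common mitigation at training time is to oversample the minority class. A minimal sketch with PyTorch's WeightedRandomSampler (the label tensor is made up):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical labels for an imbalanced binary training set.
targets = torch.tensor([0] * 900 + [1] * 100)

class_counts = torch.bincount(targets)        # samples per class: [900, 100]
sample_weights = 1.0 / class_counts[targets]  # inverse-frequency weight per sample

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(targets),  # draw one epoch's worth of samples
    replacement=True,          # minority samples are drawn multiple times
)
# Pass `sampler=sampler` (instead of shuffle=True) to the DataLoader.
```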

Best practices

In light of the above challenges it is a good idea to keep the following checks in mind:

  • Start small to establish a process.
  • Have labelling guidelines in place to allow for reproducible results.
  • If possible, allow for multiple annotators.
  • Implement quality control.
  • Leverage pre-labelled data if possible.

11.2.1 Tools for Labeling

There are several tools available to help in the labelling process.

  • Label Studio: Open-source tool for general data labelling (free, commercial with extended security features available)
  • CVAT: Open-source tool specifically for computer vision labelling tasks (commercial product available)
  • SuperAnnotate: General data labelling (commercial)
  • VGG Image Annotator (VIA): Open-source tool for image, audio and video annotation (free)
  • Amazon SageMaker Ground Truth: An annotation tool from AWS (commercial)
  • Labelbox: General data labelling (commercial)

As Label Studio is one of the most versatile free tools available, we highlight some of its main features as a general guide to how these tools work and what kind of workflows they provide.

Label Studio supports a wide range of labelling/annotation tasks, from image classification via object detection to segmentation tasks. This is handled via multiple types of labels, like bounding boxes or polygons, and it allows for collaborative processes. We can use tools and assistant systems to facilitate semi-automated labelling, where the human performs fine-tuning and checks the results. Furthermore, we can define workflows and have quality control features (consensus scoring) available.


  1. Mixing some known letters or words to check whether the user answers correctly, while collecting the responses for the unknown part to build up a statistically significant classification.↩︎