12 Common Challenges
This concludes our introduction to neural networks. We introduced the basic concept, how it relates back to regression, and how it can be extended to outperform regression methods. The illustrations may not always be the most elaborate, but they convey the concepts and help get started in the topic. We also covered some specific architectures and types of neural networks, such as CNNs and autoencoders.
Nevertheless, there are some topics we could not cover that often surface when working with neural networks. We use this space to discuss some of them briefly. The idea is not a comprehensive discussion but rather to give some guidance if problems occur. See Geron (2022) for more details on many of the subjects mentioned and for the original references.
Vanishing/Exploding gradients: The workhorse of backpropagation is the (stochastic) gradient descent method. A problem that can occur is that the gradient becomes smaller and smaller as it is propagated back through the layers, so the weight updates become almost zero during training. The opposite can also happen: the gradient grows until it explodes. These problems are often cited as one of the main reasons neural networks were abandoned in the early 2000s. Geron (2022) gives a nice introduction in the corresponding chapter. In short, the combination of activation function and weight initialization caused the problem in some cases. New weight initialization schemes mitigated it, so it is no longer such a huge problem; if you encounter it, change the weight initialization or rethink the choice of activation functions.
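As a minimal sketch, assuming a Keras/TensorFlow setup (the framework is our assumption, not prescribed by these notes), this pairs ReLU activations with He initialization, a common combination that keeps the gradient scale roughly stable across layers:

```python
import tensorflow as tf

# ReLU-family activations combined with He initialization help avoid
# vanishing/exploding gradients in deep stacks of Dense layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(1)  # e.g. a single regression output
])
```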
Dying ReLUs: The vanishing gradient problem helped ReLU rise to prominence, as it does not saturate like the sigmoid. Nevertheless, there is another problem where some neurons die during training (they output only 0). This happens when the weights and bias always produce a negative number for any input. To mitigate this problem, leaky ReLU was introduced. The so-called SELU and ELU are alternatives as well.
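A short sketch of these alternatives, again assuming Keras/TensorFlow; the layer sizes are arbitrary:

```python
import tensorflow as tf

# Leaky ReLU keeps a small negative slope, so the gradient never becomes
# exactly zero for negative inputs; ELU and SELU are drop-in alternatives.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),            # leaky ReLU applied after a linear Dense layer
    tf.keras.layers.Dense(100, activation="elu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(100, activation="selu",
                          kernel_initializer="lecun_normal"),  # SELU expects LeCun init
    tf.keras.layers.Dense(1)
])
```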
Batch Normalization: Even with better weight initialization and ReLU activations, vanishing gradients can still happen. As an additional safeguard, batch normalization was introduced. After the activation, it normalizes the values over a batch of training data, and the network learns the optimal scale and shift for them. It is also often used as the first layer, where it helps prepare the data and makes sure everything is in an expected range.
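A minimal sketch of both placements, assuming Keras/TensorFlow:

```python
import tensorflow as tf

# BatchNormalization standardizes activations over each batch and learns a
# scale and shift; used as the first layer it also standardizes the raw inputs.
model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),   # normalize the input features
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),   # normalize after the activation
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1)
])
```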
Model/Data Drift: Over the lifetime of a model, its output or performance can degrade. This might be due to a shift in the input data that is not known or accounted for: consider, for example, changing the camera or updating its software. The inputs might then change in a way we do not notice with the naked eye but the model perceives. Retraining the model on new data will most likely fix the problem. How often retraining is required is hard to predict, so monitoring the performance of the model is key.
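Monitoring can be as simple as comparing a recent error metric against the error seen at validation time. The helper below is a hypothetical sketch; the function name, threshold, and numbers are illustrative only:

```python
import numpy as np

def performance_alert(recent_errors, baseline_error, tolerance=1.2):
    """Flag possible drift when the recent average error exceeds the
    baseline (validation-time) error by more than `tolerance` times."""
    return np.mean(recent_errors) > tolerance * baseline_error

# e.g. errors collected weekly from production predictions with known labels
recent_errors = [0.11, 0.13, 0.18, 0.21]
if performance_alert(recent_errors, baseline_error=0.10):
    print("Performance degraded - consider retraining on fresh data.")
```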
Overfitting: This is a serious problem that we addressed throughout the notes; it is listed here because it is often the cause of poor performance and poor generalization on unseen data. There are some common strategies to mitigate it, such as data augmentation, regularization, and early stopping.
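As a small sketch of two of these strategies for image data, assuming Keras/TensorFlow, random augmentation layers create new training variants on the fly and dropout randomly disables neurons during training:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # data augmentation, active only in training
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),               # randomly drop activations during training
    tf.keras.layers.Dense(10, activation="softmax")
])
```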
Underfitting: Our model can also suffer from underfitting, i.e. poor performance already on the training data. Try training for longer, engineering better features, and, if this does not help, increasing the complexity of the model.
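Increasing complexity can simply mean adding more units or layers, as in this sketch (again assuming Keras/TensorFlow; the sizes are arbitrary):

```python
import tensorflow as tf

# A small model that may underfit ...
small_model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1)
])

# ... and a wider, deeper variant with more capacity.
larger_model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1)
])
```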