Regularization in Deep Learning / Machine Learning - Prevent Overfitting

image source: mlexplained

Overfitting happens in every machine learning (ML) problem. Learning how to deal with overfitting is essential to mastering machine learning. The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best performance possible on the training data (the learning in machine learning), whereas generalization refers to how well the trained model performs on data it has never seen before. The goal of the game is to get good generalization, of course, but you don’t control generalization; you can only adjust the model based on its training data. The process of fighting overfitting this way is called regularization. [1]


How do you know whether a model is overfitting?

A good first check is to measure the error on the training set and on a held-out validation or test set. If the error is low on the training set but high on the validation and test sets, you have likely overfit the model. If both are low, test your model in the wild on unseen data (in production or an A/B test in most domains). [2]
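The check described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the two thresholds, and the per-epoch error values are assumptions made up for the example, not values from the cited sources.

```python
def looks_overfit(train_error, val_error, max_train_error=0.05, max_gap=0.10):
    """Return True when training error is low but validation error is much higher.

    Both thresholds are illustrative assumptions; sensible values depend on the task.
    """
    gap = val_error - train_error
    return train_error <= max_train_error and gap > max_gap


def best_epoch(val_errors):
    """Index of the epoch with the lowest validation error (first one on ties)."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])


# Hypothetical per-epoch error rates, for illustration only.
train_errors = [0.30, 0.18, 0.10, 0.05, 0.02, 0.01]
val_errors = [0.32, 0.22, 0.17, 0.16, 0.19, 0.23]

# Training error keeps falling while validation error turns upward.
overfit_at_end = looks_overfit(train_errors[-1], val_errors[-1])
stop_epoch = best_epoch(val_errors)
```

In this made-up run the validation error is lowest partway through training and rises afterwards, which is the pattern to watch for.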


A few techniques to prevent overfitting

Reduce the network size: The simplest way to prevent overfitting is to reduce the size of the model, that is, the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer). [1]
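To make the link between layer sizes and parameter count concrete, here is a minimal sketch that counts the weights and biases of a fully connected network. The helper name and the two architectures are hypothetical and chosen only for illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the input size followed by the number of units in each layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


# Hypothetical architectures: 10,000 input features, two hidden layers, 1 output.
larger = dense_param_count([10000, 16, 16, 1])
smaller = dense_param_count([10000, 4, 4, 1])
```

Shrinking the hidden layers from 16 to 4 units cuts the parameter count to roughly a quarter here, because the first layer's weight matrix dominates the total.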

Adding weight regularization: By the principle of Occam's razor, given two explanations for something, the explanation most likely to be correct is the simplest one, the one that makes fewer assumptions. This idea also applies to the models learned by neural networks: given some training data and a network architecture, multiple sets of weight values could explain the data. Simpler models are less likely to overfit than complex ones. A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it’s done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

L1 regularization: The cost added is proportional to the absolute value of the weight coefficients.

L2 regularization: The cost added is proportional to the square of the value of the weight coefficients. L2 regularization is also called weight decay in the context of neural networks. [1]
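The two penalties can be sketched directly from their definitions. This is a framework-free illustration, not the API of any particular library; the function names, the strength values, and the example weights are assumptions.

```python
def l1_penalty(weights, strength):
    """Cost proportional to the absolute values of the weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Cost proportional to the squared values of the weights."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus the weight penalties; l1 and l2 are the penalty strengths."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


# Hypothetical values, for illustration only.
weights = [0.5, -2.0, 1.0]
loss_l1 = regularized_loss(1.0, weights, l1=0.1)  # adds 0.1 * (0.5 + 2.0 + 1.0)
loss_l2 = regularized_loss(1.0, weights, l2=0.1)  # adds 0.1 * (0.25 + 4.0 + 1.0)
```

Because the L2 term squares each weight, the single large weight (-2.0) accounts for most of the L2 penalty, while the L1 term charges every weight in proportion to its magnitude.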

Adding Dropout: Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Let’s say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training. After applying dropout, this vector will have a few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1].
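A minimal sketch of dropout on the example vector is given below. It assumes the common "inverted dropout" variant, which rescales the surviving values during training so that nothing needs rescaling at test time; the text above does not specify a variant, and the function name, the rate of 0.4, and the seed are illustrative choices.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors.

    Dividing the kept values by (1 - rate) is the "inverted dropout" variant,
    an assumption of this sketch; it keeps the expected output unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
dropped = dropout(layer_output, rate=0.4, rng=rng)
```

Dropout is applied only during training; at inference time the layer output is used as is.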

Data Augmentation: Another straightforward way to reduce overfitting is to increase the size of the training data. Suppose we are dealing with images. In this case, there are a few ways of increasing the size of the training data: rotating, flipping, scaling, or shifting the image, and so on. In the image below, some transformations have been applied to the handwritten digits dataset. This technique is known as data augmentation, and it often noticeably improves the accuracy of the model. [3]
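Two of the transformations mentioned above, flipping and shifting, can be sketched on a tiny image stored as nested lists. The function names and the 3x3 "image" are hypothetical; real pipelines typically rely on an image library instead.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels` columns, padding the left with `fill`."""
    if pixels <= 0:
        return [row[:] for row in image]
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[: width - pixels] for row in image]


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, flip_horizontal(image), shift_right(image, 1)]


# A tiny hypothetical 3x3 "image" of pixel intensities.
image = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
augmented = augment(image)
```

Each training image yields three samples here. Whether a transformation keeps the label valid depends on the data: a mirrored handwritten digit, for instance, may no longer look like the same digit, so transformations should be chosen with the dataset in mind.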



References: 

1. Book: Deep Learning with Python by François Chollet.

2. quora.com 

3. analyticsvidhya.com 
