Training/Dev/Test Set
What is the training set / dev set / test set?
In the traditional methodology, when we have a small dataset we can use a 60-20-20 split to get the training set, the dev set (also called the validation set), and the test set.
When we have big data, it is fine for the dev set and the test set to be much less than 10 or 20 percent of the data; even a 98-1-1 split is fine.
One rule of thumb: the dev set and the test set should come from the same distribution.
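A minimal sketch of such a split with NumPy; the 98-1-1 ratio and the array names here are just for illustration:

```python
import numpy as np

# Assume X has shape (num_examples, num_features) and y has shape (num_examples,)
num_examples = 100_000
X = np.random.rand(num_examples, 20)
y = np.random.randint(0, 2, size=num_examples)

# Shuffle once so all three sets come from the same distribution
perm = np.random.permutation(num_examples)
X, y = X[perm], y[perm]

# 98-1-1 split for a large dataset
n_train = int(0.98 * num_examples)
n_dev = int(0.01 * num_examples)

X_train, y_train = X[:n_train], y[:n_train]
X_dev, y_dev = X[n_train:n_train + n_dev], y[n_train:n_train + n_dev]
X_test, y_test = X[n_train + n_dev:], y[n_train + n_dev:]
```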
Bias and Variance
Bias refers to a high error rate on the training set. It is usually due to underfitting, and we can address it by changing the neural network architecture (for example, a bigger network) or training for more iterations. Variance refers to a high error rate on the dev set relative to the training set. It is usually due to overfitting, and it can be reduced by getting more data or by using regularization. For example, 1% training error with 11% dev error suggests high variance, while 15% training error with 16% dev error suggests high bias (assuming near-0% human-level error).
The bias-variance trade-off means reducing one without increasing the other. Regularization is used to reduce variance. It may hurt bias a little, but not by much if we have a big enough network.
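A rough sketch of this diagnosis, assuming we already have the training and dev error rates; the 5% thresholds and the function name are just illustrative:

```python
def diagnose_bias_variance(train_error, dev_error, optimal_error=0.0):
    """Rough diagnosis from error rates in [0, 1]; optimal_error is the best
    achievable (e.g. human-level) error, assumed ~0 here."""
    high_bias = (train_error - optimal_error) > 0.05      # illustrative threshold
    high_variance = (dev_error - train_error) > 0.05      # illustrative threshold
    return high_bias, high_variance

print(diagnose_bias_variance(0.01, 0.11))  # (False, True)  -> high variance (overfitting)
print(diagnose_bias_variance(0.15, 0.16))  # (True, False)  -> high bias (underfitting)
print(diagnose_bias_variance(0.15, 0.30))  # (True, True)   -> both
```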
L2 Regularization - For the variance problem
It is used to avoid overfitting in the network (see the equations below).
In a neural network, the L2 penalty on the weight matrices is also called the Frobenius norm.
It is also known as 'weight decay', because with L2 regularization the gradient descent update multiplies each weight matrix by a factor slightly less than 1 (namely 1 - alpha*lambda/m) before subtracting the usual backprop gradient, so the weights shrink a little on every step.
In this method we add an extra term to the cost, scaled by a 'regularization parameter' called lambda. So when tuning hyperparameters we should consider this one as well.
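A sketch of the equations these notes refer to, using the course's usual notation (m training examples, learning rate alpha, weight matrices W^[l] for layers l = 1..L):

```latex
% L2-regularized cost: average loss plus the Frobenius-norm penalty over all layers
J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
    + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2 ,
\qquad
\lVert W^{[l]} \rVert_F^2 = \sum_{i}\sum_{j} \big( W^{[l]}_{ij} \big)^2

% Gradient descent update: the (1 - alpha*lambda/m) factor is the "weight decay"
W^{[l]} := W^{[l]} - \alpha \Big( dW^{[l]}_{\text{backprop}} + \frac{\lambda}{m} W^{[l]} \Big)
         = \Big( 1 - \frac{\alpha \lambda}{m} \Big) W^{[l]} - \alpha \, dW^{[l]}_{\text{backprop}}
```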
Why L2 regularization helps:
* It penalizes the neural network for having large weights.
* The main idea is that a bigger network (with large, unconstrained weights) tends to overfit the data.
* Hence L2 regularization pushes the weights towards zero, or in clearer terms it reduces the effect of the weights, making the network behave like a smaller one.
* L2 regularization is a very powerful technique and is used in most deep learning work.
* When we plot the cost against the number of gradient descent iterations, with the regularization term included in the plotted cost, we should see the cost decrease monotonically (see the code sketch after this list).
* (Add an image explaining how this reduction happens.)
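The regularized cost above is the quantity we would plot against iterations. A minimal NumPy sketch of that cost and the corresponding gradient; the function names, the toy weight shapes, and the lambda value are just for illustration:

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Cross-entropy cost plus the Frobenius-norm penalty (lambda / 2m) * sum ||W[l]||^2."""
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty

def l2_regularized_grad(dW_backprop, W, lambd, m):
    """Gradient of the regularized cost w.r.t. W: the backprop term plus (lambda / m) * W."""
    return dW_backprop + (lambd / m) * W

# Toy usage: two layers of weights, lambda = 0.7, m = 1000, unregularized cost 0.35
W1, W2 = np.random.randn(4, 3), np.random.randn(1, 4)
print(l2_regularized_cost(0.35, [W1, W2], lambd=0.7, m=1000))
```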
Dropout regularization
This is another very powerful regularization method. We can implement dropout in different ways; one of them is inverted dropout. In inverted dropout, some of the hidden units (and their connections) are removed from the network at random, each unit being kept with a given probability (the keep probability).
(Images of how it works)
Another practical point: for different training examples (and different iterations), different nodes are set to zero. That randomness is what dropout is.
Dropout
In dropout, no extra regularization term is added to the cost function; we are just eliminating random nodes. Dropout is mainly used in computer vision, because in computer vision there is usually not much data available relative to the size of the inputs, so practitioners expect overfitting and routinely add dropout layers.
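A minimal sketch of inverted dropout for a single layer, using NumPy; a3 here stands for the activations of some hidden layer (shape: units x examples) and keep_prob is the probability of keeping each unit, both names chosen just for illustration:

```python
import numpy as np

keep_prob = 0.8                     # probability of keeping each hidden unit
a3 = np.random.rand(50, 32)         # example activations: 50 units x 32 examples

# Forward pass with inverted dropout (training time only)
d3 = np.random.rand(*a3.shape) < keep_prob   # boolean mask: True = keep the unit
a3 = a3 * d3                                 # zero out the dropped units
a3 = a3 / keep_prob                          # scale up so the expected value of a3 is unchanged

# At test time no dropout is applied and no extra scaling is needed,
# because the division by keep_prob already compensated during training.
```

During backpropagation the same mask d3 is applied to the corresponding gradients (with the same division by keep_prob), so dropped units contribute nothing to the update.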
Data augmentation
If the neural network is overfitting, one way to fix it is to add more data. But, for example, in computer vision the amount of available data may be limited, so we can perform different operations on the existing images, like flipping them horizontally, to increase the size of the training set. This is called data augmentation.
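A minimal sketch of one such operation (horizontal flipping) with NumPy, assuming the images are stored as an array of shape (num_examples, height, width, channels); the array names are just for illustration:

```python
import numpy as np

# A toy batch of 8 RGB images, 64x64 pixels each (num_examples, height, width, channels)
X_train = np.random.rand(8, 64, 64, 3)

# Horizontal flip: reverse the width axis of every image
X_flipped = X_train[:, :, ::-1, :]

# Append the flipped copies to double the size of the training set
# (the labels are simply duplicated, since flipping does not change the class)
X_augmented = np.concatenate([X_train, X_flipped], axis=0)
```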
Early stopping
Early stopping refers to stopping the training of the neural network early so that the weights of the network stay small. Since we initialize the weights to small values, after a small number of steps the weights are still close to those small initial values, so if we stop the training there the effect is similar to L2 regularization and it helps reduce overfitting. But this is not a good approach, since it breaks the orthogonalization principle of tuning a DNN, that is, using separate knobs for separate goals (optimizing the cost versus not overfitting). In the course, Andrew Ng prefers L2 regularization, although searching for a good lambda is a costly procedure.
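A rough sketch of early stopping driven by the dev-set error; initialize_params, train_one_epoch, and compute_dev_error are hypothetical placeholders for your own code, and the patience and epoch counts are illustrative:

```python
import copy

max_epochs = 100
patience = 5                      # stop after 5 epochs without dev-set improvement
best_dev_error = float("inf")
best_params = None
epochs_without_improvement = 0

params = initialize_params()                                  # hypothetical
for epoch in range(max_epochs):
    params = train_one_epoch(params, X_train, y_train)        # hypothetical training step
    dev_error = compute_dev_error(params, X_dev, y_dev)       # hypothetical evaluation

    if dev_error < best_dev_error:
        best_dev_error = dev_error
        best_params = copy.deepcopy(params)                   # remember the best weights so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                              # stop training early

params = best_params   # keep the weights from the epoch with the lowest dev-set error
```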