Regularization: any modification we make to a learning algorithm to reduce its generalization error but not its training error.
Many regularization strategies:
- Put extra constraints on ML model (parameter values)
- Add extra terms in objective function (viewed as soft constraint on parameter values)
Constraints/penalties in regularization are designed to:
- Encode specific kinds of prior knowledge
- Express a generic preference for a simpler model class
- Make an underdetermined problem determined
- Combine multiple hypotheses that explain the training data (e.g., ensemble methods)
In DL, most regularization strategies are based on regularizing estimators
- Trading increased bias for reduced variance
We are always trying to fit a square peg (the data-generating process) into a round hole (our model family)
- Limit the capacity of model
- For NN, we typically penalize only the weights of the affine transformation at each layer and leave the biases unregularized
- Biases typically require less data than weights to fit accurately
- Each weight specifies how two variables interact
- Each bias controls only a single variable
- We do not induce too much variance by leaving biases unregularized
- Regularizing bias can introduce a significant amount of underfitting
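A minimal sketch of penalizing weights but not biases, assuming a squared-norm penalty with coefficient `alpha` and a hypothetical model stored as a list of `(W, b)` pairs, one per layer (both names are illustrative, not from the notes):

```python
def l2_weight_penalty(layers, alpha):
    """Return 0.5 * alpha * (sum of squared weights); biases are skipped."""
    total = 0.0
    for W, _b in layers:
        # W is a list of rows; only the weight entries enter the penalty.
        total += sum(w * w for row in W for w in row)
    return 0.5 * alpha * total


# Two made-up layers: (weight matrix, bias vector).
layers = [
    ([[1.0, -2.0], [0.5, 0.0]], [10.0, -10.0]),
    ([[3.0], [1.0]], [5.0]),
]
penalty = l2_weight_penalty(layers, alpha=0.1)

# Same weights, very different biases: the penalty should not change.
layers_big_bias = [(W, [100.0 * v for v in b]) for W, b in layers]
penalty_big_bias = l2_weight_penalty(layers_big_bias, alpha=0.1)
```

The penalty would be added to the data loss before taking gradients, so the biases receive gradient from the data term only.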
L2 norm penalty known as weight decay.
Adding a weight decay term multiplicatively shrinks the weight vector by a constant factor on each step, just before performing the usual gradient update.
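A small sketch of why the shrinkage appears, assuming plain gradient descent with learning rate `lr` on the objective J(w) + (alpha / 2) * ||w||^2. The function names and the gradient values are made up for illustration:

```python
def step_with_penalty(w, grad, lr, alpha):
    """One gradient step on J(w) + (alpha / 2) * ||w||^2.

    `grad` is the gradient of the unregularized loss J at w.
    """
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad)]


def step_with_decay(w, grad, lr, alpha):
    """Shrink w by the constant factor (1 - lr * alpha), then do the usual update."""
    shrink = 1.0 - lr * alpha
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad)]


w = [1.0, -2.0, 0.5]
grad = [0.1, 0.3, -0.2]  # made-up gradient of the unregularized loss

penalized = step_with_penalty(w, grad, lr=0.1, alpha=0.01)
decayed = step_with_decay(w, grad, lr=0.1, alpha=0.01)
same_update = all(abs(a - b) < 1e-12 for a, b in zip(penalized, decayed))
```

The two functions are the same update written two ways. The factor is constant only while `lr` and `alpha` are fixed.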
Many regularization strategies can be viewed as MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on weights.
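The equivalence can be checked in a few lines, assuming a zero-mean isotropic Gaussian prior with precision \alpha (the symbols X, y, w and \alpha are notation chosen here, not taken from the notes):

```latex
\begin{aligned}
w_{\text{MAP}} &= \arg\max_{w} \; \log p(y \mid X, w) + \log p(w),
  \qquad p(w) = \mathcal{N}\!\left(w;\, 0,\, \tfrac{1}{\alpha} I\right) \\
\log p(w) &= -\tfrac{\alpha}{2}\, w^{\top} w + \text{const} \\
w_{\text{MAP}} &= \arg\min_{w} \; -\log p(y \mid X, w) + \tfrac{\alpha}{2}\, w^{\top} w
\end{aligned}
```

The last line is the negative log-likelihood plus an L2 penalty with coefficient \alpha, so a tighter prior (larger \alpha) corresponds to stronger weight decay.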
Read section 4.4 for constructing a generalized Lagrange function.
Have more data. Create fake data.
Injecting noise in the input
- NNs prove not to be very robust to noise.
- Simply train them with random noise applied to their inputs
- Apply noise to hidden units (can be viewed as augmenting data at multiple levels of abstraction)
- Dropout (can be viewed as way to create new inputs by multiplying by noise)
- Can be much more powerful than simply shrinking parameters, especially when noise is added to hidden units
- Add noise to weights (used primarily in RNN)
- Can be viewed as a stochastic implementation of Bayesian inference over weights
- Bayesian treatment of learning considers model weights to be uncertain and representable via a probability distribution that reflects this uncertainty
- Adding noise to weights is a practical, stochastic way to reflect this uncertainty
- Add noise to output targets
- Assume for a small constant e, the training set label y is correct with probability 1-e, and otherwise any of the other possible labels might be correct
- This assumption is easy to incorporate into cost function analytically
- Label smoothing
- Pros: prevent pursuit of hard probabilities without discouraging correct classification
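A minimal sketch of label smoothing for a k-class problem, assuming the common choice of target 1 - e on the labelled class and e / (k - 1) on every other class. Here `eps` stands for the small constant e, and the function names and example probabilities are illustrative:

```python
import math


def smooth_labels(label, num_classes, eps):
    """Soft target: 1 - eps on the labelled class, eps / (k - 1) elsewhere."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == label else off_value for k in range(num_classes)]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


target = smooth_labels(label=2, num_classes=4, eps=0.1)

# A prediction that matches the soft target vs. one pushed toward a hard 0/1 output.
matched = list(target)
overconfident = [1e-6, 1e-6, 1.0 - 3e-6, 1e-6]

loss_matched = cross_entropy(target, matched)
loss_overconfident = cross_entropy(target, overconfident)
```

With the smoothed target, the overconfident prediction has the higher loss, yet the labelled class still carries the largest target value. That matches the listed advantage: hard probabilities are not pursued, while correct classification is still rewarded.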
For some models, adding noise with infinitesimal variance at input of model is equivalent to imposing a penalty on norm of weights.
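One model where this can be checked directly is linear regression with squared error: for zero-mean input noise with covariance sigma^2 * I, the expected noisy loss equals the clean loss plus sigma^2 * ||w||^2. The sketch below assumes that setting and, to get an exact average without sampling, swaps Gaussian noise for noise whose entries are +sigma or -sigma with equal probability (same mean and covariance). All names and numbers are made up:

```python
from itertools import product


def squared_error(w, x, y):
    """Squared error of the linear prediction w . x against target y."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2


def expected_noisy_error(w, x, y, sigma):
    """Average squared error over all noise vectors with entries +/- sigma.

    This discrete noise has zero mean and covariance sigma^2 * I, so it
    stands in for isotropic Gaussian noise while allowing an exact average.
    """
    sign_patterns = list(product((-1.0, 1.0), repeat=len(x)))
    total = 0.0
    for signs in sign_patterns:
        noisy_x = [xi + sigma * si for xi, si in zip(x, signs)]
        total += squared_error(w, noisy_x, y)
    return total / len(sign_patterns)


w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
y = 4.0
sigma = 0.1

noisy_loss = expected_noisy_error(w, x, y, sigma)
penalized_loss = squared_error(w, x, y) + sigma ** 2 * sum(wi ** 2 for wi in w)
```

`noisy_loss` and `penalized_loss` agree up to floating-point rounding. For nonlinear models the correspondence is only approximate, which is where the infinitesimal-variance condition matters.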