Regularization: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Many regularization strategies:

- Put extra constraints on the ML model (e.g., restrictions on parameter values)
- Add extra terms to the objective function (can be viewed as a soft constraint on parameter values)

Constraints/penalties in regularization are designed to:

- Encode specific kinds of prior knowledge
- Express a generic preference for a simpler model class
- Make an underdetermined problem determined
- Combine multiple hypotheses that explain the training data (e.g., ensemble methods)

In DL, most regularization strategies are based on regularizing estimators

- Trading increased bias for reduced variance

We are always trying to fit a square peg (the data-generating process) into a round hole (our model family)

## Strategy: Parameter norm penalties (L1, L2)

- Limit the capacity of the model
- For NNs, we typically penalize only the weights of the affine transformation at each layer and leave the biases unregularized
  - Biases typically require less data than weights to fit accurately
    - Each weight specifies how two variables interact
    - Each bias controls only a single variable
  - Leaving biases unregularized does not induce too much variance
  - Regularizing biases can introduce a significant amount of underfitting
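
As a sketch of what a parameter norm penalty looks like (the symbols here are assumed notation, not taken from the notes above): let $J$ be the unregularized objective, $w$ the weights only (biases excluded), and $\alpha \ge 0$ a hyperparameter that scales the penalty $\Omega$.

$$
\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(w),
\qquad
\Omega_{L2}(w) = \tfrac{1}{2} \lVert w \rVert_2^2,
\qquad
\Omega_{L1}(w) = \lVert w \rVert_1
$$

Setting $\alpha = 0$ recovers the unregularized objective; larger $\alpha$ means stronger regularization.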

The L2 norm penalty is commonly known as weight decay.

Adding the weight decay term shrinks the weight vector by a constant multiplicative factor on each step, just before performing the usual gradient update.
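
A minimal sketch of that shrink-then-update view, assuming plain gradient descent with learning rate `lr` on the objective plus an L2 penalty `(alpha / 2) * ||w||^2`. The function names and the list-of-floats representation are hypothetical, chosen only for illustration.

```python
def step_shrink_then_update(w, grad, lr, alpha):
    """Weight decay view: shrink w by a constant factor, then take the usual step."""
    shrink = 1.0 - lr * alpha  # same factor for every weight on every step
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad)]


def step_penalized_gradient(w, grad, lr, alpha):
    """Direct view: the gradient of the penalized objective is grad + alpha * w."""
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad)]


w = [1.0, -2.0]
grad = [0.5, 0.5]  # gradient of the unregularized objective (made-up values)
a = step_shrink_then_update(w, grad, lr=0.1, alpha=0.01)
b = step_penalized_gradient(w, grad, lr=0.1, alpha=0.01)
```

Both functions compute the same update; the first just makes the multiplicative shrinkage explicit.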

Many regularization strategies can be viewed as MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.
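
A short derivation of that equivalence, assuming a zero-mean Gaussian prior with covariance $\tfrac{1}{\alpha} I$ (this parameterization is an assumption made here so that the result matches an L2 penalty with coefficient $\alpha$):

$$
w_{\text{MAP}} = \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right],
\qquad
p(w) = \mathcal{N}\!\left(w;\, 0,\, \tfrac{1}{\alpha} I\right)
$$

$$
\log p(w) = -\tfrac{\alpha}{2} \, w^\top w + \text{const}
\quad\Longrightarrow\quad
w_{\text{MAP}} = \arg\min_w \left[ -\log p(y \mid X, w) + \tfrac{\alpha}{2} \lVert w \rVert_2^2 \right]
$$

So maximizing the posterior is the same as minimizing the negative log-likelihood plus an L2 penalty.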

Read section 4.4 for constructing a generalized Lagrange function.

## Strategy: Dataset augmentation

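The best way to make a model generalize better is to train it on more data. When data is limited, create fake data and add it to the training set.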
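
A minimal sketch of creating fake data for image classification, assuming an image is a list of pixel rows and that the chosen transformations do not change the label. That assumption fails for some tasks: a horizontal flip turns a "b" into a "d". All function names here are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels=1, fill=0):
    """Translate the image right by `pixels`, padding the left edge with `fill`."""
    return [([fill] * pixels + list(row))[:len(row)] for row in image]


def augment(image, label, rng):
    """Return a randomly transformed copy of the image with the label unchanged."""
    out = [list(row) for row in image]
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    out = shift_right(out, pixels=rng.randint(0, 1))
    return out, label


rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6]]
fake_image, fake_label = augment(image, "cat", rng)
```

Each call to `augment` yields a new training pair, so one labelled image can stand in for several.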

## Strategy: Noise injection

Injecting noise in the input

- NNs prove not to be very robust to noise
- One way to improve robustness: simply train them with random noise applied to their inputs
- Apply noise to hidden units (can be viewed as augmenting data at multiple levels of abstraction)
- Dropout (can be viewed as way to create new inputs by multiplying by noise)
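
A minimal sketch of both kinds of noise, assuming additive Gaussian noise on the inputs and multiplicative Bernoulli noise (dropout) on hidden units. The rescaling by `1 / keep_prob` is the common "inverted dropout" convention and is an assumption here, not something stated in these notes. Function and parameter names are hypothetical.

```python
import random


def add_gaussian_noise(x, std, rng):
    """Additive noise: perturb each input with zero-mean Gaussian noise."""
    return [xi + rng.gauss(0.0, std) for xi in x]


def dropout(h, keep_prob, rng):
    """Multiplicative noise: zero each unit with probability 1 - keep_prob.

    Surviving units are scaled by 1 / keep_prob so the expected activation
    is unchanged (inverted dropout convention, assumed here).
    """
    return [hi / keep_prob if rng.random() < keep_prob else 0.0 for hi in h]


rng = random.Random(0)
noisy_input = add_gaussian_noise([0.2, -1.0, 3.5], std=0.1, rng=rng)
dropped = dropout([1.0] * 1000, keep_prob=0.8, rng=rng)
```

Noise of this kind is applied only during training; at test time the clean inputs and activations are used.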

- Noise injection can be much more powerful than simply shrinking parameters, especially when noise is added to hidden units
- Add noise to weights (used primarily in RNNs)
  - Can be viewed as a stochastic implementation of Bayesian inference over weights
  - The Bayesian treatment of learning considers model weights to be uncertain and representable via a probability distribution that reflects this uncertainty
  - Adding noise to weights is a practical, stochastic way to reflect this uncertainty
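
A minimal sketch of weight noise, assuming a single linear unit and fresh zero-mean Gaussian noise drawn for the weights on every training forward pass. The names `noisy_weights` and `linear_forward` are hypothetical, and a real RNN would apply the same idea to its weight matrices.

```python
import random


def noisy_weights(weights, std, rng):
    """Draw one random perturbation of the weights."""
    return [w + rng.gauss(0.0, std) for w in weights]


def linear_forward(weights, x, std, rng, training=True):
    """Dot product w . x, using perturbed weights only during training."""
    w = noisy_weights(weights, std, rng) if training else weights
    return sum(wi * xi for wi, xi in zip(w, x))


rng = random.Random(0)
weights = [0.5, -0.25]
x = [1.0, 2.0]
clean = linear_forward(weights, x, std=0.1, rng=rng, training=False)
samples = [linear_forward(weights, x, std=0.1, rng=rng) for _ in range(2000)]
mean_noisy = sum(samples) / len(samples)
```

Each training pass sees a different draw of the weights, which is the sense in which the procedure treats them as uncertain; averaged over many draws, the output of this linear unit matches the noise-free output.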

- Add noise to output targets
  - Assume that, for a small constant e, the training set label y is correct with probability 1-e, and otherwise any of the other possible labels might be correct
  - This assumption is easy to incorporate into the cost function analytically
  - Label smoothing: replace the hard 0 and 1 targets of a k-class softmax with e/(k-1) and 1-e
  - Pros: prevents pursuit of hard probabilities without discouraging correct classification
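
A minimal sketch of label smoothing for a `num_classes`-way classifier, assuming the true class gets target `1 - eps` and the remaining `eps` is split evenly over the other classes. Function names are hypothetical.

```python
import math


def smooth_labels(true_class, num_classes, eps):
    """Soft targets: 1 - eps for the true class, eps / (num_classes - 1) otherwise."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off_value
            for k in range(num_classes)]


def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)


targets = smooth_labels(true_class=1, num_classes=3, eps=0.1)
loss = cross_entropy(targets, probs=[0.2, 0.7, 0.1])
```

With soft targets the loss is minimized at finite logits, so the model is not pushed toward ever-larger weights in pursuit of exact 0 and 1 probabilities.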

For some models, adding noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights.
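
One simple case where this can be checked by hand is linear regression with squared error. Assume a prediction $\hat{y} = w^\top x$ and input noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ drawn independently of $x$ and $y$ (the notation is assumed here for illustration):

$$
\mathbb{E}_\epsilon\!\left[ \left( w^\top (x + \epsilon) - y \right)^2 \right]
= (w^\top x - y)^2 + 2\,(w^\top x - y)\, w^\top \mathbb{E}[\epsilon] + \mathbb{E}\!\left[ (w^\top \epsilon)^2 \right]
= (w^\top x - y)^2 + \sigma^2 \lVert w \rVert_2^2
$$

The cross term vanishes because $\mathbb{E}[\epsilon] = 0$, and $\mathbb{E}[(w^\top \epsilon)^2] = w^\top \mathbb{E}[\epsilon \epsilon^\top] w = \sigma^2 w^\top w$. The expected loss under input noise is therefore the clean loss plus an L2 penalty on the weights with coefficient $\sigma^2$.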