# Reading notes: Regularization techniques in Deep Learning (part 1)

Regularization: any modification we make to a learning algorithm to reduce its generalization error but not its training error.

Many regularization strategies:

• Put extra constraints on ML model (parameter values)
• Add extra terms in objective function (viewed as soft constraint on parameter values)

Constraints/penalties in regularization are designed to:

• Encode specific kinds of prior knowledge
• Express a generic preference for a simpler model class
• Make an underdetermined problem determined
• Combine multiple hypotheses that explain the training data (eg, ensemble methods)

In DL, most regularization strategies are based on regularizing estimators

• Trading increased bias for reduced variance

We are always trying to fit a square peg (the data-generating process) into a round hole (our model family)

## Strategy: Parameter norm penalties (L1, L2)

• Limit the capacity of model
• For NNs, we typically penalize only the weights of the affine transformation at each layer and leave the biases unregularized
• Bias requires less data than weights to fit accurately
• Each weight specifies how two variables interact
• Each bias controls only a single variable
• We do not induce too much variance by leaving biases unregularized
• Regularizing bias can introduce a significant amount of underfitting

The L2 norm penalty is known as weight decay.

Adding the weight decay term shrinks the weight vector by a constant factor on each step, just before performing the usual gradient update.
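A minimal sketch of that update on a toy least-squares problem (the data, learning rate, and decay coefficient below are all made up for illustration):

```python
import numpy as np

# Toy least-squares problem: L(w) = (1/2n) * ||X w - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

w = np.zeros(3)
lr, alpha = 0.1, 0.01  # learning rate and weight-decay coefficient

for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)  # gradient of the unregularized loss
    w *= 1 - lr * alpha                # shrink by a constant factor first...
    w -= lr * grad                     # ...then take the usual gradient step
```

The fixed point of this iteration solves the ridge-regression equations, which is how the multiplicative shrinkage and the L2 penalty connect.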

Many regularization strategies can be viewed as MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.
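A one-line sketch of the equivalence: with a zero-mean Gaussian prior $p(w) = \mathcal{N}(w; 0, \sigma^2 I)$, MAP inference maximizes

```latex
w_{\text{MAP}}
  = \arg\max_{w}\, \log p(y \mid X, w) + \log p(w)
  = \arg\max_{w}\, \log p(y \mid X, w) - \frac{1}{2\sigma^2}\lVert w \rVert_2^2 + \text{const},
```

so the Gaussian prior contributes exactly an L2 penalty, with coefficient proportional to $1/\sigma^2$.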

Read section 4.4 for constructing a generalized Lagrange function.

## Strategy: Dataset augmentation

Have more data. When more data is not available, create fake data (transformed copies of existing examples) and add it to the training set.
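A minimal sketch of creating fake data, assuming a batch of grayscale images of shape `(n, height, width)` whose labels are invariant to flips and small shifts; the particular transformations are illustrative, not prescriptive:

```python
import numpy as np

def augment(images, rng):
    """Return transformed copies of a batch: random horizontal flips
    plus a small horizontal translation."""
    out = images.copy()
    flip = rng.random(len(out)) < 0.5   # flip roughly half the batch
    out[flip] = out[flip, :, ::-1]      # mirror those images left-right
    shift = int(rng.integers(-2, 3))    # shift the batch by up to 2 pixels
    return np.roll(out, shift, axis=2)

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 28, 28))
fake = augment(batch, rng)
```

Both transformations only permute pixels, so the fake examples stay on the same value distribution as the originals.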

## Strategy: Noise injection

Injecting noise in the input

• NNs prove not to be very robust to noise.
• Simply train them with random noise applied to their inputs
• Apply noise to hidden units (can be viewed as augmenting data at multiple levels of abstraction)
• Dropout (can be viewed as way to create new inputs by multiplying by noise)
• Can be much more powerful than simply shrinking parameters, especially when noise is added to hidden units
• Add noise to weights (used primarily in RNN)
  • Can be viewed as a stochastic implementation of Bayesian inference over weights
  • Bayesian treatment of learning considers model weights to be uncertain and representable via a probability distribution that reflects this uncertainty
  • Adding noise to weights is a practical, stochastic way to reflect this uncertainty
• Add noise to output targets
  • Assume that, for a small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct
• This assumption is easy to incorporate into cost function analytically
• Label smoothing
  • Pros: prevents the pursuit of hard probabilities without discouraging correct classification

For some models, adding noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights.
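The simplest form of input noise injection can be sketched as below; the noise scale `sigma` is a hyperparameter, and the value here is arbitrary:

```python
import numpy as np

def noisy_batch(x, sigma, rng):
    """Inject zero-mean Gaussian noise into the inputs; a fresh noise
    sample would be drawn at every training step."""
    return x + sigma * rng.normal(size=x.shape)

rng = np.random.default_rng(1)
x = np.ones((4, 5))
x_noisy = noisy_batch(x, sigma=0.1, rng=rng)
```

Because the noise is zero-mean, the perturbed inputs stay centered on the originals, which is what makes the small-variance limit behave like a weight-norm penalty.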
