Reading notes: Regularization techniques in Deep Learning (part 1)

Regularization: any modification we make to a learning algorithm to reduce its generalization error but not its training error.

Many regularization strategies:

  • Put extra constraints on the ML model (e.g., on parameter values)
  • Add extra terms to the objective function (viewed as a soft constraint on parameter values; see the penalized objective below)
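
For reference, the soft-constraint version adds a weighted penalty Ω to the standard objective J (this is the book's notation; α ≥ 0 controls the strength of the penalty relative to J):

    \tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)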

Constraints/penalties in regularization are designed to:

  • Encode specific kinds of prior knowledge
  • Express a generic preference for a simpler model class
  • Make an underdetermined problem determined
  • Combine multiple hypotheses that explain the training data (eg, ensemble methods)

In DL, most regularization strategies are based on regularizing estimators

  • Trading increased bias for reduced variance

We are always trying to fit a square peg (the data-generating process) into a round hole (our model family)

Strategy: Parameter norm penalties (L1, L2)

  • Limit the capacity of model
  • For NNs, we typically penalize only the weights of the affine transformation at each layer and leave the biases unregularized
    • Bias requires less data than weights to fit accurately
    • Each weight specifies how two variables interact
    • Each bias controls only a single variable
    • We do not induce too much variance by leaving biases unregularized
    • Regularizing bias can introduce a significant amount of underfitting

The L2 norm penalty is known as weight decay.

Adding the weight decay term shrinks the weight vector by a constant factor on each step, just before performing the usual gradient update.
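
A minimal numpy sketch of that update, assuming plain SGD with learning rate lr and penalty coefficient alpha (both names are mine, not from the notes), and leaving the bias unregularized as discussed above:

    import numpy as np

    def sgd_step_with_weight_decay(W, b, grad_W, grad_b, lr=0.1, alpha=1e-4):
        # Weight decay: shrink W by a constant factor (1 - lr * alpha) ...
        W = (1.0 - lr * alpha) * W
        # ... then perform the usual gradient update.
        W = W - lr * grad_W
        # Bias is left unregularized: plain gradient step only.
        b = b - lr * grad_b
        return W, b

    # Example usage with random placeholder gradients
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(3, 2)), np.zeros(2)
    W, b = sgd_step_with_weight_decay(W, b,
                                      grad_W=rng.normal(size=(3, 2)),
                                      grad_b=rng.normal(size=2))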

Many regularization strategies can be viewed as MAP Bayesian inference. L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.
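
A quick sketch of the equivalence, assuming a zero-mean isotropic Gaussian prior p(w) ∝ exp(−(α/2) wᵀw), so that log p(w) is an L2 penalty up to an additive constant:

    \arg\max_w \log p(w \mid X, y)
      = \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right]
      = \arg\min_w \left[ -\log p(y \mid X, w) + \tfrac{\alpha}{2} \lVert w \rVert_2^2 \right]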

Read section 4.4 for constructing a generalized Lagrange function.

Strategy: Dataset augmentation

Having more data is the best way to generalize better; when more real data is not available, create fake data by transforming existing examples (see the sketch below).
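
A minimal sketch of one way to create fake data for image inputs, assuming images are numpy arrays of shape (H, W, C); the specific transforms (horizontal flip, pad-and-crop) are illustrative choices, not something prescribed in the notes:

    import numpy as np

    def augment(image, rng, crop=4):
        """Return a randomly flipped and cropped copy of `image` (H, W, C)."""
        # Random horizontal flip
        if rng.random() < 0.5:
            image = image[:, ::-1, :]
        # Pad with reflection, then take a random crop of the original size
        h, w, _ = image.shape
        padded = np.pad(image, ((crop, crop), (crop, crop), (0, 0)), mode="reflect")
        top = rng.integers(0, 2 * crop + 1)
        left = rng.integers(0, 2 * crop + 1)
        return padded[top:top + h, left:left + w, :]

    rng = np.random.default_rng(0)
    fake_example = augment(np.zeros((32, 32, 3)), rng)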

Strategy: Noise injection

Injecting noise in the input

  • NNs prove not to be very robust to noise.
    • Simply train them with random noise applied to their inputs
    • Apply noise to hidden units (can be viewed as augmenting the data at multiple levels of abstraction)
    • Dropout (can be viewed as a way to create new inputs by multiplying by noise)
  • Can be much more powerful than simply shrinking parameters, especially when noise is added to hidden units
  • Add noise to the weights (used primarily in RNNs)
    • Can be viewed as a stochastic implementation of Bayesian inference over the weights
    • A Bayesian treatment of learning considers the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty
    • Adding noise to the weights is a practical, stochastic way to reflect this uncertainty
  • Add noise to the output targets
    • Assume that for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct
    • This assumption is easy to incorporate into the cost function analytically
    • Label smoothing (see the sketch below)
      • Pro: prevents the pursuit of hard 0/1 probabilities without discouraging correct classification

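A minimal numpy sketch of label smoothing, assuming one-hot targets over k classes; this uses the common uniform variant that spreads ε over all k classes (the notes above describe spreading ε over the other k − 1 labels, which is the same idea):

    import numpy as np

    def smooth_labels(one_hot, eps=0.1):
        """Replace hard 0/1 targets with soft targets: eps/k and 1 - eps + eps/k."""
        k = one_hot.shape[-1]
        return one_hot * (1.0 - eps) + eps / k

    targets = np.eye(4)[[0, 2]]       # two one-hot labels over 4 classes
    print(smooth_labels(targets))     # hard 1s become 0.925, hard 0s become 0.025
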
For some models, adding noise with infinitesimal variance at the input is equivalent to imposing a penalty on the norm of the weights.
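
And a minimal sketch of that simplest case, injecting zero-mean Gaussian noise at the input on each training step (sigma and the function name are illustrative assumptions, not from the notes):

    import numpy as np

    def noisy_batch(X, rng, sigma=0.1):
        """Return a copy of the input batch with zero-mean Gaussian noise added."""
        return X + sigma * rng.normal(size=X.shape)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 16))      # a batch of 8 examples with 16 features
    X_noisy = noisy_batch(X, rng)     # feed X_noisy (not X) to the training step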
