Reading Notes: Learning Deep Architectures for AI (by Yoshua Bengio 2009)

(The notes for Energy-based Models and Boltzmann Machines are not included here. I will try to add it sometime in the future.)

Assumption: Computational machinery necessary to express complex behaviors requires highly varying mathematical functions (eg, highly non-linear functions).

Two things discussed:

  • Depth of architecture
  • Locality of estimators
    • what matters for generalization is not dimensionality, but instead the number of “variations” of the function we wish to obtain after learning.

Theoretical Limitations of Shallow Architectures

  • Functions that can be compactly represented by a depth k architecture might require an exponential number of computational elements to be represented by a depth k − 1 architecture.
  • Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.

Depth of an architecture

The depth of an architecture is the maximum length of a path from any input of graph to any output of graph.

  • If we include affine operations and sigmoids in the set of computational elements, linear regression and logistic regression have depth 1, a single level.
  • Kernel machines with a fixed kernel can be considered to have 2 levels.
  • If we put artificial neurons (affine transformation followed by a non-linearity) in our set of computational elements, we get ordinary multi-layer neural network. Most commonly we choose 1 hidden layer, thus having depth 2.
  • Decision trees can be seen as having 2 levels.
  • Boosting usually adds one level to its base learners: the level computes a vote or linear combination of the outputs of the base learners
  • Stacking (wolpert 1992) is another meta-learning algorithm that adds one level.

It is not the absolute number of levels that matters, but the number of levels relative to how many are required to represent efficiently the target function (with some choice of set of computational elements).

The most formal arguments about the power of deep architectures come from investigations into computational complexity of circuits. The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one.

  • A two-layer circuit of logic gates can represent any boolean function (Mendelson, 1997).
  • Any boolean function can be written as a sum of products, or a product of sums.
  • Computational complexity results for boolean circuits are relevant to machine learning.
  • Monotone weighted threshold circuits (multi-layer neural networks with linear threshold units and positive weights).

Are the depth 1, 2, and 3 architectures (typically found in most ML algorithms) too shallow to represent efficiently more complicated functions (tasks)?

There might be no universally right depth: each function (task) might require a particular minimum depth (for a given set of computational elements). Therefore, we should strive to develop learning algorithms that use the data to decide the depth of the final architecture.

A function is highly-varying when a piecewise approximation (eg, piecewise-constant or piecewise-linear) of that function would require a large number of pieces.

In general, deep architectures can compactly represent highly-varying function which would otherwise require a very large size to be represented with an inappropriate architecture.

The composition of computational units in a small but deep circuit can be seen as an efficient factorization of a large but shallow circuit.

  • A great example is the polynomial . This can be represented efficiently as a product of sums, with only O(mn) computational elements. But it would be very inefficiently represented with a sum of product architecture, requiring O(n^m) computational elements.
  • Thus, functions that can be compactly represented with a depth k architecture could require a very large number of elements in order to be represented by a shallower architecture.

Local vs Non-Local Generalization: the Limits of Matching Local Templates

Highly-varying functions cannot be represented efficiently if the learning algorithm is a local estimator.

  • A local estimator partitions the input space in regions (possibly in a soft rather than hard way) and requires different parameters or degrees of freedom to account for the possible shape of the target function in each of the regions.
  • More generally, architectures based on matching local templates can be thought of as having two levels. The first level is made of a set of templates which can be matched to the input. A template unit will output a value that indicates the degree of matching. The second level combines these values, typically with a simple linear combination (an OR-like operation), in order to estimate the desired output.

Examples of kernel machines:

  • SVM for classification
  • Gaussian processes for regression
  • KNN (for classification, regression and density estimation)
  • Manifold learning algorithms (Isomap, LLE)

Kernel machines with a local kernel yield generalization by exploiting what could be called the smoothness prior: the assumption that the target function is smooth or can be well approximated with a smooth function.

  • Such prior is useful, but often insufficient to generalize when target function is highly-varying in input space
  • The limitations of a fixed generic kernel such as the Gaussian kernel have motivated a lot of research in designing kernels based on prior knowledge about the task
    • If we lack sufficient prior knowledge for designing an appropriate kernel, can we learn it?
      • Gaussian Process kernel machine can be improved using a Deep Belief Network to learn a feature space
      • Good representations make examples which share abstract characteristics close to each other. Learning algorithms for deep architectures can be seen as ways to learn a good feature space for kernel machines.
  • For complex tasks in high dimension, the complexity of the decision surface could quickly make learning impractical when using a local kernel method.

Decision trees do not generalize to new variations

  • Decision trees are also local estimators in the sense of relying on a partition of the input space and using separate parameters for each region
  • They need at least as many training examples as there are variations of interest in the target function, and they cannot generalize to new variations not covered in the training set.
  • Learning algorithms for decision trees are non-parametric and involve a non-convex optimization to choose a tree structure and parameters associated with nodes and leaves
    • Greedy heuristics
  • The outputs of decision nodes are multiplied and form a conjunction: an example has to satisfy all the conditions to belong to a leaf region.
  • A decision tree needs a separate leaf node to properly model each such variation, and at least one training example for each leaf node.
  • The generalization performance of decision trees degrades when the number of variations in the target function increases.

Ensembles of trees and forests add a 3rd level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters. They implicitly form a distributed representation.

Kolmogorov complexity is the length of the smallest string that represents the solution, in some programming language.

  • It is not computable, but it can be bounded from above.
  • Upper bounds on Kolmogoriv complexity can be optimized.

Learning theory shows that if a compact description can be found to summarize the training set, good generalization is to be expected.

A good predictive model (in terms of out-of-sample log-likelihood) is also one that can assign a short code to each example, on average, including not only the bits to describe each example but also the bits necessary to describe the model itself.

Learning distributed representations

  • For the same number of possible configurations, a distributed representation can potentially be exponentially more compact than a very local one.
  • In a distributed representation the input pattern is represented by a set of features that are not mutually exclusive, and might even be statistically independent.
  • In the realm of supervised learning: Multi-layer neural networks and boltzmann machine are with the goal of learning distributed internal representations in the hidden layers
  • Gradient-based training of deep supervised multi-layer neural networks gets stuck in local minima or plateaus.
  • In a gradient-trained deep supervised neural network with random parameter initialization, the lower layers (closer to inputs) are poorly optimized. For those lower layers, these deterministic and continuous representations generally keep most of the information about the input, but these representations might hurt rather than help the top layers to perform classification. Thus, what matters for good generalization, and is more difficult, is the optimization of the lower layers (excluding the last one or two).

Other comments

  • It would be interesting to consider extensions of decision tree and boosting algorithms to multiple levels.
  • Prefer sparse representation not dimensionality reduction
  • Unsupervised learning is crucial.
  • Unsupervised learning is less prone to overfitting than supervised learning.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s