Michael Jordan’s View on Deep Learning

See the original post on reddit.

Here is his full reply on reddit.

I highlighted the interesting parts in color. I agree strongly with the sentences in red, but not so much with those in purple.

OK, I guess that I have to say something about “deep learning”. This seems like as good a place as any (apologies, though, for not responding directly to your question).

“Deep” breath.

My first and main reaction is that I’m totally happy that any area of machine learning (aka, statistical inference and decision-making; see my other post 🙂) is beginning to make an impact on real-world problems. I’m in particular happy that the work of my long-time friend Yann LeCun is being recognized, promoted and built upon. Convolutional neural networks are just a plain good idea.

I’m also overall happy with the rebranding associated with the usage of the term “deep learning” instead of “neural networks”. In other engineering areas, the idea of using pipelines, flow diagrams and layered architectures to build complex systems is quite well entrenched, and our field should be working (inter alia) on principles for building such systems. The word “deep” just means that to me—layering (and I hope that the language eventually evolves toward such drier words…). I hope and expect to see more people developing architectures that use other kinds of modules and pipelines, not restricting themselves to layers of “neurons”.

With all due respect to neuroscience, one of the major scientific areas for the next several hundred years, I don’t think that we’re at the point where we understand very much at all about how thought arises in networks of neurons, and I still don’t see neuroscience as a major generator for ideas on how to build inference and decision-making systems in detail. Notions like “parallel is good” and “layering is good” could well (and have) been developed entirely independently of thinking about brains.

I might add that I was a PhD student in the early days of neural networks, before backpropagation had been (re)-invented, where the focus was on the Hebb rule and other “neurally plausible” algorithms. Anything that the brain couldn’t do was to be avoided; we needed to be pure in order to find our way to new styles of thinking. And then Dave Rumelhart started exploring backpropagation—clearly leaving behind the neurally-plausible constraint—and suddenly the systems became much more powerful. This made an impact on me. Let’s not impose artificial constraints based on cartoon models of topics in science that we don’t yet understand.

My understanding is that many if not most of the “deep learning success stories” involve supervised learning (i.e., backpropagation) and massive amounts of data. Layered architectures involving lots of linearity, some smooth nonlinearities, and stochastic gradient descent seem to be able to memorize huge numbers of patterns while interpolating smoothly (not oscillating) “between” the patterns; moreover, there seems to be an ability to discard irrelevant details, particularly if aided by weight-sharing in domains like vision where it’s appropriate. There’s also some of the advantages of ensembling. Overall an appealing mix. But this mix doesn’t feel singularly “neural” (particularly the need for large amounts of labeled data).

Indeed, it’s unsupervised learning that has always been viewed as the Holy Grail; it’s presumably what the brain excels at and what’s really going to be needed to build real “brain-inspired computers”. But here I have some trouble distinguishing the real progress from the hype. It’s my understanding that in vision at least, the unsupervised learning ideas are not responsible for some of the recent results; it’s the supervised training based on large data sets.

One way to approach unsupervised learning is to write down various formal characterizations of what good “features” or “representations” should look like and tie them to various assumptions that seem to be of real-world relevance. This has long been done in the neural network literature (but also far beyond). I’ve seen yet more work in this vein in the deep learning work and I think that that’s great. But I personally think that the way to go is to put those formal characterizations into optimization functionals or Bayesian priors, and then develop procedures that explicitly try to optimize (or integrate) with respect to them. This will be hard and it’s an ongoing problem to approximate. In some of the deep learning work that I’ve seen recently, there’s a different tack: one uses one’s favorite neural network architecture, analyses some data and says “Look, it embodies those desired characterizations without having them built in”. That’s the old-style neural network reasoning, where it was assumed that just because it was “neural” it embodied some kind of special sauce. That logic didn’t work for me then, nor does it work for me now.
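To make the “optimization functionals or Bayesian priors” remark concrete, here is one generic way it could be written down. This is only a sketch and is not taken from the reply: the symbols (data x, parameters θ, loss L, penalty R encoding the desired properties of the representation, and weight λ) are all assumed notation.

```latex
% Assumed notation: x = observed data, \theta = model parameters,
% \mathcal{L} = data-fit loss, R = penalty encoding the desired properties
% of the representation, \lambda = trade-off weight.
\begin{align}
\hat{\theta} &= \arg\min_{\theta} \; \mathcal{L}(x; \theta) + \lambda\, R(\theta)
  && \text{(optimization functional)} \\
p(\theta \mid x) &\propto p(x \mid \theta)\, p(\theta)
  && \text{(Bayesian prior)} \\
p(x_{\text{new}} \mid x) &= \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid x)\, d\theta
  && \text{(integrate instead of optimize)}
\end{align}
```

The first two lines coincide when the loss is the negative log-likelihood and the prior is taken as p(θ) ∝ exp(−λ R(θ)): the minimizer is then the posterior mode. In both cases the desired characterization is stated explicitly rather than hoped for as a by-product of the architecture.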

Lastly, and on a less philosophical level, while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or so that I can figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?
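As a small illustration of question (2), and not something from the reply itself, here is a minimal sketch of one standard way to attach an error bar to an aggregate query: the percentile bootstrap. The function name `bootstrap_mean_interval`, its parameters, and the sample rows are all made up for the example, and the mean stands in for whatever aggregate the query computes.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (estimate, lower, upper) for the mean of `values`.

    Percentile bootstrap: resample with replacement, recompute the mean
    each time, and read the interval off the sorted resampled means.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Made-up rows, standing in for a numeric column returned by a query.
    rows = [12.0, 15.5, 9.0, 22.0, 13.5, 18.0, 11.0, 16.5]
    est, lower, upper = bootstrap_mean_interval(rows)
    print(f"mean = {est:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Resampling the full result set this way gets expensive on large tables, which is part of why the question is hard at database scale; the sketch only shows the basic idea.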

Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.

Based on seeing the kinds of questions I’ve discussed above arising again and again over the years I’ve concluded that statistics/ML needs a deeper engagement with people in CS systems and databases, not just with AI people, which has been the main kind of engagement going on in previous decades (and still remains the focus of “deep learning”). I’ve personally been doing exactly that at Berkeley, in the context of the “RAD Lab” from 2006 to 2011 and in the current context of the “AMP Lab”.
