At heart, Bayes’s theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.

Things get interesting when you have many pieces of evidence. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions.

For Bayesians, learning is “just” another application of Bayes’ theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner.

Prior probability is prior to seeing any evidence. It reflects your a priori beliefs that what will happen, based on your general knowledge of the universe. Posterior probability is after seeing some evidence. The crucial question is how the posterior probability should evolve as you see more evidence.

Bayesians think that a probability is not a frequency but a subjective degree of belief. All that Bayesian inference lets you do it update your prior beliefs with new evidence to obtain your posterior belief.

Bayesians’ devotion to this idea is near religious, enough to withstand 200 years of attacks and counting.

A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naive Bayes classifier

ML is the art of making false assumptions and getting away with it. As the statistician George Box famously put it: All models are wrong, but some are useful.

The economist Milton Friedman argued that the best theories are the most oversimplified, provided their predictions are accurate, because they explain the most with the least.

Naive Bayes forms the basis of many spam filters: treating spam as a disease whose symptoms are the words in the email.

Naive Bayes is the most widely used learner in Google, said Peter Novig. Surprising accuracy aside, it scales great. Learning a Naive Bayes classifier is just a matter of counting how many times each attribute co-occurs with each class and takes barely longer than reading the data from disk.

Naive Bayes is closely related to perceptron algorithm. The perceptron adds weights, and naive bayes multiplies probabilities, but if you take a logarithm, the latter reduces to the former.

Bayesians algorithms:

- Naive Bayes
- Markov Chain
- Encodes the assumption that the future is conditionally independent of the past given the present

- PageRank algorithm is itself a markov chain. Markov chains are one of the most intensively studies topics in mathematics, but they’re still a very limited kind of probabilistic model.
- Hidden Markov Model (HMM)
- Assumes in addition that each observation depends only on the corresponding state

- Kalman filter
- Bayesian networks

Naive Bayes, PageRank, HMM, Kalman filter are all special cases of Bayesian networks

Bayesian networks are for Bayesians what logic is for symbolists.

Bayesian network can be viewed as a “generative model”, a recipe for probabilistically generating a state of the world: first decide independently whether there’s a burglary and/or an earthquake, then based on that decide whether the alarm goes off, and then based on that whether Bob and Claire call. A Bayesian network tells a story: A happened, and it led to B; at the same time, C also happened, and B and C together caused D. To compute the probability of a particular story, we just multiply the probabilities of all its different strands.

Use MCMC to do Bayesian inference. Bayesian learning is just another kind of probabilistic inference.

Markov chain Monte Carlo (MCMC) is to do a random walk, like the proverbial drunkard, jumping from state to state of the network in such a s way that, in the long run, the number of times each state is visited is proportional to its probability. The trick in MCMC is to design a Markov chain that converges to the distribution of our Bayesian network.

MCMC is often viewed as one of the most important algorithms of all time: it is good not just for computing probabilities but for integrating any function. However, MCMC is very slow to converge, or folls you by looking like it’s converged when it hasn’t.

Inference in Bayesian networks is not limited to computing probabilities. It also includes finding the most probable explanation for the evidence.

For a Bayesian, there is in fat no such thing as the truth; you have a prior distribution over hypotheses, after seeing the data it becomes the posterior distribution, and that’s all. This is a radical departure from the way science is usually done. It’s like saying: “actually, neither Copernicus nor Ptolemy was right; let’s just predict the planet’s future trajectories assuming Earth goes round the sun and vice versa and weighted-average the results.”

As the joke goes, being Bayesian means never having to say you’re certain.

Carrying around a multitude of hypotheses instead of just one is a huge pain. Luckily, most the the time, almost all hypotheses wind up with a tiny posterior probability, and we can safely ignore them. Just considering the single most probable hypothesis is usually a very good approximation.

If we’re willing to assume all hypotheses are equally likely a priori, the Bayesian approach reduces to the maximum likelihood principle.

If you’re a frequentist, you can only estimate probabilities of events that can occur more than once. But for a Bayesian, a probability is a subjective degree of belief, so he’s free to make an educated guess.

Symbolists don’t like probabilities and tell jokes like “how many Bayesians does it take to change a lightbulb? They’re not sure. Come to think of it, they’re not sure the lightbulb is burned out”. They complain:

- Inference becomes a lot more expensive
- All numbers are hard to understand
- Have to deal with priors
- Hordes of zombie hypotheses chase us around forever
- We don’t know how to put probability distributions on many things we need to learn
- Bayesian network is a distribution over a vector of variables, but what about distributions over networks, databases, knowledge bases, languages, plans, programs?

We need both logic and probability. Logic cannot deal with incomplete or noisy information, but Bayesian networks can handle. But unifying logic and probability is a much harder problem. Most experts believed that unifying logic and probability was impossible.