Original Nature paper: “Mastering the game of Go with deep neural networks and tree search”.
The game of Go is the most challenging of classic games for AI:
* enormous search space
* difficulty of evaluating board positions and moves
AlphaGo's approach:
* use a value network to evaluate board positions
* use policy networks to select moves
The neural networks are trained by a combination of:
* supervised learning from human expert games
* reinforcement learning (RL) from games of self-play
The paper introduces a new search algorithm:
* combines Monte Carlo simulation with value and policy networks
* achieved a 99.8% winning rate against other Go programs
* defeated the human European Go champion by 5 games to 0
* the first time a computer program defeated a human professional player in the full-sized game of Go
Here, ‘position’ means the entire board state at time t.
A search tree contains approximately b^d possible sequences of moves, where
* b is the game’s breadth (number of legal moves per position)
* d is its depth (game length)
* chess: b ≈ 35, d ≈ 80
* Go: b ≈ 250, d ≈ 150
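The b^d figures above can be sanity-checked in a few lines (a quick sketch using the approximate breadth/depth numbers quoted above):

```python
import math

# Rough size of the game tree, b^d, using the approximate figures above.
def log10_search_space(b, d):
    """Return log10(b^d) = d * log10(b); b**d itself is astronomically
    large (250.0**150 would overflow a float), so work in log10."""
    return d * math.log10(b)

chess = log10_search_space(35, 80)
go = log10_search_space(250, 150)

print(f"chess: ~10^{chess:.0f} sequences")
print(f"Go:    ~10^{go:.0f} sequences")
```

Go's tree is roughly 10^236 times larger than chess's, which is why exhaustive search is hopeless.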
effective search space can be reduced by two general principles:
* depth of search may be reduced by position evaluation
* truncating the search tree at state s and replacing the subtree below s with an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s
* led to superhuman performance in chess, checkers, and Othello
* but this is believed intractable in Go due to the complexity of the game
* breadth of search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s
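A toy sketch of both reduction principles together (all functions here are hypothetical stand-ins, not the paper's networks): truncate the search at a fixed depth and fall back to an approximate value, and expand only a sampled subset of moves rather than every legal one:

```python
import random

def approx_value(state):
    # Stand-in for the learned v(s) ≈ v*(s); here just a toy heuristic.
    return sum(state) / (len(state) or 1)

def policy_sample(legal_moves, k=3):
    # Stand-in for p(a|s): sample only k moves instead of expanding all of them.
    return random.sample(legal_moves, min(k, len(legal_moves)))

def search(state, depth):
    legal_moves = [1, -1, 2, -2, 3]          # toy move set
    if depth == 0:                           # depth reduction: truncate here and
        return approx_value(state)           # back off to the value estimate
    best = float("-inf")
    for a in policy_sample(legal_moves):     # breadth reduction: sampled beam
        best = max(best, -search(state + [a], depth - 1))
    return best

random.seed(0)
v = search([], depth=3)
print(v)
```

The two knobs (`depth` and `k`) directly control how much of the b^d tree is ever visited.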
Monte Carlo tree search (MCTS)
* use Monte Carlo rollouts to estimate the value of each state in a search tree
* as more simulations are executed, the search tree grows larger and the relevant values become more accurate.
* the policy used to select actions during search is also improved over time, by selecting children with higher values.
* asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function
* the strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves
* these policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts
* this approach has achieved strong amateur play
* but prior work has been limited to shallow policies or value functions based on a linear combination of input features
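The MCTS loop described above — select with an exploration bonus, expand one node, evaluate by random rollout, back up the result — can be sketched on a toy single-player counting game (everything here is a hypothetical stand-in; a Go version would plug in real rules and the trained policies):

```python
import math
import random

class Node:
    def __init__(self):
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}  # action -> Node

# Toy game: start at 0, add 1 or 2 per move; landing exactly on 10 is a win.
def actions_for(state): return [1, 2]
def step(state, a): return state + a
def terminal(state): return state >= 10
def outcome(state): return 1.0 if state == 10 else 0.0

def uct_select(node, actions, c=1.4):
    # Prefer children with a higher mean value plus an exploration bonus (UCT).
    def score(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return float("inf")  # always try unvisited actions first
        return (child.value_sum / child.visits
                + c * math.sqrt(math.log(node.visits) / child.visits))
    return max(actions, key=score)

def rollout(state):
    # Monte Carlo rollout: play random moves to the end, return the outcome.
    while not terminal(state):
        state = step(state, random.choice(actions_for(state)))
    return outcome(state)

def mcts(root_state, n_sim=200):
    root = Node()
    for _ in range(n_sim):
        node, state, path = root, root_state, [root]
        # Selection/expansion: walk down the tree, adding one new leaf per simulation.
        while not terminal(state):
            a = uct_select(node, actions_for(state))
            is_new = a not in node.children
            if is_new:
                node.children[a] = Node()
            node, state = node.children[a], step(state, a)
            path.append(node)
            if is_new:
                break
        # Evaluation by rollout, then backup along the visited path.
        value = outcome(state) if terminal(state) else rollout(state)
        for n in path:
            n.visits += 1
            n.value_sum += value
    # The recommended move is the most-visited child of the root.
    return max(root.children, key=lambda a: root.children[a].visits)

random.seed(0)
best = mcts(0)
print("best first move:", best)
```

The "improved over time" point corresponds to `uct_select`: as visit counts grow, the exploration bonus shrinks and the selection policy concentrates on high-value children.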
* pass in the board position as a 19×19 image and use convolutional layers to build a representation of the position
* use nn to reduce effective depth and breadth of search tree
* evaluating positions using a value network (reduce depth)
* sampling actions using a policy network (reduce breadth)
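A minimal illustration of the input representation: encode the board as a 19×19 grid and slide one 3×3 filter over it — the basic operation the paper's convolutional layers stack many times (the stone encoding and the filter below are made up for illustration):

```python
SIZE = 19
# Toy encoding: +1 black stone, -1 white stone, 0 empty (hypothetical; the
# paper actually feeds many binary feature planes, not a single channel).
board = [[0] * SIZE for _ in range(SIZE)]
board[3][3] = 1
board[3][4] = -1

kernel = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]  # an arbitrary 3x3 filter

def conv2d(img, k):
    # 'valid' convolution: each output cell is a weighted sum of a 3x3 window.
    n = len(img)
    return [[sum(k[di][dj] * img[i + di][j + dj]
                 for di in range(3) for dj in range(3))
             for j in range(n - 2)]
            for i in range(n - 2)]

feat = conv2d(board, kernel)
print(len(feat), "x", len(feat[0]))  # one 17x17 feature map
```

Each filter produces one feature map; a real layer learns many filters and pads the input so the 19×19 shape is preserved through the stack.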
* first, train a supervised learning (SL) policy network p1 directly from expert human moves
* this provides fast, efficient learning updates with immediate feedback and high-quality gradients
* train a fast policy p2 that can rapidly sample actions during rollouts
* train a reinforcement learning (RL) policy network p3 using policy gradient
* improves SL policy network p1 by optimizing the final outcome of games of self-play
* this adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy
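The policy-gradient idea can be shown on a one-move toy "game" (the win probabilities and learning rate are invented): REINFORCE scales the log-likelihood gradient by the final outcome z, so moves that tend to win gain probability:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def play(logits):
    # One-step "game": move 0 wins with prob 0.8, move 1 with prob 0.3.
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1
    win_prob = 0.8 if a == 0 else 0.3
    z = 1.0 if random.random() < win_prob else -1.0   # final outcome
    return a, z

random.seed(0)
logits = [0.0, 0.0]
lr = 0.1
for _ in range(2000):
    a, z = play(logits)
    probs = softmax(logits)
    # REINFORCE: d log p(a) / d logit_i = (1[i == a] - p_i), scaled by z.
    for i in range(2):
        logits[i] += lr * z * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(logits))  # policy shifts toward the higher-win-rate move
```

Note the contrast with the SL step: the learning signal is the game outcome z, not agreement with an expert label.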
* train a value network v1 that predicts the winner of games played by RL policy network p3 against itself
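Value-network training is regression: fit v(s) to the game outcome z with squared error. A one-parameter sketch on fake self-play data (the scalar "state feature" and noise model are invented stand-ins):

```python
import random

random.seed(1)
# Fake self-play data: the outcome z in {-1, +1} correlates with a scalar
# state feature s (a real value net would see the whole board instead).
data = [(s, 1.0 if s + random.gauss(0, 0.3) > 0 else -1.0)
        for s in [random.uniform(-1, 1) for _ in range(500)]]

w, b, lr = 0.0, 0.0, 0.05
for _ in range(50):
    for s, z in data:
        v = w * s + b                 # v(s): predicted outcome
        grad = v - z                  # d/dv of 0.5 * (v - z)^2
        w -= lr * grad * s            # SGD step on the squared error
        b -= lr * grad

print(w, b)
```

After training, `w` is clearly positive: the model has learned that states with a higher feature value tend to be won.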
* AlphaGo combines the policy and value networks with MCTS.
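In the paper's search, a leaf position s_L is evaluated by mixing the value network's prediction with the outcome of a fast rollout, V(s_L) = (1 − λ)·v(s_L) + λ·z_L; a one-line sketch:

```python
def leaf_value(v_net, rollout_z, lam=0.5):
    # Mix the value-network estimate with the rollout outcome; the paper
    # reports that the mixing constant λ = 0.5 performed best.
    return (1 - lam) * v_net + lam * rollout_z

print(leaf_value(0.6, 1.0))  # → 0.8
```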