Raw performance - SanchoGGP/ggp-base GitHub Wiki
Thoughts on improving raw performance (i.e. making things faster rather than smarter).
Reading
See the Parallel MCTS section of the Papers page.
Terminology
-
Root Parallelization: Each thread creates its own independent MCTS tree. When the move must be submitted, stats for the immediate children of the root are combined to give the final result.
-
Leaf Parallelization: A master thread does the select & (usually) expand phases. Other threads do the rollouts and pass results to the master thread for doing updates. Usually combined with Coulom's "Virtual Losses".
-
Tree Parallelization: Each thread does the full MCTS algorithm, all sharing the same tree. Usually, locks are used to prevent interference. Some lock-free variants exist.
-
Virtual Losses: Usually used with leaf parallelization, virtual losses are the practice of recording a loss in each node during the selection pass. In the update pass, the result is corrected. This means that if multiple select passes as performed before the first asynchronous results become available, the later select passes are likely to choose a different path through the tree. (Without this technique, they would be guaranteed to choose the same path.)
Paper Reviews
See Papers for full details, authors, links, etc..
On the Parallelization of UCT
An early paper, most of which has been superseded.
3 methods of doing distributed MCTS.
- Root parallelization
- Root parallelization with occasional cross-thread updates
- Leaf parallelization done badly (w/o virtual losses and with all threads doing rollouts from the same node and the master thread waiting for results)
Results were fairly flat across the different methods.
- Using method (1) gives nearly all of the benefits.
- Method (2) is better than 1 when the number of iterations in increased from 3,000 to 10,000.
- Method (3) lies between (1) and (2).
Scalable Distributed Monte-Carlo Tree Search
Not finished yet. Key takeaway so far is that it's worth reading "A Lock-free Multithreaded Monte-Carlo Tree Search Algorithm".
Two other interesting points so far...
- It might be worth not expanding children of a node until the node has had at least N visits (with suggested N=4 for Go). Issue #340 covers this.
- Depth-first UCT is the main contribution of this paper. This means that updates only go back up the tree to the point that they would have made a difference to the selection path. The next select works its way down from that point. Results are aggregated when eventually propagated up. This considerably reduced contention at the root.
Performance results
August 2015
Performance results immediately after IGGPC15.
Running on ARR Linux server with...
java -XX:+UseG1GC -Xmx5g -d64 -jar Perf.jar -statemachine -gamesearcher -time10 -repeat3 -varythreads englishDraughts hex dotsAndBoxes speedChess checkers reversi cephalopodMicro nineBoardTicTacToe pentago connectFour
Y-axis is rollouts/s. X-axis is number of CPU-intensive threads. The results for 1 CPU-intensive thread are for state machine only. All other results are for the game searcher (running with 1 tree thread and N-1 rollout threads).
See #355 for analysis.
Running on ARR Windows desktop with...
-statemachine -gamesearcher -varythreads -time10 -repeat5 englishDraughts hex dotsAndBoxes speedChess checkers reversi cephalopodMicro nineBoardTicTacToe pentago connectFour
May 2015 - Sancho vs Galvanise
Performance snapshot from 15th May 2015.
State machine
Single-threaded, state machine only, performance of Sancho & Galvanise.
Game | Sancho | Galvanise | S % of G |
---|---|---|---|
Hex | 1516 | 1374 | 110 |
Reversi | 2158 | 1614 | 134 |
Speed Chess | 2238 | 3269 | 68 |
Checkers | 3902 | 4464 | 87 |
English Draughts | 6146 | 5569 | 110 |
Pentago | 9563 | 9179 | 104 |
Ceph Micro | 12748 | 11410 | 112 |
Dots & Boxes | 21707 | 21451 | 101 |
9-board TTT | 29112 | 19949 | 146 |
C4 | 83860 | 78686 | 107 |
Game searcher
Testing the full game tree searcher (multi-threaded) on Dots and Boxes & Connect 4, Galvanise gets 400 - 500% the performance of Sancho. Sancho was measured using 4 threads (1 tree thread + 3 rollout threads). Galvanise uses 1 tree thread + a variable number of rollout threads (3 for 1 game, 5 for another).
Game | Sancho | Galvanise | S % of G |
---|---|---|---|
Hex | 4580 | ||
Reversi | 4901 | ||
Speed Chess | 6058 | ||
Checkers | 13913 | ||
English Draughts | 19120 | ||
Pentago | 23754 | ||
Ceph Micro | 23246 | ||
Dots & Boxes | 17921 | 90000 | 20 |
9-board TTT | 23749 | ||
C4 | 35609 | 160000 | 22 |
Hardware specs
ARR Windows desktop
1 x Intel Core i5 3470, quad-core (no hyper-threading) with 8GB RAM. Passmark = 6,607.
ARR Windows laptop
My laptop seldom has inbound internet connectivity.
1 x Intel Core i7 3540M, dual-core with 8GB RAM. 4 threads (1 CPU * 2 cores/CPU * 2 threads/core). Passmark = 4,646.
ARR Linux server
I have occasional access to a Linux server, with no inbound internet connectivity.
2 x Intel Xeon L5638, hexa-core with 24GB RAM. Total system has 24 threads (2 CPUs * 6 cores/CPU * 2 threads/core). Passmark (1 CPU) = 5,956. Implicit 2-CPU score is c. 12K.
SD Windows desktop
1 x Intel Core i7 ????, quad-core with 24GB RAM. 8 threads (1 CPU * 4 cores/CPU * 2 threads/core). Passmark = ?,???.