Raw performance - SanchoGGP/ggp-base GitHub Wiki

Thoughts on improving raw performance (i.e. making things faster rather than smarter).

Reading

See the Parallel MCTS section of the Papers page.

Terminology

  • Root Parallelization: Each thread creates its own independent MCTS tree. When the move must be submitted, stats for the immediate children of the root are combined to give the final result.

  • Leaf Parallelization: A master thread does the select & (usually) expand phases. Other threads do the rollouts and pass results to the master thread for doing updates. Usually combined with Coulom's "Virtual Losses".

  • Tree Parallelization: Each thread does the full MCTS algorithm, all sharing the same tree. Usually, locks are used to prevent interference. Some lock-free variants exist.

  • Virtual Losses: Usually used with leaf parallelization, virtual losses are the practice of recording a loss in each node during the selection pass. In the update pass, the result is corrected. This means that if multiple select passes as performed before the first asynchronous results become available, the later select passes are likely to choose a different path through the tree. (Without this technique, they would be guaranteed to choose the same path.)

Paper Reviews

See Papers for full details, authors, links, etc..

On the Parallelization of UCT

An early paper, most of which has been superseded.

3 methods of doing distributed MCTS.

  1. Root parallelization
  2. Root parallelization with occasional cross-thread updates
  3. Leaf parallelization done badly (w/o virtual losses and with all threads doing rollouts from the same node and the master thread waiting for results)

Results were fairly flat across the different methods.

  • Using method (1) gives nearly all of the benefits.
  • Method (2) is better than 1 when the number of iterations in increased from 3,000 to 10,000.
  • Method (3) lies between (1) and (2).

Scalable Distributed Monte-Carlo Tree Search

Not finished yet. Key takeaway so far is that it's worth reading "A Lock-free Multithreaded Monte-Carlo Tree Search Algorithm".

Two other interesting points so far...

  • It might be worth not expanding children of a node until the node has had at least N visits (with suggested N=4 for Go). Issue #340 covers this.
  • Depth-first UCT is the main contribution of this paper. This means that updates only go back up the tree to the point that they would have made a difference to the selection path. The next select works its way down from that point. Results are aggregated when eventually propagated up. This considerably reduced contention at the root.

Performance results

August 2015

Performance results immediately after IGGPC15.

Running on ARR Linux server with...

java -XX:+UseG1GC -Xmx5g -d64 -jar Perf.jar -statemachine -gamesearcher -time10 -repeat3 -varythreads englishDraughts hex dotsAndBoxes speedChess checkers reversi cephalopodMicro nineBoardTicTacToe pentago connectFour

Y-axis is rollouts/s. X-axis is number of CPU-intensive threads. The results for 1 CPU-intensive thread are for state machine only. All other results are for the game searcher (running with 1 tree thread and N-1 rollout threads).

See #355 for analysis.

Running on ARR Windows desktop with...

-statemachine -gamesearcher -varythreads -time10 -repeat5 englishDraughts hex dotsAndBoxes speedChess checkers reversi cephalopodMicro nineBoardTicTacToe pentago connectFour

May 2015 - Sancho vs Galvanise

Performance snapshot from 15th May 2015.

State machine

Single-threaded, state machine only, performance of Sancho & Galvanise.

Game Sancho Galvanise S % of G
Hex 1516 1374 110
Reversi 2158 1614 134
Speed Chess 2238 3269 68
Checkers 3902 4464 87
English Draughts 6146 5569 110
Pentago 9563 9179 104
Ceph Micro 12748 11410 112
Dots & Boxes 21707 21451 101
9-board TTT 29112 19949 146
C4 83860 78686 107

Game searcher

Testing the full game tree searcher (multi-threaded) on Dots and Boxes & Connect 4, Galvanise gets 400 - 500% the performance of Sancho. Sancho was measured using 4 threads (1 tree thread + 3 rollout threads). Galvanise uses 1 tree thread + a variable number of rollout threads (3 for 1 game, 5 for another).

Game Sancho Galvanise S % of G
Hex 4580
Reversi 4901
Speed Chess 6058
Checkers 13913
English Draughts 19120
Pentago 23754
Ceph Micro 23246
Dots & Boxes 17921 90000 20
9-board TTT 23749
C4 35609 160000 22

Hardware specs

ARR Windows desktop

1 x Intel Core i5 3470, quad-core (no hyper-threading) with 8GB RAM. Passmark = 6,607.

ARR Windows laptop

My laptop seldom has inbound internet connectivity.

1 x Intel Core i7 3540M, dual-core with 8GB RAM. 4 threads (1 CPU * 2 cores/CPU * 2 threads/core). Passmark = 4,646.

ARR Linux server

I have occasional access to a Linux server, with no inbound internet connectivity.

2 x Intel Xeon L5638, hexa-core with 24GB RAM. Total system has 24 threads (2 CPUs * 6 cores/CPU * 2 threads/core). Passmark (1 CPU) = 5,956. Implicit 2-CPU score is c. 12K.

SD Windows desktop

1 x Intel Core i7 ????, quad-core with 24GB RAM. 8 threads (1 CPU * 4 cores/CPU * 2 threads/core). Passmark = ?,???.