Raw performance - SanchoGGP/ggp-base GitHub Wiki

Thoughts on improving raw performance (i.e. making things faster rather than smarter).

Reading

See the Parallel MCTS section of the Papers page.

Terminology

Root Parallelization: Each thread creates its own independent MCTS tree. When the move must be submitted, stats for the immediate children of the root are combined to give the final result.
Leaf Parallelization: A master thread does the select & (usually) expand phases. Other threads do the rollouts and pass results to the master thread for doing updates. Usually combined with Coulom's "Virtual Losses".
Tree Parallelization: Each thread does the full MCTS algorithm, all sharing the same tree. Usually, locks are used to prevent interference. Some lock-free variants exist.
Virtual Losses: Usually used with leaf parallelization. A loss is recorded in each node on the selected path during the selection pass, and the result is corrected in the update pass. This means that if multiple select passes are performed before the first asynchronous results become available, the later select passes are likely to choose a different path through the tree. (Without this technique, they would be guaranteed to choose the same path.)
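As a rough illustration of virtual losses, here is a minimal single-threaded sketch. The `VlNode` class, its fields and the use of plain UCB1 are assumptions made for the example; this is not Sancho's actual node implementation. A virtual loss is modelled as an extra visit with a score of 0, and the update pass adds the real score without counting the visit again.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node type for illustration only. */
final class VlNode {
    final List<VlNode> children = new ArrayList<>();
    int visits;         // completed visits plus in-flight (virtual) visits
    double totalScore;  // sum of rollout scores in [0, 1]; in-flight visits add 0

    /** UCB1 value of this node, given the parent's visit count. */
    double ucb(int parentVisits, double c) {
        if (visits == 0) {
            return Double.POSITIVE_INFINITY;
        }
        return totalScore / visits + c * Math.sqrt(Math.log(parentVisits) / visits);
    }

    /**
     * Select pass: pick the child with the highest UCB1 value and record a
     * virtual loss on it. Assumes this node has at least one child.
     */
    VlNode selectChildWithVirtualLoss(double c) {
        int parentVisits = 0;
        for (VlNode child : children) {
            parentVisits += child.visits;
        }
        VlNode best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (VlNode child : children) {
            double value = child.ucb(parentVisits, c);
            if (value > bestValue) {
                bestValue = value;
                best = child;
            }
        }
        best.visits++;  // virtual loss: one more visit, no score added
        return best;
    }

    /** Update pass: the visit is already counted, so only add the real score. */
    void applyResult(double score) {
        totalScore += score;
    }
}
```

With two children that have equal visit counts and slightly different averages, the first select picks the better child, and the virtual loss is enough to steer the second select to the other child before any rollout result has arrived.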

Paper Reviews

See Papers for full details, authors, links, etc.

On the Parallelization of UCT

An early paper, most of which has been superseded.

3 methods of doing distributed MCTS.

(1) Root parallelization
(2) Root parallelization with occasional cross-thread updates
(3) Leaf parallelization done badly (w/o virtual losses and with all threads doing rollouts from the same node and the master thread waiting for results)

Results were fairly flat across the different methods.

Using method (1) gives nearly all of the benefits.
Method (2) is better than (1) when the number of iterations is increased from 3,000 to 10,000.
Method (3) lies between (1) and (2).
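For reference, the combining step of root parallelization (method 1) can be sketched as below. The class and method names are hypothetical, moves are represented as plain strings, and choosing the final move by highest combined visit count is an assumption for the example (the paper and Sancho may use a different criterion).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that merges root-child stats from independent trees. */
final class RootParallelCombiner {
    /** Visit count and total score for one move at the root of one tree. */
    static final class MoveStats {
        final long visits;
        final double totalScore;

        MoveStats(long visits, double totalScore) {
            this.visits = visits;
            this.totalScore = totalScore;
        }
    }

    /** Sum visits and scores per move across all per-thread trees. */
    static Map<String, MoveStats> combine(List<Map<String, MoveStats>> perThreadRootStats) {
        Map<String, MoveStats> combined = new HashMap<>();
        for (Map<String, MoveStats> tree : perThreadRootStats) {
            for (Map.Entry<String, MoveStats> e : tree.entrySet()) {
                combined.merge(e.getKey(), e.getValue(),
                    (x, y) -> new MoveStats(x.visits + y.visits, x.totalScore + y.totalScore));
            }
        }
        return combined;
    }

    /** Assumed selection rule: the move with the most combined visits. */
    static String bestMove(Map<String, MoveStats> combined) {
        String best = null;
        long bestVisits = -1;
        for (Map.Entry<String, MoveStats> e : combined.entrySet()) {
            if (e.getValue().visits > bestVisits) {
                bestVisits = e.getValue().visits;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Because each tree is built independently, no locking is needed during search; the only shared step is this merge when the move must be submitted.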

Scalable Distributed Monte-Carlo Tree Search

Not finished yet. Key takeaway so far is that it's worth reading "A Lock-free Multithreaded Monte-Carlo Tree Search Algorithm".

Two other interesting points so far...

It might be worth not expanding children of a node until the node has had at least N visits (with suggested N=4 for Go). Issue #340 covers this.
Depth-first UCT is the main contribution of this paper. This means that updates only go back up the tree to the point that they would have made a difference to the selection path. The next select works its way down from that point. Results are aggregated when eventually propagated up. This considerably reduced contention at the root.
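The first of these points (delayed expansion) might look roughly like the sketch below. `LazyNode`, its fields and the placeholder `expand()` are hypothetical, and the sketch assumes that the node is expanded on its Nth visit; it is not taken from the paper or from Sancho.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node that defers creating children until it has N visits. */
final class LazyNode {
    static final int EXPANSION_THRESHOLD = 4;  // N=4, the value suggested for Go

    final int numLegalMoves;
    int visits;
    List<LazyNode> children;  // null until the node is expanded

    LazyNode(int numLegalMoves) {
        this.numLegalMoves = numLegalMoves;
    }

    /**
     * Record a visit. Returns true if the node is (now) expanded and the
     * caller should descend into a child, or false if the caller should
     * run a rollout directly from this node.
     */
    boolean visit() {
        visits++;
        if (children == null && visits >= EXPANSION_THRESHOLD) {
            expand();
        }
        return children != null;
    }

    /** Placeholder expansion: one child per legal move. */
    private void expand() {
        children = new ArrayList<>(numLegalMoves);
        for (int i = 0; i < numLegalMoves; i++) {
            children.add(new LazyNode(0));
        }
    }
}
```

The saving comes from never allocating children for nodes that are visited only a handful of times, which is the majority of nodes near the fringe of the tree.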

Performance results

August 2015

Performance results immediately after IGGPC15.

Running on ARR Linux server with...

java -XX:+UseG1GC -Xmx5g -d64 -jar Perf.jar -statemachine -gamesearcher -time10 -repeat3 -varythreads englishDraughts hex dotsAndBoxes speedChess checkers reversi cephalopodMicro nineBoardTicTacToe pentago connectFour

Y-axis is rollouts/s. X-axis is number of CPU-intensive threads. The results for 1 CPU-intensive thread are for state machine only. All other results are for the game searcher (running with 1 tree thread and N-1 rollout threads).

See #355 for analysis.
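The "1 tree thread and N-1 rollout threads" arrangement is a form of leaf parallelization. A heavily simplified sketch of that shape, using the standard `java.util.concurrent` executor API, is given below. The class name, the random stand-in for a rollout and the batch-then-collect structure are all assumptions for illustration; Sancho's real pipeline is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: one tree thread feeding N-1 rollout threads. */
final class LeafParallelSketch {
    /** Stand-in for a real rollout: a pseudo-random score in [0, 1). */
    static double rollout() {
        return ThreadLocalRandom.current().nextDouble();
    }

    /**
     * The calling thread plays the tree thread; (totalThreads - 1) workers
     * do rollouts. Requires totalThreads >= 2.
     * Returns {number of results applied, sum of scores}.
     */
    static double[] run(int totalThreads, int numRollouts) {
        ExecutorService workers = Executors.newFixedThreadPool(totalThreads - 1);
        try {
            Callable<Double> task = LeafParallelSketch::rollout;
            List<Future<Double>> pending = new ArrayList<>();
            for (int i = 0; i < numRollouts; i++) {
                // A real searcher would do select/expand here, on the tree thread.
                pending.add(workers.submit(task));
            }
            double applied = 0;
            double total = 0;
            for (Future<Double> f : pending) {
                total += f.get();  // updates happen only on the tree thread
                applied++;
            }
            return new double[] {applied, total};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }
}
```

Since only the tree thread touches the tree, no locking of nodes is needed; the cost is that the tree thread can become the bottleneck as the number of rollout threads grows.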

Running on ARR Windows desktop with...

-statemachine -gamesearcher -varythreads -time10 -repeat5 englishDraughts hex dotsAndBoxes speedChess checkers reversi cephalopodMicro nineBoardTicTacToe pentago connectFour

May 2015 - Sancho vs Galvanise

Performance snapshot from 15th May 2015.

State machine

Single-threaded, state machine only, performance of Sancho & Galvanise.

Game	Sancho	Galvanise	S % of G
Hex	1516	1374	110
Reversi	2158	1614	134
Speed Chess	2238	3269	68
Checkers	3902	4464	87
English Draughts	6146	5569	110
Pentago	9563	9179	104
Ceph Micro	12748	11410	112
Dots & Boxes	21707	21451	101
9-board TTT	29112	19949	146
C4	83860	78686	107
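The "S % of G" column appears to be Sancho's figure as a percentage of Galvanise's, rounded to the nearest whole number; every row above is consistent with that reading. A small sketch of the calculation (the class and method names are made up for the example):

```java
/** Hypothetical helper reproducing the "S % of G" column. */
final class RelativePerf {
    /** Sancho's figure as a percentage of Galvanise's, rounded to nearest. */
    static long percentOf(long sancho, long galvanise) {
        return Math.round(100.0 * sancho / galvanise);
    }
}
```

For example, `RelativePerf.percentOf(1516, 1374)` gives 110 for Hex and `RelativePerf.percentOf(2238, 3269)` gives 68 for Speed Chess, matching the table.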

Game searcher

In tests of the full (multi-threaded) game tree searcher on Dots & Boxes and Connect 4, Galvanise achieves 400 - 500% of Sancho's performance. Sancho was measured using 4 threads (1 tree thread + 3 rollout threads). Galvanise uses 1 tree thread + a variable number of rollout threads (3 for one game, 5 for the other).

Game	Sancho	Galvanise	S % of G
Hex	4580
Reversi	4901
Speed Chess	6058
Checkers	13913
English Draughts	19120
Pentago	23754
Ceph Micro	23246
Dots & Boxes	17921	90000	20
9-board TTT	23749
C4	35609	160000	22

Hardware specs

ARR Windows desktop

1 x Intel Core i5 3470, quad-core (no hyper-threading) with 8GB RAM. Passmark = 6,607.

ARR Windows laptop

My laptop seldom has inbound internet connectivity.

1 x Intel Core i7 3540M, dual-core with 8GB RAM. 4 threads (1 CPU * 2 cores/CPU * 2 threads/core). Passmark = 4,646.

ARR Linux server

I have occasional access to a Linux server, with no inbound internet connectivity.

2 x Intel Xeon L5638, hexa-core with 24GB RAM. Total system has 24 threads (2 CPUs * 6 cores/CPU * 2 threads/core). Passmark (1 CPU) = 5,956. Implicit 2-CPU score is c. 12K.

SD Windows desktop

1 x Intel Core i7 ????, quad-core with 24GB RAM. 8 threads (1 CPU * 4 cores/CPU * 2 threads/core). Passmark = ?,???.