[25.07.12] Mastering the game of Go with deep neural networks and tree search
Paper Reading Study Notes
General Information
- Paper Title: "Mastering the game of Go with deep neural networks and tree search" (AlphaGo). The discussion also covers its successors, "Mastering the Game of Go without Human Knowledge" (AlphaGo Zero) and "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play" (AlphaZero).
- Authors: David Silver, Aja Huang, Chris J. Maddison, et al.
- Published In: Nature
- Year: 2016 (The discussion covers the entire series from 2016-2018).
- Link: DOI: 10.1038/nature16961
- Date of Discussion: 2025.07.12
Summary
- Research Problem: To develop an AI that can master the game of Go, a long-standing grand challenge due to its enormous search space (~250¹⁵⁰ possible game sequences, from a branching factor of roughly 250 over a typical game length of roughly 150 moves) and the difficulty of evaluating board positions, and to defeat top-ranked human professional players.
- Key Contributions:
  - Introduced a novel algorithm, AlphaGo, that combines deep neural networks with Monte Carlo Tree Search (MCTS).
  - Demonstrated a pipeline that trains a policy network to select moves and a value network to evaluate positions.
  - Showcased a progression of models that became increasingly general and powerful:
    - AlphaGo: Used supervised learning (SL) on human expert games and reinforcement learning (RL) from self-play.
    - AlphaGo Zero: Eliminated the need for human data, learning purely through self-play, starting tabula rasa from random play. It also merged the policy and value networks into a single, more efficient network.
    - AlphaZero: Generalized the algorithm to master other games (chess and shogi) with no game-specific knowledge beyond the basic rules.
- Methodology/Approach:
  - Supervised Learning (SL) Policy Network: A 13-layer convolutional neural network was trained to predict human moves on 30 million positions from expert games on the KGS Go Server.
  - Reinforcement Learning (RL) Policy Network: The SL network was refined through RL, playing games against previous versions of itself to optimize for winning rather than merely mimicking human moves.
  - Value Network: A separate network was trained to predict the outcome (win/loss) from any given board position. To prevent overfitting, it was trained on 30 million distinct positions, each sampled from a separate self-play game, so that training states were not highly correlated.
  - Monte Carlo Tree Search (MCTS): During gameplay, MCTS is used for lookahead search. The policy network prunes the search space by suggesting promising moves, and the value network evaluates leaf nodes to truncate the search depth, making the search far more efficient than previous methods (see the sketches following this summary).
- Results: The original AlphaGo defeated European champion Fan Hui 5-0 and world champion Lee Sedol 4-1. The discussion notes that its successors, AlphaGo Zero and AlphaZero, achieved even higher levels of performance, decisively beating all previous versions and demonstrating a path to superhuman ability without human guidance.
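The training pipeline above can be condensed into three loss functions. Below is a minimal PyTorch sketch, with single linear layers and random tensors standing in for the deep convolutional networks and encoded board features; all names and shapes here are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Placeholders: single linear layers stand in for the deep convolutional networks,
# and random tensors stand in for encoded 19x19 board features.
policy_net = torch.nn.Linear(19 * 19, 19 * 19)   # logits over the 361 board points
value_net = torch.nn.Linear(19 * 19, 1)

board = torch.randn(8, 19 * 19)                   # batch of 8 encoded positions
played_move = torch.randint(0, 361, (8,))         # SL: the expert's move; RL: the move sampled in self-play
outcome = torch.randint(0, 2, (8, 1)).float() * 2 - 1   # final game result z in {-1, +1}

# 1. SL policy network: cross-entropy against human expert moves.
sl_loss = F.cross_entropy(policy_net(board), played_move)

# 2. RL policy network: REINFORCE-style update, weighting the played move's
#    log-probability by the game outcome z (baseline term omitted for brevity).
log_probs = F.log_softmax(policy_net(board), dim=-1)
chosen = log_probs.gather(1, played_move.unsqueeze(1))
rl_loss = -(chosen * outcome).mean()

# 3. Value network: regression of the predicted outcome toward z.
value_loss = F.mse_loss(torch.tanh(value_net(board)), outcome)
```

And a self-contained sketch of how the policy and value networks plug into MCTS (selection, expansion, evaluation, backup). This follows the rollout-free style discussed for AlphaGo Zero; the `DemoState` class, the stub networks, and the PUCT constant are placeholders assumed for illustration, not the paper's exact components.

```python
import math

class DemoState:
    """Tiny stand-in for a real Go position. The interface (legal_moves, play,
    is_terminal) is assumed for illustration; a real implementation would track
    stones, captures, and ko."""
    def __init__(self, depth=0):
        self.depth = depth

    def legal_moves(self):
        return ["a", "b", "c"]

    def play(self, move):
        return DemoState(self.depth + 1)

    def is_terminal(self):
        return self.depth >= 4

def policy_network(state):
    """Stub for the policy network: returns {move: prior probability}."""
    moves = state.legal_moves()
    return {m: 1.0 / len(moves) for m in moves}

def value_network(state):
    """Stub for the value network: estimated outcome in [-1, 1] for the player to move."""
    return 0.0

class Node:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior        # P(s, a): prior from the policy network
        self.children = {}        # move -> Node
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # W(s, a)

    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.0):
    """PUCT-style selection: prefer high value (Q) plus high prior and low visit count (U)."""
    total = sum(c.visit_count for c in node.children.values())
    def score(child):
        u = c_puct * child.prior * math.sqrt(total) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def run_mcts(root, num_simulations=100):
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1. Selection: descend the tree with the PUCT rule.
        while node.children:
            _, node = select_child(node)
            path.append(node)
        # 2. Expansion: the policy network prunes breadth by assigning priors to moves.
        if not node.state.is_terminal():
            for move, prior in policy_network(node.state).items():
                node.children[move] = Node(node.state.play(move), prior)
        # 3. Evaluation: the value network truncates depth (no deep rollout here;
        #    at a terminal state the true game result would be used instead).
        value = value_network(node.state)
        # 4. Backup: propagate the value up the path, flipping sign each ply (zero-sum game).
        for n in reversed(path):
            n.visit_count += 1
            n.value_sum += value
            value = -value
    # Play the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visit_count)[0]

print(run_mcts(Node(DemoState(), prior=1.0)))
```

The design point the discussion highlighted is visible here: the policy prior narrows the breadth of the search while the value estimate cuts its depth, so far fewer simulations are needed than with plain Monte Carlo rollouts.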
Discussion Points
- Strengths:
  - The core idea was novel and effectively combined existing concepts (deep learning, RL, MCTS) to solve a monumental problem (02:05).
  - The evolution of the algorithm towards greater simplicity and generality (from AlphaGo to AlphaZero) was a significant achievement. The models became less complex (e.g., single network, no rollouts) but more powerful (01:23, 31:29).
- Weaknesses:
  - The initial AlphaGo version was noted as being computationally inefficient and not well-optimized (19:23).
  - The nature of its "intelligence" is questionable; it relies on massive search rather than human-like intuition or abstract reasoning.
- Key Questions:
  - Is this true reasoning? The central philosophical question was whether AlphaGo's process constitutes "reasoning". The speaker contrasted its massive search with human cognition, suggesting it is a fundamentally different, non-human approach (40:49, 43:20).
  - Why can't LLMs play board games well? Given the advances in LLMs, the speaker questioned why they perform so poorly on structured, rule-based games compared to specialized algorithms like AlphaGo (44:15).
- Applications: The discussion pivoted from Go to broader AGI research, particularly the ARC (Abstraction and Reasoning Corpus) challenge, questioning if the AlphaGo paradigm offers insights for solving problems that require generalizing from a few examples (45:38).
- Connections: The speaker connected the findings to the limitations of current AI paradigms. They contrasted AlphaGo's search-based "brute force" with the pattern-matching nature of LLMs and the need for more abstract, generalizable reasoning, as explored in the ARC challenge.
Notes and Reflections
- Interesting Insights:
  - The models became simpler over time, suggesting that removing human priors (like expert moves) can unlock a higher performance ceiling (32:36).
  - A key technical detail was generating a massive, diverse dataset of 30 million positions (one sampled per self-play game) for training the value network, to prevent overfitting on the highly correlated states within a single game (12:17).
  - The use of board symmetries (rotation and reflection) to augment the training data 8-fold was a simple but effective technique in the original AlphaGo (21:30); see the sketch at the end of these notes.
- Lessons Learned: The speaker's perspective on the path to AGI evolved. They expressed growing skepticism towards approaches like program synthesis (related to ARC) and a reluctant acceptance that scaling might be the most viable, if imperfect, path forward ("It seems scaling is the answer, no matter how I look at it") (52:32). However, they still believe the current paradigm is fundamentally flawed and not sustainable.
- Future Directions: The discussion concluded with a desire to explore the gap between AI performance and human-like understanding. The speaker identified mechanistic interpretability as a crucial research area to understand why models work or fail, and to build systems that can reason more flexibly like humans (53:57).
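As referenced in the insights above, the 8-fold augmentation comes from the dihedral symmetries of the Go board (four rotations, each optionally mirrored). A minimal NumPy sketch, assuming a single 19×19 feature plane; in the real pipeline the move targets must be transformed with the same symmetry so inputs and labels stay consistent.

```python
import numpy as np

def dihedral_views(board):
    """Return the 8 symmetric views of a board plane: 4 rotations, each optionally mirrored."""
    views = []
    for k in range(4):
        rotated = np.rot90(board, k)
        views.append(rotated)
        views.append(np.fliplr(rotated))
    return views

# Example: one 19x19 position becomes 8 training examples.
position = np.zeros((19, 19), dtype=np.int8)
position[3, 16] = 1                 # a single black stone, for illustration
augmented = dihedral_views(position)
assert len(augmented) == 8
```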