[25.04.17] Solving Olympiad Geometry Without Human Demonstrations (AlphaGeometry) - Paper-Reading-Study/2025 GitHub Wiki

Paper Reading Study Notes

General Information

  • Paper Title: Solving Olympiad Geometry Without Human Demonstrations
  • Authors: Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He & Thang Luong
  • Published In: Nature
  • Year: 2024 (Published online: 17 January 2024)
  • Link: https://doi.org/10.1038/s41586-023-06747-5
  • Date of Discussion: 2025.04.17

Summary

  • Research Problem: Automated theorem proving for Olympiad-level geometry is challenging due to the scarcity of machine-verifiable human proofs (data bottleneck) and the difficulty of finding necessary auxiliary geometric constructions (infinite branching factor).
  • Key Contributions: Introduced AlphaGeometry, a neuro-symbolic system that trains a language model (LM) from scratch on 100 million synthetically generated geometry theorems and proofs, bypassing the need for human data. It significantly outperforms previous state-of-the-art methods on Olympiad problems and produces human-readable proofs.
  • Methodology/Approach: AlphaGeometry pairs a Transformer LM with a symbolic deduction engine (DD+AR: a deductive database plus algebraic rules) in a proof-search loop. The LM proposes potentially useful auxiliary constructions; the symbolic engine performs logical deduction and algebraic reasoning to verify steps and expand the proof state. Training data is generated synthetically by sampling random premises, running the symbolic engine to exhaustion, and tracing back from each derived fact to extract a minimal proof, which also identifies the auxiliary constructions that proof required. Proof search uses a parallelized beam search.
  • Results: Solved 25/30 recent Olympiad geometry problems (IMO-AG-30 benchmark), compared to 10/30 for the previous best (Wu's method) and 0/30 for unaided GPT-4. Showed robustness in ablation studies (less data, smaller beam size). Proofs were validated by a human expert.
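The data-generation pipeline described above (sample premises → deduce to closure → trace back a minimal proof → flag auxiliary points) can be sketched in miniature. Everything here is a toy stand-in: the facts, the two hand-written rules, and the function names are illustrative assumptions, not the paper's actual engine or API.

```python
# Toy facts are tuples: (predicate, point, point, ...).
# Toy RULES map a frozenset of premise facts to one new fact,
# standing in for the real DD+AR deduction engine.
RULES = [
    # midsegment rule (toy): midpoints M, N of AB, AC give MN parallel to BC
    (frozenset({("midpoint", "M", "A", "B"), ("midpoint", "N", "A", "C")}),
     ("parallel", "M", "N", "B", "C")),
    # parallel transfer (toy): MN parallel to BC and BC perpendicular to AH
    (frozenset({("parallel", "M", "N", "B", "C"), ("perp", "B", "C", "A", "H")}),
     ("perp", "M", "N", "A", "H")),
]

def deduction_closure(premises):
    """Forward-chain RULES to a fixed point, recording how each fact arose."""
    known = {p: None for p in premises}  # fact -> premises that derived it (None = given)
    changed = True
    while changed:
        changed = False
        for body, head in RULES:
            if head not in known and body.issubset(known):
                known[head] = body
                changed = True
    return known

def traceback(known, goal):
    """Walk derivations backwards to the minimal premises proving `goal`."""
    needed, stack = set(), [goal]
    while stack:
        fact = stack.pop()
        body = known[fact]
        if body is None:
            needed.add(fact)          # an original premise
        else:
            stack.extend(body)
    return needed

def auxiliary_points(minimal_premises, goal):
    """Points used by the minimal proof but absent from the goal statement
    are the 'auxiliary constructions' the LM must learn to propose."""
    used = {pt for fact in minimal_premises for pt in fact[1:]}
    return used - set(goal[1:])
```

Running this with the two midpoint facts plus an irrelevant premise shows both mechanisms: traceback drops the unused premise from the minimal proof, and point `A` (absent from the parallel-lines goal) is flagged as the auxiliary construction.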

Discussion Points

  • Strengths:
    • Highly effective despite the conceptual simplicity of the LM guiding a symbolic engine (Attendee 1: "Ockham's Razor").
    • Successfully overcomes the data scarcity problem via massive synthetic data generation (Both attendees).
    • Achieves state-of-the-art results, approaching human expert level (Paper results, discussion acknowledgment).
    • Avoids human biases by using purely synthetic data (Attendee 1).
    • Robust performance even with significantly reduced training data (20%) or search budget (beam size 8 vs 512) (Both attendees noted this).
    • The approach seems generalizable and potentially scalable (Attendee 1, Attendee 2).
  • Weaknesses:
    • Extremely high computational cost for data generation (est. 100k CPU core-hours, ~$500k) and search (10k CPUs per problem) (Attendee 1 calculation and discussion).
    • The core neuro-symbolic loop idea might not be entirely novel, resembling generate-and-verify paradigms (Attendee 1).
    • The full configuration might be "overkill" given strong results with reduced settings, suggesting inefficiency (Both attendees discussed ablations).
    • Raises the question of whether it achieves "true" creative reasoning or merely performs highly effective guided search (Attendee 1).
  • Key Questions:
    • How applicable is this synthetic data generation and neuro-symbolic framework to less formalized or different domains like the Abstraction and Reasoning Corpus (ARC)? (Attendee 1).
    • What are the main challenges in implementing the required components (definitions, sampler, symbolic engine, traceback) for new domains? (Attendee 1 identified sampler and engine as hard for ARC).
    • Can similar performance be achieved more efficiently? (Implicit in discussion of cost and overkill).
  • Applications:
    • Automated theorem proving in other mathematical fields (Paper, discussion).
    • Potential approach for AI challenges like ARC (Attendee 1).
    • General framework for problems requiring large search space exploration combined with rigorous verification (Attendee 2's "recipe" idea).
  • Connections:
    • System 1 (LM intuition/proposal) / System 2 (Symbolic verification/reasoning) analogy (Discussed by both attendees).
    • Reinforcement Learning / Self-Correction (generating data, learning from it) (Attendee 1).
    • Contrasted with traditional heuristic-based search methods (Paper, Attendee 2).
    • Fits a pattern of combining Transformers with symbolic verifiers/tools across AI domains (Attendee 2).
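The System 1 / System 2 split discussed above amounts to a simple loop: the LM scores candidate constructions, the symbolic engine deduces their consequences, and a beam keeps the highest-scoring states. The sketch below is a minimal, assumed version of that recipe with single-letter facts, a fixed proposal table in place of the Transformer, and a two-rule "engine"; none of it is the paper's actual implementation.

```python
# Toy rules for the symbolic engine: a set of known facts implies a new fact.
RULES = [({"a", "c"}, "e"), ({"e", "d"}, "G")]

def deduce(facts):
    """Toy symbolic engine: forward-chain RULES to a fixed point."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in RULES:
            if body <= facts and head not in facts:
                facts.add(head)
                changed = True
    return frozenset(facts)

def lm_propose(state):
    """Stand-in for the LM (System 1): candidate auxiliary constructions
    with log-probability scores. The real system samples these from a
    Transformer conditioned on the current proof state."""
    return [("c", -0.1), ("d", -0.3), ("x", -2.0)]

def beam_search(premises, goal, beam_size=2, max_steps=4):
    """Keep the beam_size highest-scoring proof states; a state is expanded
    by adding one proposed construction and re-running deduction (System 2)."""
    beam = [(0.0, deduce(premises))]  # (cumulative log-prob, known facts)
    for _ in range(max_steps):
        if any(goal in facts for _, facts in beam):
            return True
        candidates = []
        for score, facts in beam:
            for aux, logp in lm_propose(facts):
                if aux not in facts:  # only genuinely new constructions
                    candidates.append((score + logp, deduce(facts | {aux})))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return any(goal in facts for _, facts in beam)
```

Starting from `{"a", "b"}`, no rule fires until the LM proposes `c` (yielding `e`) and then `d` (yielding the goal `G`), which mirrors why unaided deduction stalls without auxiliary constructions and why the proposal model is the critical learned component.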

Notes and Reflections

  • Interesting Insights:
    • The sheer scale and cost of synthetic data generation employed (100M examples).
    • The LM effectively functions as a learned, general heuristic for the complex task of proposing auxiliary constructions.
    • The stark failure of general models like GPT-4 (0/30) underscores the need for specialized, grounded systems for such rigorous tasks.
    • The system rediscovered known theorems and even generalized an IMO theorem through its automated process.
  • Lessons Learned:
    • Large-scale synthetic data can be a viable alternative to human data in structured domains, overcoming data limitations.
    • Neuro-symbolic systems combining learned heuristics (LM) with symbolic reasoning/verification are powerful for complex search and reasoning tasks.
    • Careful problem definition and a capable symbolic engine are critical components for this approach.
  • Future Directions:
    • Apply the framework to other data-scarce, structured domains.
    • Investigate methods to improve the efficiency of data generation and proof search.
    • Enhance the symbolic engine with higher-level rules/lemmas to potentially improve performance and proof readability.
    • Further explore the nature of the reasoning learned – is it deep understanding or sophisticated pattern matching and search?