[25.05.17] Mathematical discoveries from program search with large language models
Paper Reading Study Notes
General Information
Paper Title: Mathematical discoveries from program search with large language models
Authors: Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli & Alhussein Fawzi
Published In: Nature
Year: 2024 (Published online 14 December 2023)
Link: https://doi.org/10.1038/s41586-023-06924-6
Date of Discussion: 2025.05.17
Summary
Research Problem: How to leverage Large Language Models (LLMs) to discover novel, verifiably correct programs for complex mathematical and algorithmic problems, particularly those that are "easy to evaluate" but "hard to solve."
Key Contributions:
Introduction of FunSearch, an evolutionary procedure pairing a pretrained LLM with a systematic evaluator.
Demonstration of FunSearch's ability to discover new, state-of-the-art constructions for the cap set problem (e.g., a larger cap set for n=8 and an improved asymptotic lower bound).
Discovery of new, improved heuristics for the online bin packing problem.
Emphasis on generating programs that describe how solutions are constructed, rather than raw solutions themselves, which yields interpretability and conciseness.
Methodology/Approach:
FunSearch starts from a user-provided "specification": an evaluate function that scores candidate solutions, plus an initial program "skeleton" in which a designated function (e.g., priority) is left trivial and evolved (a minimal sketch follows this list).
A pretrained LLM (Codey in this case) proposes creative modifications to the designated part of the program.
An "evaluator" scores these newly generated programs based on their performance on relevant inputs.
An "island-based" evolutionary algorithm manages a diverse database of programs. It samples promising programs (favoring higher-scoring and shorter ones) and uses "best-shot prompting" (feeding k=2 good examples, v0 and v1, back to the LLM) to guide further generation.
The system is designed to be distributed, asynchronous, and parallel, making it scalable and cost-effective.
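A minimal sketch of such a specification for the cap set problem, simplified from the paper's numpy-based skeleton: the evaluate/solve/priority structure follows the paper, but the greedy loop body here is illustrative rather than the paper's exact code.

```python
import itertools

def priority(el: tuple[int, ...], n: int) -> float:
    """The function FunSearch evolves; the trivial baseline scores all vectors equally."""
    return 0.0

def solve(n: int) -> list[tuple[int, ...]]:
    """Greedy construction: visit every vector of Z_3^n in descending priority order
    and keep each one that does not complete a line with two vectors already chosen."""
    vectors = sorted(itertools.product(range(3), repeat=n),
                     key=lambda v: priority(v, n), reverse=True)
    capset: list[tuple[int, ...]] = []
    blocked: set[tuple[int, ...]] = set()  # third points of lines through chosen pairs
    for v in vectors:
        if v in blocked:
            continue
        for a in capset:
            # a + v + third == 0 (mod 3), so `third` becomes forbidden.
            blocked.add(tuple((-(x + y)) % 3 for x, y in zip(a, v)))
        capset.append(v)
    return capset

def evaluate(n: int) -> int:
    """Score of a candidate priority function: the size of the cap set it produces."""
    return len(solve(n))
```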
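A hedged sketch of best-shot prompt assembly: the renaming to priority_v0/priority_v1 and the request to complete the next version follow the paper's description, while build_prompt itself and its string handling are assumptions.

```python
def build_prompt(sampled_programs: list[str]) -> str:
    """Assembles a best-shot prompt from k programs sampled from one island,
    already sorted by ascending score so the best example appears last."""
    parts = []
    for version, src in enumerate(sampled_programs):
        # Rename each sampled function: priority -> priority_v0, priority_v1, ...
        parts.append(src.replace("def priority(", f"def priority_v{version}("))
    # Header of the next version; the LLM's completion becomes the new candidate.
    parts.append(f"def priority_v{len(sampled_programs)}(el: tuple[int, ...], n: int) -> float:")
    return "\n\n".join(parts)
```

With k=2, the prompt ends with priority_v1 (the higher-scoring example) directly above the priority_v2 header the LLM is asked to complete.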
Results:
Cap Sets: Discovered a cap set of size 512 for n=8 (surpassing the previously known 496). Found new admissible sets leading to an improved lower bound on the cap set capacity (C ≥ 2.2202).
Bin Packing: Generated new heuristics for online bin packing that outperformed the standard first-fit and best-fit methods, especially on Weibull-distributed item sizes. The discovered heuristics typically placed items in less-full bins unless the item fit the best-fit bin very tightly (see the sketch after this list).
Ablation studies confirmed the importance of components like the evolutionary mechanism (islands) and the program skeleton.
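A hedged sketch of a heuristic in the spirit described above, assuming the paper's priority-function interface for online bin packing (an item plus the remaining capacities of open bins); the threshold and scores here are illustrative, not the paper's evolved code.

```python
import numpy as np

def priority(item: float, bins: np.ndarray) -> np.ndarray:
    """Toy heuristic in the spirit of the discovered ones (illustrative only).
    `bins` holds the remaining capacity of each open bin; the surrounding
    skeleton places the item into the feasible bin with the highest priority."""
    residual = bins - item               # capacity left after a hypothetical placement
    return np.where(
        residual <= 0.05 * item,         # near-perfect fit (the 5% threshold is made up)
        1000.0 - residual,               # tightest fits score highest of all
        bins.astype(float),              # otherwise prefer less-full bins (more capacity)
    )
```

The net effect is best-fit only when the fit is nearly exact; otherwise items go to roomier bins, avoiding awkward leftover gaps.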
Discussion Points
Strengths:
The core idea is relatively simple yet powerful (LLM + evaluator + evolution).
Generates interpretable and concise programs, facilitating human understanding and feedback.
Scalable and cost-effective due to its distributed architecture and ability to use off-the-shelf, even smaller, LLMs effectively.
Demonstrated ability to achieve novel, SOTA results in established open problems.
The evolutionary "island" model helps maintain diversity and avoid local optima.
The iterative nature ensures monotonic improvement in the best-found solution.
Weaknesses:
Relies on a human-provided program skeleton, which limits the search space and might introduce human bias.
The LLM, trained on human code, might have inherent biases that could hinder the discovery of truly "superhuman" or unconventional solutions.
Perceived as less "self-discovering" than systems like AlphaZero; more of an LLM-assisted, guided search.
Effectiveness is tied to problems having an efficient evaluator and providing "rich" (non-binary) scoring feedback.
Key Questions:
Why was the cap set problem chosen? Was it because it was perceived as less explored, or simply a good fit for the "easy-to-evaluate, hard-to-solve" paradigm? (The discussion leaned towards the latter, though one participant speculated about the former).
How significant is the LLM's human-like bias in practice? Can this approach truly transcend human intuition?
What is the precise rationale for the "best-shot prompting" structure (e.g., ordering priority_v0 before priority_v1 in the prompt)? (The paper sorts the sampled programs by ascending score, which puts the strongest example right before the header the LLM completes, but it does not spell out the rationale.)
The meaning of "4h" for island resets in the evolutionary method – is it literal hours or a hyperparameter for iterations/epochs? (Likely a hyperparameter).
Applications:
Discovering new mathematical constructions, conjectures, or theorems.
Finding improved heuristics for combinatorial optimization problems (e.g., bin packing, job scheduling, logistics).
Potentially any scientific or engineering problem where solutions can be expressed as programs and evaluated efficiently.
Connections:
Genetic Programming: FunSearch is a form of genetic programming where the LLM acts as a sophisticated, context-aware mutation and crossover operator.
AlphaZero/MuZero: Shares the concept of search guided by a learned component, but FunSearch operates in program space with a pre-trained LLM, unlike AlphaZero's self-play and model learning.
AlphaEvolve: Mentioned as a potential successor or related work that likely builds on FunSearch's principles.
Automated Theorem Proving: Related in goal, but FunSearch's requirement for rich scoring feedback makes it less suitable for typical theorem proving tasks.
Notes and Reflections
Interesting Insights:
The LLM doesn't necessarily need deep "understanding"; it functions effectively as a generator of syntactically plausible and occasionally creative program variations.
The "island" model and evolutionary pressures are crucial for navigating the search space and fostering diversity.
Searching in program space is key to achieving interpretable and generalizable solutions. The system implicitly favors solutions with lower Kolmogorov complexity (shorter, more elegant programs), partly due to explicit selection pressure for shorter code (see the sampling sketch after this list).
The specific choice of LLM (e.g., Codey vs. StarCoder) or even its size was not found to be overly critical for the reported results, suggesting the robustness of the overall FunSearch framework.
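A minimal sketch of how that selection pressure could look; the softmax temperature and the explicit length penalty are assumptions standing in for the paper's cluster-based sampling scheme.

```python
import math
import random

def sample_program(programs: list[tuple[str, float]], temperature: float = 0.1) -> str:
    """Softmax sampling over (source, score) pairs with a mild length penalty,
    so higher-scoring and, among near-ties, shorter programs are drawn more often.
    A simplification of the paper's cluster-based scheme."""
    adjusted = [score - 1e-4 * len(src) for src, score in programs]
    m = max(adjusted)  # subtract the max for numerical stability
    weights = [math.exp((a - m) / temperature) for a in adjusted]
    return random.choices([src for src, _ in programs], weights=weights, k=1)[0]
```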
Lessons Learned:
Combining LLMs with structured search methodologies like evolutionary algorithms can unlock capabilities beyond what either component can achieve in isolation.
A well-defined problem specification, including an efficient evaluator and a suitable program skeleton, is paramount.
The evolutionary management of candidate programs (diversity, selection, culling) is as important as the generation mechanism itself.
Future Directions:
Reducing the reliance on detailed human-provided skeletons to allow for more open-ended discovery.
Applying FunSearch to a broader range of scientific discovery problems.
Investigating methods to better harness the capabilities of more powerful LLMs within the FunSearch framework.
Exploring the generation of more complex program structures, rather than just modifying a small, predefined part.
Further investigation into the "AlphaEvolve" paper was suggested as a next step.