# IMO25: Iterative Problem Solving and Verification Using LLMs
https://github.com/lyang36/IMO25
This code implements an automated IMO (International Mathematical Olympiad) problem solver using OpenAI's API. Here's a comprehensive explanation:
```mermaid
graph TD
    A[Start: Read Problem File] --> B[Generate Initial Solution]
    B --> C[Self-Improvement Step]
    C --> D[Verify Solution]
    D --> E{Verification Passes?}
    E -->|Yes| F[Increment Correct Count]
    E -->|No| G[Increment Error Count<br/>Reset Correct Count]
    F --> H{Correct Count >= 5?}
    G --> I{Error Count >= 10?}
    H -->|Yes| J[✅ SUCCESS: Return Solution]
    H -->|No| K[Continue Verification Loop]
    I -->|Yes| L[❌ FAILURE: Return None]
    I -->|No| M[Generate Corrected Solution<br/>Using Bug Report]
    K --> D
    M --> D
    N{Max Iterations<br/>Reached?} --> O[❌ TIMEOUT FAILURE]
    style J fill:#90EE90
    style L fill:#FFB6C1
    style O fill:#FFB6C1
    subgraph "Verification Process"
        V1[Extract Detailed Solution]
        V2[Send to Verification Prompt]
        V3[Get Verification Response]
        V4[Check if Solution is Correct]
        V5[Generate Bug Report if Needed]
        V1 --> V2 --> V3 --> V4 --> V5
    end
    subgraph "API Interaction"
        API1[Build Request Payload]
        API2[Send HTTP Request]
        API3[Extract Response Text]
        API1 --> API2 --> API3
    end
```
The script uses an iterative approach with AI to solve complex mathematical problems by generating solutions, verifying them, and refining them until a correct solution is found.
- Uses OpenAI's o3 model (the model name is set to "gpt-5" in the configuration)
- Custom logging system that writes to both stdout and log files
- API key management through environment variables
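A minimal sketch of how these configuration pieces might fit together; the class name `TeeLogger` and the environment-variable name `OPENAI_API_KEY` are assumptions for illustration, not necessarily what the repository uses:

```python
import os
import sys
from datetime import datetime

class TeeLogger:
    """Hypothetical helper that writes each message to stdout and a log file."""
    def __init__(self, log_path):
        self.log_file = open(log_path, "a", encoding="utf-8")

    def log(self, message):
        line = f"[{datetime.now().isoformat()}] {message}"
        print(line)                       # console output
        self.log_file.write(line + "\n")  # persistent log
        self.log_file.flush()

# The API key is read from the environment rather than hard-coded.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    sys.exit("OPENAI_API_KEY is not set")
```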
The system uses several specialized prompts:
- `step1_prompt`: Instructs the AI to provide rigorous, complete mathematical solutions
- `verification_system_prompt`: Acts as an expert grader to verify solutions
- `self_improvement_prompt`: Asks the AI to review and improve its solution
- `correction_prompt`: Provides bug reports for solution refinement
```python
def build_request_payload(system_prompt, question_prompt, other_prompts=None)
def send_api_request(api_key, payload)
def extract_text_from_response(response_data)
```
These handle communication with OpenAI's o3 API, formatting requests and parsing responses.
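A hedged sketch of what these three helpers could look like if the script calls the OpenAI Chat Completions HTTP endpoint directly via `requests`; the endpoint, payload fields, and model string below follow the public API format and are assumptions rather than a copy of the repository's code:

```python
import requests

API_URL = "https://api.openai.com/v1/chat/completions"  # assumed endpoint

def build_request_payload(system_prompt, question_prompt, other_prompts=None):
    """Assemble the chat-style message list for one request."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question_prompt}]
    for extra in (other_prompts or []):
        messages.append({"role": "user", "content": extra})
    return {"model": "gpt-5", "messages": messages}

def send_api_request(api_key, payload):
    """POST the payload and return the parsed JSON, raising on HTTP errors."""
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    response = requests.post(API_URL, headers=headers, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()

def extract_text_from_response(response_data):
    """Pull the assistant's text out of the first choice of the response."""
    return response_data["choices"][0]["message"]["content"]
```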
```python
def verify_solution(problem_statement, solution, verbose=True)
```
- Extracts the detailed solution portion
- Sends it to a verification prompt that acts as an expert IMO grader
- Returns a bug report if issues are found
- Uses a secondary check to determine if the solution is truly correct
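Putting the steps above together, a simplified sketch of the verification routine, built on the API helpers sketched earlier (`extract_detailed_solution` is a hypothetical helper that isolates the "Detailed Solution" section):

```python
def verify_solution(problem_statement, solution, verbose=True):
    # 1. Keep only the "Detailed Solution" portion of the solver's output.
    detailed = extract_detailed_solution(solution)  # hypothetical helper

    # 2. Ask a separate model instance to grade it as an IMO expert.
    grading_input = (f"### Problem ###\n{problem_statement}\n\n"
                     f"### Solution ###\n{detailed}\n\n"
                     f"{verification_reminder}")
    payload = build_request_payload(verification_system_prompt, grading_input)
    report = extract_text_from_response(send_api_request(api_key, payload))
    if verbose:
        print(report)

    # 3. Secondary check: reduce the grader's report to a yes/no verdict.
    question = ('Response in "yes" or "no". Is the following statement saying the '
                'solution is correct, or does not contain critical error or a major '
                'justification gap?\n\n' + report)
    verdict = extract_text_from_response(
        send_api_request(api_key, build_request_payload("", question)))
    is_correct = verdict.strip().lower().startswith("yes")

    # The grader's report doubles as a bug report when the solution fails.
    return is_correct, report
```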
```python
def agent(problem_statement, other_prompts=[])
```
This orchestrates the entire solving process:
1. **Initial Exploration:** Generates the first solution attempt.
2. **Iterative Refinement Loop** (max 30 iterations):
   - If verification fails: generate a corrected solution using the bug report.
   - If verification passes: increment the success counter.
   - **Success condition:** 5 consecutive successful verifications.
   - **Failure condition:** 10 consecutive errors.
The system requires 5 consecutive successful verifications to ensure the solution is robust and not just lucky. This helps avoid false positives where a flawed solution might accidentally pass verification once.
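In outline, the consecutive-success / consecutive-error bookkeeping might look like this (a sketch assuming `verify_solution` returns a verdict plus a bug report, and that `generate_solution` and `correct_solution` wrap the solver prompts; both wrappers are sketched further below):

```python
def agent(problem_statement, other_prompts=[]):
    # Initial exploration: first attempt plus a self-improvement pass.
    solution = generate_solution(problem_statement, other_prompts)
    correct_count, error_count = 0, 0

    for _ in range(30):                    # max iterations
        is_correct, bug_report = verify_solution(problem_statement, solution)

        if is_correct:
            correct_count += 1
            error_count = 0
            if correct_count >= 5:         # robust success: 5 passes in a row
                return solution
        else:
            error_count += 1
            correct_count = 0
            if error_count >= 10:          # give up after 10 consecutive errors
                return None
            # Feed the grader's bug report back to the solver for a targeted fix.
            solution = correct_solution(problem_statement, solution, bug_report)

    return None                            # timed out without 5 consecutive passes
```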
```bash
python solver.py problem_statement.txt --log results.log --max_runs 10
```
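The command line above suggests an entry point roughly like the following; the defaults and help strings are assumptions, and only the flags `--log` and `--max_runs` come from the example:

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Iterative IMO problem solver")
    parser.add_argument("problem_file", help="path to the problem statement text file")
    parser.add_argument("--log", default="solver.log", help="log file path")
    parser.add_argument("--max_runs", type=int, default=10,
                        help="number of independent solver runs to attempt")
    args = parser.parse_args()

    with open(args.problem_file, encoding="utf-8") as f:
        problem_statement = f.read()

    for run in range(args.max_runs):
        if agent(problem_statement) is not None:
            print(f"Run {run + 1}: verified solution found")
            break
```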
- Rigorous Verification: Uses a separate AI instance as an "expert grader" to catch errors
- Iterative Improvement: Automatically refines solutions based on feedback
- Robust Success Detection: Requires multiple successful verifications
- Comprehensive Logging: Tracks all interactions and timestamps
- Error Handling: Graceful failure modes and retry mechanisms
The system essentially creates a feedback loop where an AI solver and AI verifier work together to produce mathematically rigorous solutions to complex problems.
| Prompt Name | Purpose | When Used | Key Requirements | Why Needed |
|---|---|---|---|---|
| `step1_prompt` | Main solution generation | Initial solution & corrections | - Rigor is paramount<br>- Complete solutions only<br>- TeX formatting<br>- Structured output (Summary + Detailed Solution) | Establishes the foundational requirements for mathematical rigor. Ensures solutions aren't just correct answers but properly justified proofs. |
| `self_improvement_prompt` | Solution refinement | After initial solution generation | - Review and correct errors<br>- Fill justification gaps<br>- Follow system prompt structure | Gives the AI a chance to catch its own mistakes before external verification, improving solution quality. |
| `verification_system_prompt` | Solution verification/grading | Every solution verification cycle | - Act as IMO expert grader<br>- Find all issues<br>- Classify errors vs. gaps<br>- Step-by-step verification | Creates an independent "judge" to rigorously evaluate solutions, preventing the solver from being overly lenient with itself. |
| `verification_reminder` | Verification task context | Appended to verification requests | - Generate summary and verification log<br>- Justify each step | Ensures the verification AI understands its specific task within the broader context. |
| `correction_prompt` | Bug report integration | When verification fails | - Address specific issues from bug report<br>- Add explanations to avoid misunderstandings<br>- Maintain system prompt compliance | Provides targeted feedback for improvement, allowing the solver to address specific weaknesses rather than starting over. |
| `check_verification_prompt` | Verification review | Post-verification assessment (commented out) | - Review findings for validity<br>- Distinguish genuine flaws from concise arguments | Would help catch overly strict verification, but is currently unused in the implementation. |
| Completeness Check | Solution validation | After solution generation | - Determine if solution claims completeness<br>- Simple yes/no response | Ensures the system only proceeds with solutions that claim to be complete, filtering out partial attempts. |
| Correctness Check | Verification interpretation | After each verification | - Interpret verification results<br>- Simple yes/no on solution correctness | Converts complex verification reports into binary decisions for the main logic loop. |
1. `step1_prompt` → AI generates the initial solution
2. `self_improvement_prompt` → AI refines its own work
3. The solution enters the verification cycle
4. `verification_system_prompt` + `verification_reminder` → an independent AI instance evaluates the solution
5. Correctness Check → binary interpretation of the verification report
6. `correction_prompt` (if needed) → targeted improvement using the bug report
7. Repeat until the success criteria are met
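Steps 1, 2, and 6 of this flow could be wrapped in two small helpers along these lines (a sketch using the prompt names above; the exact message layout is an assumption):

```python
def generate_solution(problem_statement, other_prompts=[]):
    """Initial attempt with step1_prompt, followed by one self-improvement pass."""
    payload = build_request_payload(step1_prompt, problem_statement, other_prompts)
    first_attempt = extract_text_from_response(send_api_request(api_key, payload))

    payload = build_request_payload(step1_prompt, problem_statement,
                                    [first_attempt, self_improvement_prompt])
    return extract_text_from_response(send_api_request(api_key, payload))

def correct_solution(problem_statement, solution, bug_report):
    """Resubmit the solution together with the grader's bug report for a targeted fix."""
    followup = [solution, correction_prompt + "\n\n" + bug_report]
    payload = build_request_payload(step1_prompt, problem_statement, followup)
    return extract_text_from_response(send_api_request(api_key, payload))
```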
- **Separation of Concerns:** The system uses different prompts for different AI "roles" (solver vs. grader) to avoid conflicts of interest.
- **Iterative Refinement:** Multiple prompts support a feedback loop where solutions are continuously improved based on rigorous evaluation.
- **Quality Gates:** Various check prompts ensure that only high-quality, complete solutions proceed through the pipeline.
- **Structured Output:** Prompts enforce consistent formatting, making it easier to parse and process results programmatically.
The prompts below are reproduced from the code; they are used throughout the system.

**`step1_prompt`** (the main solution-generation system prompt):
### Core Instructions ###
* **Rigor is Paramount:** Your primary goal is to produce a complete and rigorously justified solution. Every step in your solution must be logically sound and clearly explained. A correct final answer derived from flawed or incomplete reasoning is considered a failure.
* **Honesty About Completeness:** If you cannot find a complete solution, you must **not** guess or create a solution that appears correct but contains hidden flaws or justification gaps. Instead, you should present only significant partial results that you can rigorously prove. A partial result is considered significant if it represents a substantial advancement toward a full solution. Examples include:
* Proving a key lemma.
* Fully resolving one or more cases within a logically sound case-based proof.
* Establishing a critical property of the mathematical objects in the problem.
* For an optimization problem, proving an upper or lower bound without proving that this bound is achievable.
* **Use TeX for All Mathematics:** All mathematical variables, expressions, and relations must be enclosed in TeX delimiters (e.g., `Let $n$ be an integer.`).
### Output Format ###
Your response MUST be structured into the following sections, in this exact order.
**1. Summary**
Provide a concise overview of your findings. This section must contain two parts:
* **a. Verdict:** State clearly whether you have found a complete solution or a partial solution.
* **For a complete solution:** State the final answer, e.g., "I have successfully solved the problem. The final answer is..."
* **For a partial solution:** State the main rigorous conclusion(s) you were able to prove, e.g., "I have not found a complete solution, but I have rigorously proven that..."
* **b. Method Sketch:** Present a high-level, conceptual outline of your solution. This sketch should allow an expert to understand the logical flow of your argument without reading the full detail. It should include:
* A narrative of your overall strategy.
* The full and precise mathematical statements of any key lemmas or major intermediate results.
* If applicable, describe any key constructions or case splits that form the backbone of your argument.
**2. Detailed Solution**
Present the full, step-by-step mathematical proof. Each step must be logically justified and clearly explained. The level of detail should be sufficient for an expert to verify the correctness of your reasoning without needing to fill in any gaps. This section must contain ONLY the complete, rigorous proof, free of any internal commentary, alternative approaches, or failed attempts.
### Self-Correction Instruction ###
Before finalizing your output, carefully review your "Method Sketch" and "Detailed Solution" to ensure they are clean, rigorous, and strictly adhere to all instructions provided above. Verify that every statement contributes directly to the final, coherent mathematical argument.
**`self_improvement_prompt`:**

You have an opportunity to improve your solution. Please review your solution carefully. Correct errors and fill justification gaps if any. Your second round of output should strictly follow the instructions in the system prompt.
**`check_verification_prompt`:**

Can you carefully review each item in your list of findings? Are they valid or overly strict? An expert grader must be able to distinguish between a genuine flaw and a concise argument that is nonetheless sound, and to correct their own assessment when necessary.
If you feel that modifications to any item or its justification is necessary. Please produce a new list. In your final output, please directly start with **Summary** (no need to justify the new list).
**`correction_prompt`:**

Below is the bug report. If you agree with certain item in it, can you improve your solution so that it is complete and rigorous? Note that the evaluator who generates the bug report can misunderstand your solution and thus make mistakes. If you do not agree with certain item in the bug report, please add some detailed explanations to avoid such misunderstanding. Your new solution should strictly follow the instructions in the system prompt.
**`verification_system_prompt`:**

You are an expert mathematician and a meticulous grader for an International Mathematical Olympiad (IMO) level exam. Your primary task is to rigorously verify the provided mathematical solution. A solution is to be judged correct **only if every step is rigorously justified.** A solution that arrives at a correct final answer through flawed reasoning, educated guesses, or with gaps in its arguments must be flagged as incorrect or incomplete.
### Instructions ###
**1. Core Instructions**
* Your sole task is to find and report all issues in the provided solution. You must act as a **verifier**, NOT a solver. **Do NOT attempt to correct the errors or fill the gaps you find.**
* You must perform a **step-by-step** check of the entire solution. This analysis will be presented in a **Detailed Verification Log**, where you justify your assessment of each step: for correct steps, a brief justification suffices; for steps with errors or gaps, you must provide a detailed explanation.
**2. How to Handle Issues in the Solution**
When you identify an issue in a step, you MUST first classify it into one of the following two categories and then follow the specified procedure.
* **a. Critical Error:**
This is any error that breaks the logical chain of the proof. This includes both **logical fallacies** (e.g., claiming that `A>B, C>D` implies `A-C>B-D`) and **factual errors** (e.g., a calculation error like `2+3=6`).
* **Procedure:**
* Explain the specific error and state that it **invalidates the current line of reasoning**.
* Do NOT check any further steps that rely on this error.
* You MUST, however, scan the rest of the solution to identify and verify any fully independent parts. For example, if a proof is split into multiple cases, an error in one case does not prevent you from checking the other cases.
* **b. Justification Gap:**
This is for steps where the conclusion may be correct, but the provided argument is incomplete, hand-wavy, or lacks sufficient rigor.
* **Procedure:**
* Explain the gap in the justification.
* State that you will **assume the step's conclusion is true** for the sake of argument.
* Then, proceed to verify all subsequent steps to check if the remainder of the argument is sound.
**3. Output Format**
Your response MUST be structured into two main sections: a **Summary** followed by the **Detailed Verification Log**.
* **a. Summary**
This section MUST be at the very beginning of your response. It must contain two components:
* **Final Verdict**: A single, clear sentence declaring the overall validity of the solution. For example: "The solution is correct," "The solution contains a Critical Error and is therefore invalid," or "The solution's approach is viable but contains several Justification Gaps."
* **List of Findings**: A bulleted list that summarizes **every** issue you discovered. For each finding, you must provide:
* **Location:** A direct quote of the key phrase or equation where the issue occurs.
* **Issue:** A brief description of the problem and its classification (**Critical Error** or **Justification Gap**).
* **b. Detailed Verification Log**
Following the summary, provide the full, step-by-step verification log as defined in the Core Instructions. When you refer to a specific part of the solution, **quote the relevant text** to make your reference clear before providing your detailed analysis of that part.
**Example of the Required Summary Format**
*This is a generic example to illustrate the required format. Your findings must be based on the actual solution provided below.*
**Final Verdict:** The solution is **invalid** because it contains a Critical Error.
**List of Findings:**
* **Location:** "By interchanging the limit and the integral, we get..."
* **Issue:** Justification Gap - The solution interchanges a limit and an integral without providing justification, such as proving uniform convergence.
* **Location:** "From $A > B$ and $C > D$, it follows that $A-C > B-D$"
* **Issue:** Critical Error - This step is a logical fallacy. Subtracting inequalities in this manner is not a valid mathematical operation.
**`verification_reminder`:**

### Verification Task Reminder ###
Your task is to act as an IMO grader. Now, generate the **summary** and the **step-by-step verification log** for the solution above. In your log, justify each correct step and explain in detail any errors or justification gaps you find, as specified in the instructions above.
The completeness and correctness check prompts appear in the code as:

```python
check_complete_prompt = f"""
Is the following text claiming that the solution is complete?
==========================================================
{solution}
==========================================================
Response in exactly "yes" or "no". No other words.
"""

check_correctness = """Response in "yes" or "no". Is the following statement saying the solution is correct, or does not contain critical error or a major justification gap?""" \
    + "\n\n" + out
```
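Both yes/no prompts are easy to reduce to booleans for the control loop; a small hedged helper might look like this (built on the API helpers sketched earlier):

```python
def is_yes(api_key, question_prompt):
    """Ask the model a yes/no question and parse the one-word reply (sketch)."""
    payload = build_request_payload("", question_prompt)
    reply = extract_text_from_response(send_api_request(api_key, payload))
    return reply.strip().lower().startswith("yes")

# e.g. solution_is_complete = is_yes(api_key, check_complete_prompt)
#      solution_is_correct  = is_yes(api_key, check_correctness)
```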
The input assembled for each verification call combines the problem, the extracted detailed solution, and the verification reminder:

```python
newst = f"""
======================================================================
### Problem ###
{problem_statement}
======================================================================
### Solution ###
{dsol}
{verification_remider}
"""
```
In summary, the code implements an iterative mathematical problem-solving system that uses OpenAI's API to tackle IMO-level problems through multiple rounds of verification and correction.