Agent S:example - chunhualiao/public-docs GitHub Wiki

Concrete Example Walkthrough of Agent S

Let's walk through a concrete example to illustrate how Agent S works, incorporating the technical details described in the paper.

User Task (Tu):
"In LibreOffice Calc, open the file /home/user/Documents/report.ods. Find the sum of column C (cells C2 to C50) and put the result in cell D1. Then, make the text in D1 bold."
Initial Environment:
A standard Ubuntu desktop view. The file report.ods exists in the specified directory. LibreOffice Calc is not yet open.

ACI captures:
- Screenshot of the desktop.
- Accessibility Tree containing interactable elements (e.g., Files icon, Documents folder icon).
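The two things the ACI captures can be pictured as one observation record. The sketch below is only illustrative: the class names, field names, and the `"icon"` role strings are assumptions made for this example, not the data model used by Agent S.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UIElement:
    """One interactable node from the accessibility tree (hypothetical shape)."""
    element_id: int
    role: str
    name: str


@dataclass
class Observation:
    """What the ACI hands to the agent at one step: pixels plus structure."""
    screenshot_png: bytes
    elements: List[UIElement] = field(default_factory=list)

    def find(self, name: str) -> Optional[UIElement]:
        """Return the first element whose name matches exactly, else None."""
        for element in self.elements:
            if element.name == name:
                return element
        return None


# O0 for the walkthrough: the desktop before LibreOffice Calc is open.
o0 = Observation(
    screenshot_png=b"",  # placeholder; a real capture would hold image bytes
    elements=[
        UIElement(0, "icon", "Files"),
        UIElement(1, "icon", "Documents"),
    ],
)
```

Looking an element up by name (`o0.find("Files")`) is what lets a later action refer to "the Files icon" rather than to raw pixel coordinates.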

Observation-Aware Query:
- Q = LLM(Tu, O0) → Formulate a retrieval query from the user task Tu and the initial observation O0.
External Knowledge Retrieval:
- Kweb = Retrieve(Web, Q) → Web-sourced instructions, e.g., for opening a file, using the SUM formula, and bolding text.
Internal Narrative Memory Retrieval:
- En = Retrieve(Mn, Q) → Abstract summary of similar full tasks.
Fusion & Subtask Planning:
- Kfused = LLM(En, Kweb)
- {(s0, Cs0), (s1, Cs1), (s2, Cs2)} = MLLM(Kfused)
  (where each si is a subtask and Csi is its associated context)
- Subtasks:
  - s0: Open /home/user/Documents/report.ods using LibreOffice Calc.
  - s1: Calculate the sum of C2 to C50 into D1.
  - s2: Make the text in D1 bold.
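The planning steps above compose into a short pipeline. The following is a minimal sketch under stated assumptions: every callable (`formulate_query`, `retrieve_web`, `retrieve_narrative`, `fuse`, `plan`) is a hypothetical stand-in for a model or retrieval call, and the lambdas return canned strings rather than real model output.

```python
from typing import Callable, List, Tuple

Subtask = Tuple[str, str]  # (s_i, Cs_i): subtask description and its context


def plan_subtasks(
    task: str,
    observation: str,
    formulate_query: Callable[[str, str], str],
    retrieve_web: Callable[[str], str],
    retrieve_narrative: Callable[[str], str],
    fuse: Callable[[str, str], str],
    plan: Callable[[str], List[Subtask]],
) -> List[Subtask]:
    """Chain the manager's steps in the order given in the walkthrough."""
    query = formulate_query(task, observation)  # Q = LLM(Tu, O0)
    k_web = retrieve_web(query)                 # Kweb = Retrieve(Web, Q)
    e_n = retrieve_narrative(query)             # En = Retrieve(Mn, Q)
    k_fused = fuse(e_n, k_web)                  # Kfused = LLM(En, Kweb)
    return plan(k_fused)                        # subtasks = MLLM(Kfused)


# Stubs standing in for the model and retrieval calls (assumptions).
subtasks = plan_subtasks(
    task="Sum C2:C50 into D1 in report.ods, then bold D1.",
    observation="Ubuntu desktop; LibreOffice Calc not open",
    formulate_query=lambda t, o: f"How to: {t}",
    retrieve_web=lambda q: "web: open file; =SUM(range); Bold button",
    retrieve_narrative=lambda q: "narrative: open app, edit cell, format cell",
    fuse=lambda e_n, k_web: f"{e_n} | {k_web}",
    plan=lambda k: [
        ("Open /home/user/Documents/report.ods using LibreOffice Calc.", k),
        ("Calculate the sum of C2 to C50 into D1.", k),
        ("Make the text in D1 bold.", k),
    ],
)
```

Passing the model calls in as parameters keeps the data flow (Q, Kweb, En, Kfused, subtasks) visible and makes the sketch runnable without any model backend.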

Episodic Memory Retrieval:
- Es0 = Retrieve(Me, (Tu, s0, Cs0))
Trajectory Reflection:
- The trajectory reflector observes the execution of s0. At the first step there is no trajectory yet, so no reflection is produced.
Action Generation & Execution:
1. a0: Click Files icon → O1
2. a1: Click Documents folder → O2
3. a2: Double-click report.ods → O3 (LibreOffice Calc opens)
Subtask Completion:
- Worker w0 sees report.ods open. Signals DONE.
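The worker's cycle for s0 (retrieve experience, reflect, act, observe, stop on DONE) can be sketched as a loop. This is an illustration under assumptions: `ScriptedDesktop`, `run_subtask`, `scripted_policy`, the `max_steps` budget, and the `"FAIL"` fallback are all hypothetical names and choices for this example, and the policy simply replays a0 to a2 instead of calling a model.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (action, resulting observation)


class ScriptedDesktop:
    """Toy environment: every action yields the next numbered observation."""

    def __init__(self) -> None:
        self.step_count = 0

    def observe(self) -> str:
        return f"O{self.step_count}"

    def step(self, action: str) -> str:
        self.step_count += 1
        return self.observe()


def run_subtask(task, subtask, context, env, retrieve_episodic,
                reflect, choose_action, max_steps=10):
    """Run one subtask until the policy signals DONE or the budget runs out."""
    experience = retrieve_episodic((task, subtask, context))  # Es = Retrieve(Me, ...)
    trajectory: List[Step] = []
    observation = env.observe()
    for _ in range(max_steps):
        # No reflection on the first step: there is no trajectory to reflect on.
        reflection = reflect(trajectory) if trajectory else None
        action = choose_action(subtask, observation, experience, reflection)
        if action == "DONE":
            return "DONE", trajectory
        observation = env.step(action)
        trajectory.append((action, observation))
    return "FAIL", trajectory  # assumed fallback when the step budget is exhausted


script = ["Click Files icon", "Click Documents folder", "Double-click report.ods"]


def scripted_policy(subtask, observation, experience, reflection):
    """Replay a0..a2, then signal DONE once O3 (Calc open) is observed."""
    index = int(observation[1:])
    return script[index] if index < len(script) else "DONE"


status, trajectory = run_subtask(
    task="Sum C2:C50 into D1, then bold D1.",
    subtask="Open /home/user/Documents/report.ods using LibreOffice Calc.",
    context="",
    env=ScriptedDesktop(),
    retrieve_episodic=lambda key: "similar past subtask: open a file via Files",
    reflect=lambda traj: f"{len(traj)} step(s) so far, on track",
    choose_action=scripted_policy,
)
```

The same loop would serve s1 and s2 with a different subtask, context, and retrieved experience.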

Episodic Memory Retrieval:
- Es1 = Retrieve(Me, (Tu, s1, Cs1))
Action Generation & Execution:
1. a3: Click D1 cell → O4
2. a4: Type =SUM(C2:C50) and press Enter → O5
Subtask Completion:
- Worker w1 sees the sum in D1. Signals DONE.
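As a sanity check on the formula typed in a4: the range C2:C50 is inclusive at both ends, so it covers 49 cells and leaves C1 out. The helper below is a hypothetical illustration written for this example (it is not part of Agent S or LibreOffice) and handles single-column ranges only.

```python
import re
from typing import List


def expand_column_range(cell_range: str) -> List[str]:
    """Expand a single-column range such as 'C2:C50' into its cell names."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a cell range: {cell_range!r}")
    start_col, start_row, end_col, end_row = match.groups()
    if start_col != end_col:
        raise ValueError("only single-column ranges are handled in this sketch")
    # Spreadsheet ranges include both endpoints, hence the +1.
    return [f"{start_col}{row}" for row in range(int(start_row), int(end_row) + 1)]


cells = expand_column_range("C2:C50")
```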

Episodic Memory Retrieval:
- Es2 = Retrieve(Me, (Tu, s2, Cs2))
Action Generation & Execution:
1. a5: Click D1 to re-select it if it is no longer selected (pressing Enter in a4 moves the cursor to D2 by default) → O6
2. a6: Click Bold button → O7
Subtask Completion:
- Worker w2 sees D1 text is bold. Signals DONE.

Manager plans high-level subtasks using web knowledge and narrative memory.
Workers execute subtasks step-by-step using episodic memory and reflection.
ACI handles perception and action execution.
Self-Evaluator updates both episodic and narrative memories based on task outcomes.
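The Self-Evaluator's memory update can be sketched as writing two kinds of summary: one per completed subtask (episodic) and one for the whole task (narrative). The `Memory` class, the dictionary keys, and `record_experience` are assumptions made for this illustration, and `summarize` stands in for whatever model call produces the summaries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Hypothetical two-part store mirroring Mn and Me in the walkthrough."""
    narrative: Dict[str, str] = field(default_factory=dict)             # Mn: task -> summary
    episodic: Dict[Tuple[str, str], str] = field(default_factory=dict)  # Me: (task, subtask) -> summary


def record_experience(
    memory: Memory,
    task: str,
    completed: List[Tuple[str, List[str]]],
    summarize: Callable[[List[str]], str],
) -> None:
    """Store a step-level summary per subtask and an abstract one per task."""
    for subtask, actions in completed:
        memory.episodic[(task, subtask)] = summarize(actions)
    # The narrative entry abstracts over subtasks and omits individual actions.
    memory.narrative[task] = summarize([subtask for subtask, _ in completed])


memory = Memory()
record_experience(
    memory,
    task="Sum C2:C50 into D1, then bold D1.",
    completed=[
        ("Open report.ods", ["Click Files icon", "Click Documents folder",
                             "Double-click report.ods"]),
        ("Sum C2:C50 into D1", ["Click D1 cell", "Type =SUM(C2:C50) and press Enter"]),
        ("Bold D1", ["Re-select D1", "Click Bold button"]),
    ],
    summarize=lambda items: " -> ".join(items),  # stand-in for a model summary
)
```

On a later, similar task, the narrative entry would feed the Manager's planning and the episodic entries would feed each Worker's retrieval.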