Agent S
Source: https://deepwiki.com/simular-ai/Agent-S
Agent S: Overview and Workflow
Agent S is designed as an agentic framework to interact with computer GUIs (Graphical User Interfaces) to perform complex, multi-step tasks, similar to how a human uses a computer. Its core components and workflow are illustrated in Figure 3 (page 4) and described throughout Section 3.
Goal
- To automate diverse desktop tasks by directly controlling the mouse and keyboard via the GUI.
Core Strategy
- Experience-Augmented Hierarchical Planning: Breaks down complex tasks into smaller subtasks, and uses both external knowledge and internal past experiences (memory) to figure out how to perform them.
Main Components and Workflow
Input
- User task (Tu) and initial computer screen observation (O0).
Manager Module
- Receives the task and observation.
- Retrieval:
- External Knowledge: Performs an Online Web Search (like Perplexica) to get up-to-date instructions (Section 3.1.1).
- Internal Experience: Queries its Narrative Memory (Mn) for past similar tasks (Section 3.1.1).
- Fusion & Planning:
- Fuses external (Kweb) and internal (En) knowledge using an LLM.
- Subtask Planner generates a sequence of subtasks (s0, ..., sn) with associated context (Csi) (Section 3.1.1).
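The Manager's fuse-and-plan flow can be summarized in code. The following is a minimal sketch, not the authors' implementation: `call_llm`, `web_search`, the dictionary used for Narrative Memory, and the "subtask | context" output format are all hypothetical stand-ins.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the underlying (M)LLM call."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical wrapper around the online search tool (e.g., Perplexica)."""
    raise NotImplementedError

@dataclass
class Subtask:
    description: str   # s_i
    context: str       # Cs_i

def manager_plan(task: str, observation: str,
                 narrative_memory: dict[str, str]) -> list[Subtask]:
    # External knowledge: up-to-date instructions from an online search (K_web).
    k_web = web_search(task)
    # Internal experience: a similar full-task summary from Narrative Memory (E_n).
    e_n = narrative_memory.get(task, "")
    # Fuse both knowledge sources into one task-specific guideline.
    fused = call_llm(f"Task: {task}\nWeb knowledge: {k_web}\n"
                     f"Past experience: {e_n}\nFuse into one plan outline.")
    # Subtask Planner: decompose the fused plan into subtasks s_0..s_n with context Cs_i.
    raw_plan = call_llm(f"Observation: {observation}\nPlan outline: {fused}\n"
                        "List subtasks, one per line, as 'subtask | context'.")
    subtasks = []
    for line in raw_plan.splitlines():
        if "|" in line:
            s, c = line.split("|", 1)
            subtasks.append(Subtask(s.strip(), c.strip()))
    return subtasks
```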
Worker Modules
- Each subtask (si) is assigned to a Worker (wi) for execution (Section 3.1.2).
- Retrieval & Reflection:
- Queries Episodic Memory (Me) for step-by-step past experiences (Esi).
- Uses a Trajectory Reflector (TRi) for real-time advice during execution.
- Action Generation:
- Based on past experience (Esi), current observation, and reflection, decides next action (at) (e.g., click, type) using a Chain-of-Thought process.
- Subtask Completion:
- Executes actions until the subtask is completed, then signals 'DONE' or 'FAIL'.
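Putting the Worker steps together, a minimal sketch of the execution loop looks like the following. `call_llm`, the dictionary Episodic Memory, and the `observe`/`execute` callbacks are hypothetical placeholders; the actual prompts and stopping logic are described in the paper.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call."""
    raise NotImplementedError

def worker_run(subtask: str, context: str, episodic_memory: dict[str, str],
               observe, execute, max_steps: int = 15) -> str:
    """Execute one subtask; returns 'DONE' or 'FAIL'.

    observe() -> str   : current screen observation O_t (screenshot + a11y-tree text)
    execute(action)    : grounds one primitive action through the ACI
    """
    # Retrieval: similar subtask experience E_si from Episodic Memory (M_e).
    e_si = episodic_memory.get(subtask, "")
    history: list[str] = []
    for _ in range(max_steps):
        o_t = observe()
        # Trajectory Reflector TR_i: advice based on the trajectory so far.
        reflection = call_llm(f"Subtask: {subtask}\nTrajectory: {history}\n"
                              "Give corrective advice if the agent seems stuck.")
        # Chain-of-Thought action generation: reason first, then emit one action.
        a_t = call_llm(f"Subtask: {subtask}\nContext: {context}\n"
                       f"Past experience: {e_si}\nObservation: {o_t}\n"
                       f"Reflection: {reflection}\n"
                       "Think step by step, then output ONE action "
                       "(e.g., click(41), type('hello')), or DONE/FAIL.")
        if a_t.strip() in ("DONE", "FAIL"):
            return a_t.strip()
        execute(a_t)
        history.append(a_t)
    return "FAIL"
```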
Agent-Computer Interface (ACI)
- Perception:
- Captures screenshots and an augmented Accessibility Tree (UI element structure and locations, enhanced with OCR).
- Action Execution:
- Translates primitive actions into real mouse/keyboard commands.
- Uses a constrained action space (click, type, scroll) for safety and immediate feedback (Table 5, Appendix A.1).
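As a rough illustration of how such a constrained action space can be grounded, the sketch below maps a few primitive actions onto real mouse/keyboard commands via pyautogui. The element table and the exact action names are assumptions for illustration; the action set Agent S actually exposes is listed in Table 5 of the paper.

```python
import pyautogui  # one possible backend for executing grounded GUI actions

# Hypothetical element table built from the accessibility tree:
# element id -> (center_x, center_y) screen coordinates.
ELEMENTS = {41: (512, 384), 42: (100, 200)}

def click(element_id: int) -> None:
    """Translate a symbolic click on an a11y-tree element into a real mouse click."""
    x, y = ELEMENTS[element_id]
    pyautogui.click(x, y)

def type_text(text: str) -> None:
    """Type text at the current keyboard focus."""
    pyautogui.write(text, interval=0.02)

def scroll(amount: int) -> None:
    """Scroll the active window; positive scrolls up, negative scrolls down."""
    pyautogui.scroll(amount)
```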
Self-Evaluation and Memory Update
- Episodic Update:
- Upon subtask completion, summarizes the strategy (Rsi) and updates Episodic Memory (Me) (Section 3.1.3, Appendix C.3).
- Narrative Update:
- Upon full task completion, summarizes the entire experience (Enu) and updates Narrative Memory (Mn) (Section 3.1.3, Appendix C.3).
- Continual Learning:
- Feedback loop enables learning from successes and failures.
- Initial memory built through self-supervised exploration (Section 3.2, Figure 4).
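The two update rules can be sketched as a pair of small functions. This is a simplified illustration: the dictionary-backed memories and the `call_llm` helper are assumptions, not the paper's actual storage format.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call used by the Self-Evaluator."""
    raise NotImplementedError

def update_episodic(episodic_memory: dict[str, str], subtask: str,
                    trajectory: list[str]) -> None:
    """After a subtask finishes with DONE, store a summarized strategy (R_si)."""
    r_si = call_llm("Summarize the successful strategy for this subtask as "
                    f"reusable steps.\nSubtask: {subtask}\nActions: {trajectory}")
    episodic_memory[subtask] = r_si

def update_narrative(narrative_memory: dict[str, str], task: str,
                     subtask_summaries: list[str]) -> None:
    """After the whole task ends, store an abstractive full-task summary."""
    summary = call_llm("Summarize the overall strategy, omitting low-level "
                       f"actions.\nTask: {task}\nSubtasks: {subtask_summaries}")
    narrative_memory[task] = summary
```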
Essence of Agent S
Agent S operates hierarchically:
- The Manager plans high-level subtasks using web knowledge and narrative memory.
- Workers execute subtasks using episodic memory and real-time reflection.
- ACI bridges decisions to GUI actions.
- Self-Evaluator enables continual improvement by memory updates.
Ablation Studies on Agent S Design Choices
The authors explicitly studied whether the design choices were necessary through ablation studies, presented in Section 4.3 (Ablation Study) and Appendices A.2 and A.3 of the paper.
They systematically removed or modified key components of the Agent S framework to observe the impact on performance using a subset of the OSWorld benchmark (test_sub). Here's a summary of what they investigated and found:
Learning from Experience Components (Table 2, Figure 6)
- Web Knowledge: Removing external web search significantly reduced performance, showing the importance of up-to-date, general knowledge.
- Narrative Memory: Removing retrieval of full-task experiences led to a noticeable drop in performance.
- Episodic Memory: Removing retrieval of subtask-specific experiences also degraded performance.
- Removing All Experience: When all three learning sources (Web, Narrative, Episodic) were removed, the performance dropped drastically, becoming only slightly better than the baseline OSWorld agent. This highlights the critical role of the overall experience-augmented approach.
Agent-Computer Interface (ACI) (Figure 6, Appendix Table 6)
- They compared the baseline OSWorld agent to Agent S versions with and without the ACI and retrieval components.
- The results show that the ACI significantly enhances performance, particularly when combined with the experiential learning (retrieval) components.
- It demonstrates ACI's effectiveness in improving grounding and enabling better agentic learning compared to the baseline interface.
Hierarchical Planning (Section 4.3 text, comparing results in Figure 6)
- The study implicitly tested the hierarchical structure by comparing the full Agent S (which includes hierarchy) against a version with ACI and Experiential Learning but without the Manager/Worker hierarchical split (labeled "Agent S (ACI + Retrieval)").
- The full Agent S performed significantly better (26.15% vs 20.00% on test_sub), underscoring the importance of hierarchical planning for breaking down complex, long-horizon tasks, especially when combined with the learned experience components.
Memory Construction & Update Mechanisms (Figure 7, Appendix Table 7)
- Self-supervised Exploration: Removing the initial exploration phase (where the agent builds initial memory) resulted in a major performance drop, indicating the necessity of bootstrapping the memory.
- Continual Memory Update: Removing the ability to learn from new tasks during inference (only using the initial exploration memory) also reduced performance, showing the value of ongoing learning.
- Self-Evaluator: Replacing the summarized experiences (generated by the Self-Evaluator) with the original full trajectories for memory storage led to lower performance, demonstrating the benefit of using concise, abstracted summaries for learning.
Conclusion
The ablation studies performed by the authors provide strong evidence that the major components of Agent S – the experience-augmented retrieval (Web, Narrative, Episodic), the Agent-Computer Interface, the Hierarchical Planning structure, and the specific memory construction/update mechanisms (Exploration, Continual Update, Self-Evaluation) – are indeed necessary and contribute positively to the framework's overall effectiveness and state-of-the-art performance. The complexity appears justified by the performance gains achieved through these integrated components.
Difference between Narrative Memory and Episodic Memory in Agent S
Let's elaborate on the difference between Narrative Memory and Episodic Memory within the Agent S framework, as described in the paper (primarily in Sections 3.1.1, 3.1.2, 3.1.3, and visualized in Figure 3):
Think of it as two different levels of remembering how to do things:
Narrative Memory (Mn)
- Purpose: Used by the Manager for high-level planning. It helps decompose a complex user task into a sequence of manageable subtasks.
- Content: Stores summaries of entire past tasks. These summaries are abstractive, capturing the overall strategy or flow of completing a full task (like "how to create a presentation summarizing sales data") but removing specific low-level actions (like exact click coordinates or element IDs). It includes both successful and failed past task experiences.
- Granularity: Coarse-grained. Focuses on the whole task journey.
- Retrieval: Queried using the initial user task (Tu) and observation (O0) to find similar overall tasks encountered previously.
- Update: Updated by the Self-Evaluator at the end of the entire task with a summarized textual description of the overall strategy used for that complete task.
- Analogy: Like remembering the main chapters or phases required to write a research paper (e.g., Literature Review → Methodology → Experiments → Write-up), without remembering the exact sentence you typed first in the methodology section.
Episodic Memory (Me)
- Purpose: Used by the Worker for low-level execution. It helps determine the specific, step-by-step actions needed to complete the current subtask.
- Content: Stores detailed experiences from specific past subtasks. These include the complete sequence of grounded actions (e.g., agent.click(element_id=42), agent.type(text='Hello')) used to successfully complete a subtask previously. It only stores summaries from successful subtask completions (marked as DONE).
- Granularity: Fine-grained. Focuses on the step-by-step execution within one subtask.
- Retrieval: Queried using the current subtask (si) and its context (Csi) to find similar specific subtasks performed previously.
- Update: Updated by the Self-Evaluator after each successful subtask completion with a summarized textual description of the strategy used for that specific subtask.
- Analogy: Like remembering the exact sequence of commands or clicks you used to successfully run a specific experiment or generate a specific plot for the methodology section of your paper.
In Summary
| Feature | Narrative Memory (Mn) | Episodic Memory (Me) |
|---|---|---|
| User | Manager | Worker |
| Purpose | High-level Task Planning (subtask decomposition) | Low-level Subtask Execution (action generation) |
| Scope | Entire Task | Single Subtask |
| Content | Abstractive summaries of full tasks | Detailed step-by-step successful subtasks |
| Detail Level | Actions removed, overall strategy kept | Specific grounding actions included |
| Retrieval Key | Overall Task Query (Q) | Subtask Query (Tu, si, Csi) |
| Update Time | End of Entire Task | End of each Successful Subtask |
This two-level memory system allows Agent S to leverage past experiences effectively:
- Narrative Memory helps structure the overall approach to a complex problem.
- Episodic Memory provides the concrete, detailed steps needed to execute each part of that approach.
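The paper retrieves the most similar stored experience for each query; one common way to implement such retrieval (not necessarily what Agent S itself uses) is embedding similarity over the memory keys, sketched below with a hypothetical `embed` function.

```python
import math

def embed(text: str) -> list[float]:
    """Hypothetical text-embedding call (e.g., an embedding-model API)."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory: dict[str, str], query: str) -> str:
    """Return the stored summary whose key is most similar to the query.

    For Narrative Memory the query covers the full task (Tu, O0);
    for Episodic Memory it covers the subtask and its context (si, Csi).
    """
    if not memory:
        return ""
    q = embed(query)
    best_key = max(memory, key=lambda k: cosine(embed(k), q))
    return memory[best_key]
```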
How Agent S Manages Context Window Limitations
The paper does not detail prompt engineering specifics or exactly how inputs are truncated to fit context windows, but the design of Agent S incorporates several strategies that implicitly and explicitly deal with context window limitations:
Hierarchical Decomposition
- Key idea: Break a long, complex task into smaller, sequential subtasks (Manager → Workers), so the agent doesn't need to keep the entire history and plan in context at every step.
- Manager:
- Plans subtasks based on the initial task, observation, and retrieved high-level memories/web knowledge.
- Requires a relatively large but focused context for planning.
- Worker:
- Needs context only for:
- Current subtask (si)
- Associated context (Csi)
- Current screen observation (Ot)
- Retrieved relevant episodic memories (Esi)
- Reflection
- Does not need the full plan or unrelated subtask history, drastically reducing the required context size for action generation.
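To make the saving concrete, a Worker prompt only has to carry the items listed above. The sketch below is a hypothetical prompt builder, not the paper's actual template; the point is what is absent from it.

```python
def build_worker_prompt(subtask: str, context: str, observation: str,
                        episodic_experience: str, reflection: str) -> str:
    """Assemble the Worker's prompt from subtask-local information only.

    Note what never enters this prompt: the full subtask plan, other
    subtasks' histories, and raw narrative memory.
    """
    return (
        f"Subtask: {subtask}\n"
        f"Context: {context}\n"
        f"Relevant past experience: {episodic_experience}\n"
        f"Reflection: {reflection}\n"
        f"Current observation:\n{observation}\n"
        "Output exactly one primitive action, or DONE/FAIL."
    )
```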
Retrieval-Augmented Generation (RAG)
- Selective Retrieval:
- Manager retrieves relevant narrative summaries and web search results.
- Worker retrieves relevant past episodic (subtask) experiences.
- Benefit:
- Only pertinent information is fed into the context window.
- Avoids stuffing the entire memory into the prompt.
Summarization of Experiences
- Self-Evaluator:
- Summarizes both full tasks (for Narrative Memory) and successful subtasks (for Episodic Memory) instead of storing raw, lengthy trajectories (Section 3.1.3).
- Benefits:
- Summaries are more concise than full action logs.
- Summaries consume less space in the context window.
- Ablation study (Section 4.3, Figure 7) showed summaries are more effective than full trajectories, supporting their role in managing information density.
Focused Observation Representation (ACI)
- Agent-Computer Interface (ACI) (Section 3.3):
- Provides observations via:
- A screenshot
- An augmented Accessibility Tree
- Accessibility Tree:
- Provides structured information (elements, IDs, coordinates).
- Allows compact and direct interaction representation.
- Using element IDs (e.g., click(41)) is very token-efficient compared to describing images.
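The exact augmented accessibility-tree format is given in the paper's appendix; the sketch below shows one plausible compact serialization (field names and layout are assumptions) and why referring to elements by ID keeps both observations and actions short.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    element_id: int
    role: str       # e.g., "button", "text field"
    name: str       # accessible name or OCR-recovered label
    x: int
    y: int

def serialize_a11y_tree(elements: list[UIElement]) -> str:
    """Render the (augmented) accessibility tree as compact tagged lines,
    so the agent can refer to an element simply as click(41)."""
    return "\n".join(
        f"[{e.element_id}] {e.role} '{e.name}' at ({e.x}, {e.y})"
        for e in elements
    )

# Example:
# print(serialize_a11y_tree([UIElement(41, "button", "Save", 512, 384)]))
# -> [41] button 'Save' at (512, 384)
```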
Single Action Steps
- Constraint:
- The ACI requires the agent to output only one primitive action per time step (Section 3.3).
- Impact:
- Output from the LLM is small and structured.
- Reduces pressure on the context window from the generation side.
Summary
In essence, Agent S manages context limitations not by having an infinitely large window, but by:
- Structuring the problem hierarchically (Manager/Workers).
- Selective retrieval of only the most relevant summarized information.
- Using efficient representations for observations and actions.
This ensures that the information processed by the LLM at each decision point (planning or action generation) is focused and fits within the model's operational context limits.