AutoGen‐Notes - KrArunT/InfobellIT-Gen-AI GitHub Wiki

Notes:

  • Build predictable, ordered workflows with agents, and integrate them with new user interfaces that are not chat-based.
  • Deterministic, ordered workflows as well as event-driven or decentralized workflows.
  • Human-in-the-loop scenarios, both active and reactive.
  • AgentOps

AgentEval


  • Offers a scalable and cost-effective alternative to traditional human evaluation.
  • The framework comprises three main agents: CriticAgent, QuantifierAgent, and VerifierAgent, each playing a crucial role in assessing the task utility of an application.

CriticAgent: Defining the Criteria

The CriticAgent's primary function is to suggest a set of criteria for evaluating an application, based on the task description and examples of successful and failed executions. For instance, in the context of a math tutoring application, the CriticAgent might propose criteria such as efficiency, clarity, and correctness. These criteria are essential for understanding the various dimensions of the application's performance. It is highly recommended that application developers validate the suggested criteria by leveraging their domain expertise.
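The shape of the CriticAgent's output can be sketched as follows. This is a minimal illustration, not the actual AgentEval API: the hypothetical `suggest_criteria` function stands in for a prompt to an LLM-backed CriticAgent, and simply returns fixed criteria for the math-tutoring example from the text.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    accepted_values: list[str]  # ordinal scale the QuantifierAgent will use

def suggest_criteria(task_description: str) -> list[Criterion]:
    """Stand-in for a CriticAgent call: in AgentEval this would prompt an
    LLM with the task description plus success/failure examples. Here we
    return fixed criteria for a math-tutoring task."""
    scale = ["poor", "fair", "good"]
    return [
        Criterion("correctness", "Is the final answer right?", scale),
        Criterion("clarity", "Is the explanation easy to follow?", scale),
        Criterion("efficiency", "Does the solution avoid needless steps?", scale),
    ]

criteria = suggest_criteria("Solve middle-school algebra word problems")
print([c.name for c in criteria])
```

Developers would then prune or edit this list using their domain expertise before quantification.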

QuantifierAgent: Quantifying the Performance

Once the criteria are established, the QuantifierAgent takes over to quantify how well the application performs against each criterion. This quantification process results in a multi-dimensional assessment of the application's utility, providing a detailed view of its strengths and weaknesses.
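The multi-dimensional result can be pictured with a small sketch. The `quantify` function below is hypothetical: in AgentEval the QuantifierAgent asks an LLM to rate the output on each criterion, whereas this stub keys off trivial text markers purely to show the shape of the per-criterion report.

```python
def quantify(task_output: str, criteria: dict[str, list[str]]) -> dict[str, str]:
    """Rate one task output against each criterion's scale.
    The keyword heuristic below is a placeholder for an LLM judgment."""
    markers = {"correctness": "answer:", "clarity": "step", "efficiency": "therefore"}
    scores = {}
    for name, scale in criteria.items():
        good, poor = scale[-1], scale[0]
        scores[name] = good if markers.get(name, "") in task_output.lower() else poor
    return scores

criteria = {
    "correctness": ["poor", "good"],
    "clarity": ["poor", "good"],
    "efficiency": ["poor", "good"],
}
report = quantify("Step 1: expand. Step 2: solve. Answer: x = 4", criteria)
print(report)
```

The output is one rating per criterion, which is what gives the assessment its multi-dimensional view of strengths and weaknesses.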

VerifierAgent: Ensuring Robustness and Relevance

The VerifierAgent ensures that the criteria used to evaluate task utility are effective for the end-user, maintaining both robustness and high discriminative power. It does this through two main actions:

  1. Criteria Stability:

Ensures criteria are essential, non-redundant, and consistently measurable. Iterates over generating and quantifying criteria, eliminating redundancies, and evaluating their stability. Retains only the most robust criteria.

  2. Discriminative Power:

Tests the system's reliability by introducing adversarial examples (noisy or compromised data). Assesses the system's ability to distinguish these from standard cases. If the system fails, it indicates the need for better criteria to handle varied conditions effectively.

  3. Human evaluation (SME)
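The discriminative-power check described above can be sketched numerically. This is an illustrative reduction, not AgentEval's implementation: a criterion is treated as useful only if quantified scores on adversarial (noisy or compromised) runs fall visibly below scores on normal runs.

```python
def has_discriminative_power(
    normal: list[int], adversarial: list[int], margin: float = 1.0
) -> bool:
    """Return True if a criterion separates clean runs from adversarial
    ones: the mean score on adversarial cases must drop by at least
    `margin` relative to normal cases."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(normal) - mean(adversarial) >= margin

# A 'correctness' criterion whose scores collapse on corrupted inputs
# passes the check; a criterion that barely moves fails it, signaling
# the need for better criteria.
print(has_discriminative_power([3, 3, 2], [1, 1, 2]))  # True
print(has_discriminative_power([3, 3, 3], [3, 2, 3]))  # False
```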

AgentEval currently has two main stages: criteria generation and criteria quantification (criteria verification is still under development). Both stages make use of sequential LLM-powered agents to make their determinations.

  • What's an agent? There are many different definitions of agents. When building AutoGen, I was looking for the most generic notion that can incorporate all these different definitions. To do that, we really need to think about the minimal set of concepts that are needed.

In AutoGen, we think about an agent as an entity that can act on behalf of human intent. It can send messages, receive messages, respond to other agents after taking actions, and interact with other agents. We think this is the minimal set of capabilities an agent needs underneath. Agents can have different types of backends to support them in performing actions and generating replies. Some agents use AI models to generate replies; others use functions underneath to generate tool-based replies; and others use human input as a way to reply to other agents. You can also have agents that mix these different types of backends, or more complex agents that have internal conversations among multiple agents. On the surface, other agents still perceive such an agent as a single entity to communicate with.

With this definition, we can incorporate both very simple agents that can solve simple tasks with a single backend, but also we can have agents that are composed of multiple simpler agents. One can recursively build up more powerful agents. The agent concept in AutoGen can cover all these different complexities.
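This minimal notion of an agent, with pluggable backends and recursive composition, can be sketched in a few lines. The class names here are illustrative, not AutoGen's API: each agent exposes only a `reply` surface, backends are plain callables, and a composed agent runs an internal conversation while appearing to the outside as a single entity.

```python
from typing import Callable, List

class LLMAgent:
    """Backend: an AI model generates replies (stubbed with a callable)."""
    def __init__(self, name: str, model: Callable[[str], str]):
        self.name, self.model = name, model
    def reply(self, message: str) -> str:
        return self.model(message)

class ToolAgent:
    """Backend: a plain function produces tool-based replies."""
    def __init__(self, name: str, tool: Callable[[str], str]):
        self.name, self.tool = name, tool
    def reply(self, message: str) -> str:
        return self.tool(message)

class ComposedAgent:
    """Internally a conversation among several agents; externally one entity."""
    def __init__(self, name: str, members: List):
        self.name, self.members = name, members
    def reply(self, message: str) -> str:
        # minimal inner conversation: each member responds to the last turn
        for member in self.members:
            message = member.reply(message)
        return message

solver = LLMAgent("solver", lambda m: m + " -> 2x = 8")
calculator = ToolAgent("calc", lambda m: m + " -> x = 4")
tutor = ComposedAgent("tutor", [solver, calculator])

# callers only ever see 'tutor', a single interaction point
print(tutor.reply("2x + 1 = 9"))  # 2x + 1 = 9 -> 2x = 8 -> x = 4
```

Because `ComposedAgent` itself satisfies the same `reply` surface, composition can be applied recursively to build up more powerful agents from simpler ones.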

What are the pros and cons of multi vs. single agent? This question can be asked in a variety of ways.

Why should I use multiple agents instead of a single agent?

Why think about multi-agents when we don't have a strong single agent?

Does multi-agent increase the complexity, latency and cost?

When should I use multi-agents vs. single agent?

When we use the words 'multi-agent' and 'single-agent', I think there are at least two different dimensions we need to think about.

  • Interface. From the user's point of view, do they interact with the system through a single interaction point, or do they explicitly see multiple agents working and need to interact with several of them?
  • Architecture. Are there multiple agents running underneath at the backend?

A particular system can have a single-agent interface and a multi-agent architecture, and the users don't need to know that.

Interface. A single interaction point makes many applications' user experience more straightforward, but there are cases where it is not the best solution. For example, when the application has multiple agents debate a subject, the users need to see what each agent says, so it is beneficial for them to observe the multi-agent behavior directly. Another example is social simulation experiments: people also want to see the behavior of each individual agent.


AutoDefense demonstrates that using multiple agents reduces the risk of suffering from jailbreak attacks.

AutoGen Technical Report (updated): https://arxiv.org/pdf/2308.08155

AutoGen Blogs

Challenges with agents

  • Exploding context
  • LLM call cost (number of tokens)

TBD

  • Mixtral
  • DeepSeek
  • Divide and Conquer (with agents)
  • Llama Guard agent
  • GenAI in Common words
  • Web-surfer

Agent Frameworks

  • OpenAI Swarm
  • Autogen
  • SmolAgents