Algorithms Explained - KunjShah01/RL-A2A GitHub Wiki

This section provides a deep dive into the RL algorithms implemented in RL-A2A.


Actor-to-Actor (A2A)

A2A is a variant of the Actor-Critic family in which multiple actor networks interact, cooperating or competing with one another, to improve exploration and robustness.

  • Actor Networks: Learn policy directly.
  • Critic Networks: Estimate value functions.
  • Multi-Actor Coordination: Unique feature of A2A, enabling richer behaviors.
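The structure above can be sketched as a handful of actors sharing one critic. This is an illustrative toy (tabular policies, made-up sizes, hypothetical class names), not the repository's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class Actor:
    """Tabular actor: one logit vector per state, sampled stochastically."""
    def __init__(self, n_states, n_actions):
        self.logits = np.zeros((n_states, n_actions))

    def act(self, state):
        p = softmax(self.logits[state])
        return int(rng.choice(len(p), p=p))

class Critic:
    """Tabular state-value estimate, shared across all actors."""
    def __init__(self, n_states):
        self.values = np.zeros(n_states)

# Multi-actor coordination: several actors observe the same state and
# each proposes an action; the shared critic scores the resulting states.
n_states, n_actions = 4, 3
actors = [Actor(n_states, n_actions) for _ in range(2)]
critic = Critic(n_states)

state = 0
actions = [a.act(state) for a in actors]
print(actions)  # one sampled action per actor
```

In a full implementation each actor would be a neural network and the critic's value estimates would drive the policy-gradient updates shown below in the mathematical formulation.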

Other Algorithms

  • A2C: Advantage Actor-Critic, a synchronous, deterministic variant of A3C.
  • PPO: Proximal Policy Optimization, a robust and popular policy gradient method.
  • DQN (if implemented): Deep Q-Network, a value-based RL algorithm.
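PPO's key idea is a clipped surrogate objective that keeps each policy update close to the old policy. A minimal NumPy sketch of that objective (function name and example numbers are ours, not the repository's API):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies; adv: advantage estimates; eps: clip range.
    """
    ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise minimum removes the incentive to move the
    # probability ratio outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped).mean()

# If the new policy doubles an action's probability, the ratio of 2.0 is
# clipped to 1.2 (with eps = 0.2) before multiplying the advantage.
print(ppo_clip_loss(np.log([2.0]), np.log([1.0]), np.array([1.0])))
```

The clipping is what makes PPO robust to step size: even a large policy change contributes at most a bounded improvement to the objective.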

Mathematical Formulation

[ J(\theta) = \mathbb{E}_{s, a \sim \pi_\theta} [\log \pi_\theta(a|s) \, A^\pi(s, a)] ]

Where:

  • ( \pi_\theta ): Policy parameterized by θ
  • ( A^\pi(s, a) ): Advantage function, ( A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) )
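The objective can be evaluated numerically for a toy softmax policy. The logits and Q-values below are made-up numbers chosen only to exercise the formula; the baseline is V(s) = E_π[Q(s, ·)], so the advantages average to zero under the policy:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy policy over 3 actions in a single state (hypothetical logits).
logits = np.array([1.0, 0.0, -1.0])
pi = softmax(logits)

# Advantage A(s, a) = Q(s, a) - V(s), with V(s) = E_pi[Q(s, .)]
q = np.array([1.0, 0.5, 0.0])    # made-up action values
v = (pi * q).sum()
adv = q - v

# J(theta) as the exact expectation over actions:
# sum_a pi(a|s) * log pi(a|s) * A(s, a)
J = (pi * np.log(pi) * adv).sum()
print(J)
```

In practice this expectation is estimated from sampled trajectories rather than computed exactly, and its gradient with respect to the policy parameters drives the actor update.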

See References for foundational papers.


References