# car racing: output
What can I learn from the output below? What should be the final output?

| Metric | Value |
|----------------------|-------------|
| rollout/ | |
| ep_len_mean | 1e+03 |
| ep_rew_mean | -53.9 |
| time/ | |
| fps | 87 |
| iterations | 4 |
| time_elapsed | 376 |
| total_timesteps | 32768 |
| train/ | |
| approx_kl | 0.019136123 |
| clip_fraction | 0.155 |
| clip_range | 0.2 |
| entropy_loss | -4.15 |
| explained_variance | 0.37 |
| learning_rate | 0.0003 |
| loss | 0.0195 |
| n_updates | 30 |
| policy_gradient_loss | -0.0167 |
| std | 0.964 |
| value_loss | 0.222 |

Later in the same training run:

| Metric | Value |
|----------------------|-------------|
| rollout/ | |
| ep_len_mean | 1e+03 |
| ep_rew_mean | 301 |
| time/ | |
| fps | 72 |
| iterations | 42 |
| time_elapsed | 4722 |
| total_timesteps | 344064 |
| train/ | |
| approx_kl | 6.300992 |
| clip_fraction | 0.86 |
| clip_range | 0.2 |
| entropy_loss | -0.317 |
| explained_variance | 0.928 |
| learning_rate | 0.0003 |
| loss | 0.158 |
| n_updates | 410 |
| policy_gradient_loss | 0.0584 |
| std | 0.269 |
| value_loss | 3.48 |
## Understanding the Output
### rollout/ (agent's interaction with the environment)
- ep_len_mean: The average length of an episode (number of steps before the episode ends). Here it is 1000, most likely the environment's predefined maximum episode length.
- ep_rew_mean: The average reward per episode. At -53.9 in the first snapshot the agent is still performing poorly; by the second snapshot it has risen to 301, showing clear learning progress. Over training this value should keep increasing (the sketch below shows where these episode statistics come from).
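
These episode statistics are gathered by Stable Baselines3's `Monitor` wrapper and averaged over a window of recent completed episodes (100 by default). Below is a minimal sketch of where the numbers come from; `"CarRacing-v2"` is an assumed environment id, since the exact id used by the original script is not shown on this page.

```python
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor

# Monitor records each finished episode's total reward and length; ep_rew_mean and
# ep_len_mean in the log are rolling means over the most recent episodes.
# "CarRacing-v2" is an assumption -- use whatever environment id the training script registers.
env = Monitor(gym.make("CarRacing-v2", continuous=True))

obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random actions, purely for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# When an episode ends, Monitor attaches its summary to the info dict:
# info["episode"] == {"r": total_reward, "l": episode_length, "t": elapsed_time}
print(info.get("episode"))
```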
### time/ (training speed)
- fps: Environment steps (frames) processed per second during training.
- iterations: The number of training iterations completed.
- time_elapsed: Total time (in seconds) since training started.
- total_timesteps: The total number of steps taken in the environment.
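
As a quick consistency check, fps is roughly total_timesteps / time_elapsed: 32768 / 376 ≈ 87 in the first snapshot and 344064 / 4722 ≈ 73 in the second, matching the reported values.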
### train/ (policy update metrics)
- approx_kl: Measures how much the policy changed during updates (too large can indicate instability).
- clip_fraction: Fraction of updates where the policy was clipped to prevent large changes.
- clip_range: The clipping threshold for PPO updates.
- entropy_loss: The negative of the policy's entropy. More negative values mean a more random policy (more exploration); as training progresses it typically moves toward zero, as it does here (-4.15 → -0.317).
- explained_variance: How well the value function predicts future rewards (closer to 1 is better).
- learning_rate: The learning rate for policy updates.
- loss: Overall loss function (includes policy and value function loss).
- n_updates: The number of policy updates so far.
- policy_gradient_loss: How much the policy is improving (negative values indicate improvement).
- std: Standard deviation of the Gaussian action distribution. It shrinks from 0.964 to 0.269 between the two snapshots, meaning the policy is becoming more deterministic.
- value_loss: Measures how well the value function is predicting future rewards.
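
To make the clipping-related metrics concrete, here is a simplified sketch of how these diagnostics are typically computed from one batch of a PPO update. It mirrors the conventions Stable Baselines3 uses but is an illustration rather than the library's actual code; `log_prob_new`, `log_prob_old`, and `advantages` are hypothetical tensors from one update batch.

```python
import torch

def ppo_diagnostics(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Illustrative PPO update diagnostics (not SB3's actual implementation)."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)

    # Clipped surrogate objective; the logged policy_gradient_loss is its negation,
    # so negative values mean the surrogate objective is being improved.
    unclipped = advantages * ratio
    clipped = advantages * torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    policy_gradient_loss = -torch.min(unclipped, clipped).mean()

    # clip_fraction: share of samples whose ratio fell outside [1 - clip_range, 1 + clip_range].
    clip_fraction = ((ratio - 1.0).abs() > clip_range).float().mean()

    # approx_kl: cheap estimator of the KL divergence between old and new policies.
    approx_kl = ((ratio - 1.0) - (log_prob_new - log_prob_old)).mean()

    return policy_gradient_loss, clip_fraction, approx_kl
```

Read against the second snapshot, a clip_fraction of 0.86 and an approx_kl of about 6.3 mean most samples are being clipped and the policy is shifting a lot per update, which, as noted above, can signal instability even while ep_rew_mean is improving.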
## What to Expect for the Final Output
- Training Completion: The `model.learn(total_timesteps=1_000_000)` call (see the sketch below) will run for 1,000,000 timesteps. `ep_rew_mean` should steadily increase, showing learning progress.
- Model Saving: After training, the model will be saved as `"ppo_carracing"`.
- Evaluation: The script will load the trained model and test it in the environment for 5 episodes. The car should perform noticeably better if training was successful.
- Error Handling: If an error occurs, you will see the message `"An error occurred during training: [error message]"`. Otherwise, the script will complete without an error message.
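
A minimal sketch of a training and evaluation script consistent with that description is shown below. Only `model.learn(total_timesteps=1_000_000)`, the `"ppo_carracing"` save name, the 5-episode evaluation, and the error message come from the text above; the environment id, the `CnnPolicy` choice, and all other details are assumptions.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Assumed environment id; substitute whatever the original script actually uses.
env = gym.make("CarRacing-v2", continuous=True)

try:
    # CnnPolicy is assumed because CarRacing observations are images.
    model = PPO("CnnPolicy", env, verbose=1)  # verbose=1 prints the tables shown above
    model.learn(total_timesteps=1_000_000)
    model.save("ppo_carracing")
except Exception as e:
    print(f"An error occurred during training: {e}")
else:
    # Evaluation: reload the saved model and run 5 episodes.
    model = PPO.load("ppo_carracing")
    eval_env = gym.make("CarRacing-v2", continuous=True, render_mode="human")
    for episode in range(5):
        obs, info = eval_env.reset()
        done, total_reward = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = eval_env.step(action)
            total_reward += float(reward)
            done = terminated or truncated
        print(f"Episode {episode + 1}: reward = {total_reward:.1f}")
```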
The most important metric to watch is ep_rew_mean, which should become less negative and eventually positive, indicating that the agent is successfully navigating the track.