# car racing: output
What can I learn from the output below? What should be the final output?

| Metric | Value |
|----------------------|-------------|
| rollout/ | |
| ep_len_mean | 1e+03 |
| ep_rew_mean | -53.9 |
| time/ | |
| fps | 87 |
| iterations | 4 |
| time_elapsed | 376 |
| total_timesteps | 32768 |
| train/ | |
| approx_kl | 0.019136123 |
| clip_fraction | 0.155 |
| clip_range | 0.2 |
| entropy_loss | -4.15 |
| explained_variance | 0.37 |
| learning_rate | 0.0003 |
| loss | 0.0195 |
| n_updates | 30 |
| policy_gradient_loss | -0.0167 |
| std | 0.964 |
| value_loss | 0.222 |

Later in the same training run:

| Metric | Value |
|----------------------|-------------|
| rollout/ | |
| ep_len_mean | 1e+03 |
| ep_rew_mean | 301 |
| time/ | |
| fps | 72 |
| iterations | 42 |
| time_elapsed | 4722 |
| total_timesteps | 344064 |
| train/ | |
| approx_kl | 6.300992 |
| clip_fraction | 0.86 |
| clip_range | 0.2 |
| entropy_loss | -0.317 |
| explained_variance | 0.928 |
| learning_rate | 0.0003 |
| loss | 0.158 |
| n_updates | 410 |
| policy_gradient_loss | 0.0584 |
| std | 0.269 |
| value_loss | 3.48 |
## Understanding the Output
### rollout/ (agent's interaction with the environment)
- ep_len_mean: The average length of an episode (number of steps before the episode ends). Here it is 1000, most likely the environment's predefined maximum episode length.
- ep_rew_mean: The average reward per episode. At -53.9 in the first snapshot the agent is still performing poorly; by the second snapshot it has risen to 301, showing clear learning progress. Over training this value should keep increasing (the sketch below shows where these episode statistics come from).
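
These episode statistics are gathered by Stable Baselines3's `Monitor` wrapper and averaged over a window of recent completed episodes (100 by default). Below is a minimal sketch of where the numbers come from; `"CarRacing-v2"` is an assumed environment id, since the exact id used by the original script is not shown on this page.

```python
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor

# Monitor records each finished episode's total reward and length; ep_rew_mean and
# ep_len_mean in the log are rolling means over the most recent episodes.
# "CarRacing-v2" is an assumption -- use whatever environment id the training script registers.
env = Monitor(gym.make("CarRacing-v2", continuous=True))

obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random actions, purely for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# When an episode ends, Monitor attaches its summary to the info dict:
# info["episode"] == {"r": total_reward, "l": episode_length, "t": elapsed_time}
print(info.get("episode"))
```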
### time/ (training speed)
- fps: Environment steps (frames) processed per second during training.
- iterations: The number of training iterations completed.
- time_elapsed: Total time (in seconds) since training started.
- total_timesteps: The total number of steps taken in the environment.
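
As a quick consistency check, fps is roughly total_timesteps / time_elapsed: 32768 / 376 ≈ 87 in the first snapshot and 344064 / 4722 ≈ 73 in the second, matching the reported values.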
### train/ (policy update metrics)
- approx_kl: Measures how much the policy changed during updates (too large can indicate instability).
- clip_fraction: Fraction of updates where the policy was clipped to prevent large changes.
- clip_range: The clipping threshold for PPO updates.
- entropy_loss: The negative of the policy's entropy. More negative values mean a more random policy (more exploration); as training progresses it typically moves toward zero, as it does here (-4.15 → -0.317).
- explained_variance: How well the value function predicts future rewards (closer to 1 is better).
- learning_rate: The learning rate for policy updates.
- loss: Overall loss function (includes policy and value function loss).
- n_updates: The number of policy updates so far.
- policy_gradient_loss: How much the policy is improving (negative values indicate improvement).
- std: Standard deviation of the Gaussian action distribution. It shrinks from 0.964 to 0.269 between the two snapshots, meaning the policy is becoming more deterministic.
- value_loss: Measures how well the value function is predicting future rewards.
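
To make the clipping-related metrics concrete, here is a simplified sketch of how these diagnostics are typically computed from one batch of a PPO update. It mirrors the conventions Stable Baselines3 uses but is an illustration rather than the library's actual code; `log_prob_new`, `log_prob_old`, and `advantages` are hypothetical tensors from one update batch.

```python
import torch

def ppo_diagnostics(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Illustrative PPO update diagnostics (not SB3's actual implementation)."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)

    # Clipped surrogate objective; the logged policy_gradient_loss is its negation,
    # so negative values mean the surrogate objective is being improved.
    unclipped = advantages * ratio
    clipped = advantages * torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    policy_gradient_loss = -torch.min(unclipped, clipped).mean()

    # clip_fraction: share of samples whose ratio fell outside [1 - clip_range, 1 + clip_range].
    clip_fraction = ((ratio - 1.0).abs() > clip_range).float().mean()

    # approx_kl: cheap estimator of the KL divergence between old and new policies.
    approx_kl = ((ratio - 1.0) - (log_prob_new - log_prob_old)).mean()

    return policy_gradient_loss, clip_fraction, approx_kl
```

Read against the second snapshot, a clip_fraction of 0.86 and an approx_kl of about 6.3 mean most samples are being clipped and the policy is shifting a lot per update, which, as noted above, can signal instability even while ep_rew_mean is improving.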
## What to Expect for the Final Output
- Training Completion: The `model.learn(total_timesteps=1_000_000)` call (see the sketch below) will run for 1,000,000 timesteps. `ep_rew_mean` should steadily increase, showing learning progress.
- Model Saving: After training, the model will be saved as `"ppo_carracing"`.
- Evaluation: The script will load the trained model and test it in the environment for 5 episodes. The car should perform noticeably better if training was successful.
- Error Handling: If an error occurs, you will see the message `"An error occurred during training: [error message]"`. Otherwise, the script will complete without an error message.
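
A minimal sketch of a training and evaluation script consistent with that description is shown below. Only `model.learn(total_timesteps=1_000_000)`, the `"ppo_carracing"` save name, the 5-episode evaluation, and the error message come from the text above; the environment id, the `CnnPolicy` choice, and all other details are assumptions.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Assumed environment id; substitute whatever the original script actually uses.
env = gym.make("CarRacing-v2", continuous=True)

try:
    # CnnPolicy is assumed because CarRacing observations are images.
    model = PPO("CnnPolicy", env, verbose=1)  # verbose=1 prints the tables shown above
    model.learn(total_timesteps=1_000_000)
    model.save("ppo_carracing")
except Exception as e:
    print(f"An error occurred during training: {e}")
else:
    # Evaluation: reload the saved model and run 5 episodes.
    model = PPO.load("ppo_carracing")
    eval_env = gym.make("CarRacing-v2", continuous=True, render_mode="human")
    for episode in range(5):
        obs, info = eval_env.reset()
        done, total_reward = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = eval_env.step(action)
            total_reward += float(reward)
            done = terminated or truncated
        print(f"Episode {episode + 1}: reward = {total_reward:.1f}")
```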
The most important metric to watch is ep_rew_mean, which should become less negative and eventually positive, indicating that the agent is successfully navigating the track.