Episode reward is too high to be true

My reward function is designed so that each step reward falls in the range -80 to 80. I have run my environment's step function many times, and the step reward behaves as expected. Below is a sample plot of reward against step when the action at each step is fully randomised.

However, whenever I train with my PPO agent, the episode reward is absurdly high, as shown in the table below.

That is an absurd episode reward for training with 45 steps per episode. It implies an average step reward of 187717.36/45 ≈ 4171.5, which should be impossible given the reward range.
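To make the mismatch concrete, here is the same arithmetic as a short MATLAB check. It is only a sketch that uses the numbers quoted above; the variable names are made up for illustration and do not come from my environment code.

```matlab
% Sanity check using only the numbers quoted in this post.
stepsPerEpisode       = 45;         % steps in one training episode
maxStepReward         = 80;         % upper end of the designed reward range
reportedEpisodeReward = 187717.36;  % value shown in the training table

% Largest episode reward the reward function should be able to produce.
maxEpisodeReward = stepsPerEpisode * maxStepReward;

% Average step reward implied by the reported episode reward.
impliedStepReward = reportedEpisodeReward / stepsPerEpisode;

% Fewest steps, each at the maximum reward, needed to reach the reported total.
minStepsNeeded = ceil(reportedEpisodeReward / maxStepReward);

fprintf('Max possible episode reward: %g\n', maxEpisodeReward);
fprintf('Implied average step reward: %.1f\n', impliedStepReward);
fprintf('Steps needed at max reward:  %d\n', minStepsNeeded);
```

With a cap of 80 per step, 45 steps can give at most 3600, and reaching the reported total would take at least 2347 steps at the maximum reward. So either individual step rewards are leaving the expected range during training, or far more than 45 step rewards are being summed into one episode.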

Another problem is that I can’t terminate training using ‘Stop Training’ in the Reinforcement Learning training monitor. I can stop the training by pausing MATLAB, and the window then shows that training has stopped. However, the environment keeps running even though the window says it has stopped. My environment opens a cmd window at every step because it calls external software, and I use taskkill to close the cmd window. I am not sure whether this is the cause. I have spent days on this and still don’t know what is causing it.

Any help is appreciated.
