PPO | RL | Training for 10,000 episodes still fails to learn effectively, and the reward curve stays very flat
Using reinforcement learning with the PPO algorithm, the episode reward keeps oscillating over a wide range throughout training, while the average reward curve barely changes.
My question is whether these oscillations in the episode reward are caused by the large differences in the random initial environment settings, by large differences in the reward obtained per episode, or by unreasonable training parameter settings.
The initial environment generates a random starting location through distance and angle variables:
angle = -4*pi/8 + sign(2*rand(1)-1)*rand(1)*pi/8;  % angle in (-5*pi/8, -3*pi/8), centered on -pi/2
dist  = 10000 + rand(1)*4000;                      % distance in [10000, 14000)
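As a sanity check (a minimal sketch, separate from the training script, assuming the two lines above are the whole reset logic), sampling the reset many times shows how widely the starting geometry varies, which helps judge how much episode-to-episode reward spread the initialization alone could explain:
% Sketch: sample the reset logic to visualize the spread of initial conditions
N = 1e4;
angles = -4*pi/8 + sign(2*rand(N,1)-1).*rand(N,1)*pi/8;
dists  = 10000 + rand(N,1)*4000;
figure; histogram(rad2deg(angles)); xlabel('initial angle (deg)');
figure; histogram(dists);           xlabel('initial distance');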
The current agent and training settings are as follows:
actorOpts = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);
criticOpts = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);
agentOpts = rlPPOAgentOptions(...
    ActorOptimizerOptions=actorOpts,...
    CriticOptimizerOptions=criticOpts,...
    ExperienceHorizon=2500,...
    ClipFactor=0.1,...
    EntropyLossWeight=0.02,...
    MiniBatchSize=256,...
    NumEpoch=9,...
    AdvantageEstimateMethod="gae",...
    GAEFactor=0.95,...
    SampleTime=0.01,...
    DiscountFactor=0.99);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',100000, ...
    'MaxStepsPerEpisode',2500, ...
    'Verbose',false, ...
    'StopTrainingCriteria',"AverageReward",...
    'StopTrainingValue',1000,...
    'ScoreAveragingWindowLength',1000);
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 2^11;
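To narrow down the cause, one thing I could try (a hedged sketch, assuming env and agent are the environment and PPO agent built with the options above) is simulating the current agent for a fixed number of episodes and measuring the spread of returns. If the spread is as wide as during training even with exploration off, the oscillation would point to the random initial conditions or the reward design rather than the optimizer settings:
% Sketch: roll out the current agent and measure the spread of episode returns
simOpts = rlSimulationOptions(MaxSteps=2500, NumSimulations=20);
experiences = sim(env, agent, simOpts);
returns = arrayfun(@(e) sum(e.Reward.Data), experiences);
fprintf('return: mean %.1f, std %.1f, min %.1f, max %.1f\n', ...
    mean(returns), std(returns), min(returns), max(returns));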