PPO | RL | Training for 10,000 episodes still fails to learn effectively, and the reward curve stays very flat
Using reinforcement learning with the PPO algorithm, the episode reward keeps oscillating over a wide range throughout training, while the average reward curve barely changes.
My question is whether these oscillations in the episode reward are caused by the large differences in the random initial environment settings, by large differences in the reward obtained per episode, or by unreasonable training parameter settings.
The initial environment generates a random starting location through distance and angle variables:
angle = -4*pi/8 + sign(2*rand(1)-1)*rand(1)*pi/8;  % angle in (-5*pi/8, -3*pi/8), centered on -pi/2
dist  = 10000 + rand(1)*4000;                      % distance in [10000, 14000)
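As a sanity check (a minimal sketch, separate from the training script, assuming the two lines above are the whole reset logic), sampling the reset many times shows how widely the starting geometry varies, which helps judge how much episode-to-episode reward spread the initialization alone could explain:
% Sketch: sample the reset logic to visualize the spread of initial conditions
N = 1e4;
angles = -4*pi/8 + sign(2*rand(N,1)-1).*rand(N,1)*pi/8;
dists  = 10000 + rand(N,1)*4000;
figure; histogram(rad2deg(angles)); xlabel('initial angle (deg)');
figure; histogram(dists);           xlabel('initial distance');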
The current agent and training settings are as follows:
actorOpts = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);
criticOpts = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);
agentOpts = rlPPOAgentOptions(...
    ActorOptimizerOptions=actorOpts,...
    CriticOptimizerOptions=criticOpts,...
    ExperienceHorizon=2500,...
    ClipFactor=0.1,...
    EntropyLossWeight=0.02,...
    MiniBatchSize=256,...
    NumEpoch=9,...
    AdvantageEstimateMethod="gae",...
    GAEFactor=0.95,...
    SampleTime=0.01,...
    DiscountFactor=0.99);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',100000, ...
    'MaxStepsPerEpisode',2500, ...
    'Verbose',false, ...
    'StopTrainingCriteria',"AverageReward",...
    'StopTrainingValue',1000,...
    'ScoreAveragingWindowLength',1000);
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 2^11;
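To narrow down the cause, one thing I could try (a hedged sketch, assuming env and agent are the environment and PPO agent built with the options above) is simulating the current agent for a fixed number of episodes and measuring the spread of returns. If the spread is as wide as during training even with exploration off, the oscillation would point to the random initial conditions or the reward design rather than the optimizer settings:
% Sketch: roll out the current agent and measure the spread of episode returns
simOpts = rlSimulationOptions(MaxSteps=2500, NumSimulations=20);
experiences = sim(env, agent, simOpts);
returns = arrayfun(@(e) sum(e.Reward.Data), experiences);
fprintf('return: mean %.1f, std %.1f, min %.1f, max %.1f\n', ...
    mean(returns), std(returns), min(returns), max(returns));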