RL PPO agent diverges with one-step training
Hi,
I am training a PPO agent on a system with a continuous action space, and I want the agent to train for only one step and one episode per call to train() so I can see how it performs:
trainingOpts = rlTrainingOptions( ...
    MaxEpisodes=1, ...
    MaxStepsPerEpisode=1, ...
    Verbose=false, ...
    Plots="none", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
These are the agent settings:
function [agents,obsInfo,actionInfo] = generate_PPOagents(Ts)
    % observation and action specifications
    obsInfo = rlNumericSpec([2 1],'LowerLimit',-inf*ones(2,1),'UpperLimit',inf*ones(2,1));
    obsInfo.Name = 'state';
    obsInfo.Description = 'position, velocity';
    actionInfo = rlNumericSpec([1 1],'LowerLimit',-inf,'UpperLimit',inf);
    actionInfo.Name = 'continuousAction';

    agentOptions = rlPPOAgentOptions( ...
        'DiscountFactor', 0.99, ...
        'EntropyLossWeight', 0.01, ...
        'ExperienceHorizon', 20, ...
        'MiniBatchSize', 20, ...
        'ClipFactor', 0.2, ...
        'NormalizedAdvantageMethod', 'none', ...
        'SampleTime', -1);

    agent1 = rlPPOAgent(obsInfo, actionInfo, agentOptions);
    agent2 = rlPPOAgent(obsInfo, actionInfo, agentOptions);
    agents = [agent1, agent2];
end
My reward is conditional on whether the states satisfy certain conditions:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction1(action, loggedSignals, S)
    % propagate the discrete-time linear system one step
    nextObs = S.A1d*[loggedSignals.State(1); loggedSignals.State(2)] + S.B1d*action;
    loggedSignals.State = nextObs;

    % penalty when a state leaves the +/-10 box, quadratic state cost otherwise
    if abs(nextObs(1)) > 10 || abs(nextObs(2)) > 10
        reward = S.test - 100;
    else
        reward = -1*(nextObs(1)^2 + nextObs(2)^2);
    end
    isDone = false;
end
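For context, the reset function and environment wiring look roughly like this (a simplified sketch; myResetFunction1, x0, and the anonymous-function wrappers are placeholder names, since rlFunctionEnv expects a two-argument step function and a zero-argument reset function):
function [initialObs, loggedSignals] = myResetFunction1(x0)
    % start the episode from the carried-over state x0
    initialObs = x0;
    loggedSignals.State = initialObs;   % myStepFunction1 reads loggedSignals.State
end
The extra arguments are bound with anonymous functions when the environment is created:
stepFcn = @(action, loggedSignals) myStepFunction1(action, loggedSignals, S);
resetFcn = @() myResetFunction1(x0);
env = rlFunctionEnv(obsInfo, actionInfo, stepFcn, resetFcn);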
In this setup, every time train() finishes, the agent moves forward one step using getAction(). I then modify the reset function and update the environment so that the next call to train() starts the simulation from the new state, and then call train() again to carry out the loop. But when I simulate the system, the states diverge to Inf after only about 20 train() iterations. I have checked my environment and the agent settings, and everything seems fine. I also tested whether the issue comes from the penalty in the reward function by changing S.test above, but the simulation still fails.
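The outer loop is roughly the following (a simplified sketch with placeholder names; agent stands for one of the agents returned by generate_PPOagents, and x0 is the state carried between iterations):
x0 = [0.5; 0];   % example starting state
stepFcn = @(action, loggedSignals) myStepFunction1(action, loggedSignals, S);
for k = 1:100
    % rebuild the environment so reset() starts from the current state
    resetFcn = @() myResetFunction1(x0);
    env = rlFunctionEnv(obsInfo, actionInfo, stepFcn, resetFcn);
    trainingStats = train(agent, env, trainingOpts);   % one episode, one step
    % advance the real system by one step with the current policy
    action = getAction(agent, {x0});
    x0 = S.A1d*x0 + S.B1d*action{1};
end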
I am not sure whether the issue is caused by the one-episode, one-step training method. In theory I expect poor performance at first, but it should not diverge to Inf this quickly.
Thanks.