PPO training stopped learning
I am trying to train the rotary inverted pendulum environment using a PPO agent. It's working, but it reaches a limit and doesn't learn past it, and I'm not sure why. Newbie to RL here, so go easy on me :). I think it has something to do with the yellow line, Q0. It could also be stuck in a local optimum, but I don't think that's the problem; I think the problem is that Q0 never gets past 100 and the agent isn't able to extract more useful information. Hopefully someone with a little more experience has something to say!
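For context, my understanding is that the Q0 trace on the training plot is the critic's value estimate for the initial state of each episode, i.e. roughly what the snippet below returns. This is just a sketch; obs0 here is a placeholder for whatever initial observation my model actually produces:

% Q0 is (as I understand it) the critic's estimate of the discounted return
% from the episode's initial state.
obs0 = zeros(7,1);              % placeholder initial observation (7x1, matching obsInfo)
q0 = getValue(critic, {obs0});  % value estimate that the training plot reports as Q0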
mdl = "rlQubeServoModel";
open_system(mdl)
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
rng(22)
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
numObs = prod(obsInfo.Dimension);
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
% critic:
criticNetwork = [
    featureInputLayer(numObs)
    fullyConnectedLayer(criticLayerSizes(1), ...
        Weights=sqrt(2/numObs)* ...
        (rand(criticLayerSizes(1),numObs)-0.5), ...
        Bias=1e-3*ones(criticLayerSizes(1),1))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(2), ...
        Weights=sqrt(2/criticLayerSizes(1))* ...
        (rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
        Bias=1e-3*ones(criticLayerSizes(2),1))
    reluLayer
    fullyConnectedLayer(1, ...
        Weights=sqrt(2/criticLayerSizes(2))* ...
        (rand(1,criticLayerSizes(2))-0.5), ...
        Bias=1e-3)
    ];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
critic = rlValueFunction(criticNetwork,obsInfo);
% actor:
% Input path layers
inPath = [
    featureInputLayer( ...
        prod(obsInfo.Dimension), ...
        Name="netOin")
    fullyConnectedLayer( ...
        prod(actInfo.Dimension), ...
        Name="infc")
    ];
% Path layers for mean value
meanPath = [
    tanhLayer(Name="tanhMean")
    fullyConnectedLayer(prod(actInfo.Dimension))
    scalingLayer(Name="scale", ...
        Scale=actInfo.UpperLimit)
    ];
% Path layers for standard deviations
% Using softplus layer to make them non negative
sdevPath = [
    tanhLayer(Name="tanhStdv")
    fullyConnectedLayer(prod(actInfo.Dimension))
    softplusLayer(Name="splus")
    ];
net = dlnetwork();
net = addLayers(net,inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
plot(net)
net = initialize(net);
summary(net)
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
    ActionMeanOutputNames="scale", ...
    ActionStandardDeviationOutputNames="splus", ...
    ObservationInputNames="netOin");
actorOpts = rlOptimizerOptions(LearnRate=1e-4);
criticOpts = rlOptimizerOptions(LearnRate=1e-4);
agentOpts = rlPPOAgentOptions( ...
    ExperienceHorizon=600, ...
    ClipFactor=0.02, ...
    EntropyLossWeight=0.01, ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    NumEpoch=3, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    SampleTime=0.1, ...
    DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=20000, ...
    MaxStepsPerEpisode=600, ...
    Plots="training-progress", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=430, ...
    ScoreAveragingWindowLength=100);
trainingStats = train(agent, simEnv, trainOpts);
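To see where it flattens out, I compare Q0 against the episode reward after training like this (sketch only; I'm assuming the result object returned by train exposes EpisodeIndex, EpisodeReward, and EpisodeQ0):

% Compare realized episode reward with the critic's initial-state estimate (Q0)
figure
plot(trainingStats.EpisodeIndex, trainingStats.EpisodeReward)
hold on
plot(trainingStats.EpisodeIndex, trainingStats.EpisodeQ0)
legend("Episode reward","Episode Q0 (critic estimate)")
xlabel("Episode")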
Thanks in advance!
mdl = "rlQubeServoModel";
open_system(mdl)
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
rng(22)
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
numObs = prod(obsInfo.Dimension);
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
% critic:
criticNetwork = [
featureInputLayer(numObs)
fullyConnectedLayer(criticLayerSizes(1), …
Weights=sqrt(2/numObs)*…
(rand(criticLayerSizes(1),numObs)-0.5), …
Bias=1e-3*ones(criticLayerSizes(1),1))
reluLayer
fullyConnectedLayer(criticLayerSizes(2), …
Weights=sqrt(2/criticLayerSizes(1))*…
(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), …
Bias=1e-3*ones(criticLayerSizes(2),1))
reluLayer
fullyConnectedLayer(1, …
Weights=sqrt(2/criticLayerSizes(2))* …
(rand(1,criticLayerSizes(2))-0.5), …
Bias=1e-3)
];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
critic = rlValueFunction(criticNetwork,obsInfo);
% actor:
% Input path layers
inPath = [
featureInputLayer( …
prod(obsInfo.Dimension), …
Name="netOin")
fullyConnectedLayer( …
prod(actInfo.Dimension), …
Name="infc")
];
% Path layers for mean value
meanPath = [
tanhLayer(Name="tanhMean");
fullyConnectedLayer(prod(actInfo.Dimension));
scalingLayer(Name="scale", …
Scale=actInfo.UpperLimit)
];
% Path layers for standard deviations
% Using softplus layer to make them non negative
sdevPath = [
tanhLayer(Name="tanhStdv");
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="splus")
];
net = dlnetwork();
net = addLayers(net,inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
plot(net)
net = initialize(net);
summary(net)
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, …
ActionMeanOutputNames="scale",…
ActionStandardDeviationOutputNames="splus",…
ObservationInputNames="netOin");
actorOpts = rlOptimizerOptions(LearnRate=1e-4);
criticOpts = rlOptimizerOptions(LearnRate=1e-4);
agentOpts = rlPPOAgentOptions(…
ExperienceHorizon=600,…
ClipFactor=0.02,…
EntropyLossWeight=0.01,…
ActorOptimizerOptions=actorOpts,…
CriticOptimizerOptions=criticOpts,…
NumEpoch=3,…
AdvantageEstimateMethod="gae",…
GAEFactor=0.95,…
SampleTime=0.1,…
DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(…
MaxEpisodes=20000,…
MaxStepsPerEpisode=600,…
Plots="training-progress",…
StopTrainingCriteria="AverageReward",…
StopTrainingValue=430,…
ScoreAveragingWindowLength=100);
trainingStats = train(agent, simEnv, trainOpts);
thanks in advanced! I am trying to train the rotatry inverted pendulum enviroment using a PPO agent. It’s working… but It’s reaching a limit and not learnign past this limit. I am not too sure why. Newbie to RL here so go easy on me :). I think it’s something to do with the yellow line, Q0. Also it could be reaching a local optima, but I don’t think this is the problem. I think the problem is with Q0 not getting past 100 and the agent not being able to extract more useful info. Hopefully, someone whith a little more experinace has something to say!
mdl = "rlQubeServoModel";
open_system(mdl)
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
rng(22)
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
numObs = prod(obsInfo.Dimension);
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
% critic:
criticNetwork = [
featureInputLayer(numObs)
fullyConnectedLayer(criticLayerSizes(1), …
Weights=sqrt(2/numObs)*…
(rand(criticLayerSizes(1),numObs)-0.5), …
Bias=1e-3*ones(criticLayerSizes(1),1))
reluLayer
fullyConnectedLayer(criticLayerSizes(2), …
Weights=sqrt(2/criticLayerSizes(1))*…
(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), …
Bias=1e-3*ones(criticLayerSizes(2),1))
reluLayer
fullyConnectedLayer(1, …
Weights=sqrt(2/criticLayerSizes(2))* …
(rand(1,criticLayerSizes(2))-0.5), …
Bias=1e-3)
];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
critic = rlValueFunction(criticNetwork,obsInfo);
% actor:
% Input path layers
inPath = [
featureInputLayer( …
prod(obsInfo.Dimension), …
Name="netOin")
fullyConnectedLayer( …
prod(actInfo.Dimension), …
Name="infc")
];
% Path layers for mean value
meanPath = [
tanhLayer(Name="tanhMean");
fullyConnectedLayer(prod(actInfo.Dimension));
scalingLayer(Name="scale", …
Scale=actInfo.UpperLimit)
];
% Path layers for standard deviations
% Using softplus layer to make them non negative
sdevPath = [
tanhLayer(Name="tanhStdv");
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="splus")
];
net = dlnetwork();
net = addLayers(net,inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
plot(net)
net = initialize(net);
summary(net)
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, …
ActionMeanOutputNames="scale",…
ActionStandardDeviationOutputNames="splus",…
ObservationInputNames="netOin");
actorOpts = rlOptimizerOptions(LearnRate=1e-4);
criticOpts = rlOptimizerOptions(LearnRate=1e-4);
agentOpts = rlPPOAgentOptions(…
ExperienceHorizon=600,…
ClipFactor=0.02,…
EntropyLossWeight=0.01,…
ActorOptimizerOptions=actorOpts,…
CriticOptimizerOptions=criticOpts,…
NumEpoch=3,…
AdvantageEstimateMethod="gae",…
GAEFactor=0.95,…
SampleTime=0.1,…
DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(…
MaxEpisodes=20000,…
MaxStepsPerEpisode=600,…
Plots="training-progress",…
StopTrainingCriteria="AverageReward",…
StopTrainingValue=430,…
ScoreAveragingWindowLength=100);
trainingStats = train(agent, simEnv, trainOpts);
thanks in advanced! ppo, rl, machine learning, ml, help, code, deep learning MATLAB Answers — New Questions