PPO reinforcement learning agent doesn't learn
Hi, I am trying to design a reinforcement learning algorithm to land on the Moon within a defined region.
I implemented a PPO agent with the environment built in Simulink. The model is continuous-time. The action from the RL Agent Simulink block is the thrust, and the observation is the state (position and velocity). The reward is also continuous: a penalty outside certain boundaries (via the exteriorPenalty function), an exponential reward inside the boundaries, and additional weighted penalties on velocity and action magnitude.
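For reference, this is roughly the shape of the reward logic (a minimal sketch in a MATLAB Function block; the bounds, target, and weights below are placeholders, not my real values):

function r = landingReward(pos, vel, thrust)
    % Sketch of the reward shaping described above; all numbers are placeholders
    posMin = [-100; -100; 0];    % landing-region lower bounds (assumed)
    posMax = [100; 100; 50];     % landing-region upper bounds (assumed)
    target = [0; 0; 0];          % desired touchdown point (assumed)
    wVel = 0.1; wAct = 0.01;     % penalty weights (assumed)
    % Nonpositive penalty for leaving the region (RL Toolbox helper function)
    pOut = sum(exteriorPenalty(pos, posMin, posMax, "quadratic"));
    % Exponential reward peaking at the target inside the region
    rIn = exp(-norm(pos - target)^2/1e4);
    % Weighted penalties on velocity and control effort
    r = rIn + pOut - wVel*norm(vel)^2 - wAct*norm(thrust)^2;
end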
The model seems to work, but the agent does not learn as expected. I have played with the PPO options to avoid local minima and increase exploration. After many episodes the episode reward should trend upward, but instead it keeps oscillating between near-optimal values and worst cases. I know training can take a long time given the large environment; on the other hand, after a while I would expect to see better behavior, especially since the reward is so high in some episodes.
My questions are: how should I interpret the plots in the Reinforcement Learning Episode Manager? Which parameters should I change to help the agent learn what is better? Any other comments are welcome!
Thanks for helping!
Here is my code for creating the actor and critic, with the corresponding options:
% Actor network: outputs 2*numAct values (mean and std for each action)
actPath = [
    sequenceInputLayer(numObs,'Normalization','none','Name','obs')
    fullyConnectedLayer(50,'Name','fc1act')
    dropoutLayer(0.2,'Name','drop1act')
    layerNormalizationLayer('Name','norm1act')
    reluLayer('Name','relu1act')
    lstmLayer(8,'OutputMode','sequence','Name','lstmact')
    layerNormalizationLayer('Name','norm2act')
    fullyConnectedLayer(2*numAct,'Name','fcoutput')
    layerNormalizationLayer('Name','norm3act')
    softmaxLayer('Name','SoftactionProb')];
% Critic (value) network: maps the observation to a scalar state value
obsPath = [
    sequenceInputLayer(numObs,'Normalization','none','Name','obs')
    fullyConnectedLayer(100,'Name','fc1obs')
    dropoutLayer(0.2,'Name','drop1obs')
    layerNormalizationLayer('Name','norm1obs')
    reluLayer('Name','relu1obs')
    fullyConnectedLayer(22,'Name','fc2obs')
    dropoutLayer(0.2,'Name','drop2obs')
    layerNormalizationLayer('Name','norm2obs')
    reluLayer('Name','relu2obs')
    fullyConnectedLayer(5,'Name','fc3obs')
    dropoutLayer(0.2,'Name','drop3obs')
    layerNormalizationLayer('Name','norm3obs')
    reluLayer('Name','relu3obs')
    lstmLayer(8,'OutputMode','sequence','Name','lstmobs')
    layerNormalizationLayer('Name','norm4obs')
    fullyConnectedLayer(1,'Name','fcvalue')];
opts1 = rlRepresentationOptions("LearnRate",5e-3,"GradientThreshold",10,"UseDevice","gpu");
actor = rlStochasticActorRepresentation(actPath,obsInfo,actInfo,'Observation','obs',opts1)
critic = rlValueRepresentation(obsPath,obsInfo,'Observation','obs',opts1)
opts2 = rlPPOAgentOptions("ExperienceHorizon",200, ...
    "SampleTime",0.25, ...
    "MiniBatchSize",32, ...
    "EntropyLossWeight",0.5, ...
    "AdvantageEstimateMethod","gae", ...
    "GAEFactor",0.8, ...
    "NormalizedAdvantageMethod","current");
agent = rlPPOAgent(actor,critic,opts2)
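For context, this is roughly how I launch training and open the Episode Manager (a sketch; the episode counts and stopping criteria are placeholder values, and env is my Simulink environment object):

trainOpts = rlTrainingOptions( ...
    "MaxEpisodes",5000, ...                % placeholder
    "MaxStepsPerEpisode",500, ...          % placeholder
    "ScoreAveragingWindowLength",50, ...   % window for the average-reward curve
    "StopTrainingCriteria","AverageReward", ...
    "StopTrainingValue",1000, ...          % placeholder
    "Plots","training-progress");          % opens the Episode Manager
trainingStats = train(agent,env,trainOpts);

As far as I understand, the Episode Manager plots the per-episode reward, the running average over the ScoreAveragingWindowLength window, and Episode Q0 (the critic's estimate of the discounted return from the initial state); the running average is the curve that should trend upward, but please correct me if I am reading these wrong.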