How can I save an output of a custom step function in reinforcement learning?
I have written code to train a DQN agent with a custom environment defined through step and reset functions, following the example in the documentation. However, I would like to store the state computed inside the step function so that I can investigate it after training and after simulating the agent in the environment. I know how to retrieve the actions and observations from the simulation output, but I would also like the state, which is currently a field of the LoggedSignals structure. I attach the main script, the step function, and the reset function below.
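To make the question more concrete, what I have in mind is something like the sketch below: at every step I would append the state to an extra history field of LoggedSignals and then try to read it back after the simulation. The StateHistory field name and the idea that the environment object exposes LoggedSignals afterwards are only my assumptions, so I do not know if this is a correct (or even working) approach:

% In my_resetfun (sketch): start every episode with an empty history
LoggedSignal.StateHistory = [];
% In my_stepfun (sketch): append the updated state after the integration step
LoggedSignals.StateHistory = [LoggedSignals.StateHistory, LoggedSignals.State];
% After sim(env,agent) (sketch): read the history back from the environment
stateHistory = env.LoggedSignals.StateHistory; % not sure this property is exposed

If there is a cleaner way, for example writing the state to the base workspace from inside the step function with assignin, that would also be fine for me.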
clear
clc
close all
load('ws_lorenz','tot_T')
%% Create Environment Interface
% rlNumericSpec([1 1]) specifies a single scalar observation (the
% reactivity r), which can take any real value.
obsInfo = rlNumericSpec([1 1]);
obsInfo.Name = 'reactivity';
obsInfo.Description = 'r';
u_1 = [0.1 2];
my_cell = reshape(num2cell(u_1),1,length(u_1));
actInfo = rlFiniteSetSpec(my_cell);
actInfo.Name = ‘System Action’;
% now we are ready to define the environment.
%doc rlSimulinkEnv Create reinforcement learning environment using dynamic model implemented in Simulink
%doc rlFunctionEnv Specify custom reinforcement learning environment dynamics using functions
env = rlFunctionEnv(obsInfo,actInfo,'my_stepfun','my_resetfun');
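% Optional sanity check on the custom step/reset functions (assuming
% validateEnvironment is available in my Reinforcement Learning Toolbox version):
% validateEnvironment(env)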
% Fix the random generator seed for reproducibility.
rng(0)
%% Create DQN agent
%A DQN agent approximates the long-term reward given observations and
%actions using a critic value function representation.
%To create the critic, first create a deep neural network with the state as
% an input and as many outputs as the different values the control action
% can take (this is the size of the cell). The idea here is to obtain a
% different parametric approximator of the Q-factor for each value of u.
net = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(256)
reluLayer
fullyConnectedLayer(length(actInfo.Elements))
];
net = dlnetwork(net);
summary(net)
% Plot network
plot(net)
% Specify options for the critic optimizer. The learning rate is key: a
% higher value speeds up training but can make the results less accurate.
criticOptions = rlOptimizerOptions( ...
LearnRate=1e-3, ...
GradientThreshold=1);
% Specify the action and observation info for the critic, which you obtain
% from the environment interface.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
% A vector Q-value function uses a single network to produce a separate
% Q-value estimate for each possible value of the action u.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
%To create the DQN agent, first specify the DQN agent options using rlDQNAgentOptions.
agentOpts = rlDQNAgentOptions( ...
'UseDoubleDQN',true, ...
'TargetUpdateMethod',"periodic", ...
'TargetUpdateFrequency',10, ...
'ExperienceBufferLength',100000, ...
'DiscountFactor',0.95, ...
'MiniBatchSize',128, ...
CriticOptimizerOptions=criticOptions);
agentOpts.EpsilonGreedyExploration.Epsilon = 0.8;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
%Then, create the DQN agent using the specified critic representation
%and agent options.
agent = rlDQNAgent(critic,agentOpts);
%% Train Agent
%To train the agent, first specify the training options.
%Run one training session containing at most 10000 episodes,
%with each episode lasting at most tot_T time steps.
%Display the training progress in the Episode Manager dialog box
%and disable the command line display (set the Verbose option to false).
%Stop training when the agent receives a moving average cumulative reward
%greater than the StopTrainingValue (1 here).
trainOpts = rlTrainingOptions( ...
'MaxEpisodes', 10000, ... % if the number of steps per episode is increased, this could be decreased.
'MaxStepsPerEpisode', tot_T, ... % this number of steps per episode might be insufficient in general
'Verbose', false, ...
'Plots','training-progress', ...
'StopTrainingCriteria','AverageReward', ...
'StopTrainingValue',1, ...
UseParallel=false);
%% Train the agent using the train function.
trainingStats = train(agent,env,trainOpts);
%% Simulate DQN Agent
%To validate the performance of the trained agent, simulate it within the
% environment.
experience = sim(env,agent);
totalReward = sum(experience.Reward)
figure(1)
x = squeeze(experience.Action.SystemAction.Data(:,1,:)); % Data is 1x1x258
plot(x')
title('Actions Over Time');
react = squeeze(experience.Observation.reactivity.Data(:,1,:)); % Data is 1x1x259
figure(2)
plot(react')
title('Reactivity Over Time');
figure(3)
plot(trainingStats.EpisodeIndex, trainingStats.AverageReward);
xlabel('Episode');
ylabel('Average Reward');
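% To keep the results for later inspection I also plan to save them to a
% MAT-file (the file name here is just a placeholder):
save('dqn_run_results.mat','trainingStats','experience');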
function [NextObs,Reward,IsDone,LoggedSignals] ...
= my_stepfun(Action,LoggedSignals)
% Custom step function.
%[NextObservation,Reward,IsDone,UpdatedInfo] = myStepFunction(Action,Info)
% This function applies the given action to the environment and evaluates
% the system dynamics for one simulation step.
% Define the environment constants.
% Sample time
Ts = 1;
sig = 1.3;
DF = LoggedSignals.DF;
L = LoggedSignals.L;
H = LoggedSignals.H;
xi = LoggedSignals.xi;
m = LoggedSignals.m;
n = LoggedSignals.n;
tot_T = LoggedSignals.tot_T;
LoggedSignals.Time = LoggedSignals.Time+Ts;
kk = (1/Ts)*LoggedSignals.Time;
u = Action;
% Unpack the state vector from the logged signals.
x_k = LoggedSignals.State;
% Integrate the dynamics over one sample time step with ode113.
[t, x] = ode113(@(t,x) my_lorenz_DQN(t,x,L,u,DF,H),[0 Ts],x_k');
LoggedSignals.State = x(end,:)';
% compute average state
St = [mean(x(end,1:n),2), mean(x(end,n+1:2*n),2), mean(x(end,2*n+1:3*n),2)];
% compute reactivity (using sig)
r = max(eig((DF(St) + DF(St)')/2 + sig*xi*H));
% The next observation is the reactivity
NextObs = r;
% Check early termination condition.
[err, ~, ~] = Err_sync(x, t, n, m, 0);
if LoggedSignals.Time >= 0.9*tot_T
LoggedSignals.cum_err = LoggedSignals.cum_err+err;
end
IsDone1 = LoggedSignals.cum_err>(20*eps);
IsDone2 = err>1e-1;
w1 = 1e5;
w2 = 1e2;
if IsDone1==1
Reward = -(tot_T-LoggedSignals.Time)*1e3;
elseif IsDone2==1
Reward = -(tot_T-LoggedSignals.Time)*1e4;
else
Reward = 1 - w1*err - w2*u;
end
IsDone = max(IsDone1,IsDone2);
end
function [InitialObservation, LoggedSignal] = my_resetfun()
load('reset_ws.mat','x0')
load('ws_lorenz','DF','L','H','xi','n','m','tot_T')
x = x0(:,randi(size(x0,2)));
LoggedSignal.State = x;
InitialObservation = 1; % to be changed (placeholder initial observation)
LoggedSignal.Time = 0;
LoggedSignal.DF = DF;
LoggedSignal.L = L;
LoggedSignal.H = H;
LoggedSignal.xi = xi;
LoggedSignal.m = m;
LoggedSignal.n = n;
LoggedSignal.cum_err = 0;
LoggedSignal.tot_T = tot_T;
end