Syntax
generateRewardFunction(mpcobj)
generateRewardFunction(blks)
generateRewardFunction(___,'FunctionName',myFcnName)
Description
generateRewardFunction(mpcobj) generates a MATLAB® reward function based on the cost and constraints defined in the linear or nonlinear MPC object mpcobj. The generated reward function is displayed in a new editor window, and you can use it as a starting point for reward design. You can tune the weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.
This syntax requires Model Predictive Control Toolbox™ software.
generateRewardFunction(blks) generates a MATLAB reward function based on the performance constraints defined in the model verification blocks specified in the array of block paths blks.
This syntax requires Simulink® Design Optimization™ software.
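For example, a minimal sketch of this syntax for a hypothetical model myModel containing two verification blocks (the model and block names here are illustrative):

% Array of verification block paths (illustrative names)
blks = ["myModel/Check Step Response Characteristics", ...
        "myModel/Check Against Reference"];
generateRewardFunction(blks)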
generateRewardFunction(___,'FunctionName',myFcnName) specifies the name of the generated reward function and saves it into a file with the same name. It also overwrites any preexisting file with the same name in the current directory. Provide this name after either of the previous input arguments.
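For example, to save the generated function under a specific name (here matching the file name used in the first example below):

% Generate the reward function and save it to myMpcRewardFcn.m
generateRewardFunction(mpcobj,'FunctionName','myMpcRewardFcn')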
Examples
Generate a Reward Function from an MPC Object
This example shows how to generate a reinforcement learning reward function from an MPC object.
Define Plant and Create MPC Controller
Create a random plant using the rss function and set the feedthrough matrix to zero.
plant = rss(4,3,2);
plant.d = 0;
Specify which of the plant signals are manipulated variables, measured disturbances, measured outputs and unmeasured outputs.
plant = setmpcsignals(plant,'MV',1,'MD',2,'MO',[1 2],'UO',3);
Create an MPC controller with a sample time of 0.1 and prediction and control horizons of 10 and 3 steps, respectively.
mpcobj = mpc(plant,0.1,10,3);
-->The "Weights.ManipulatedVariables" property is empty. Assuming default 0.00000.
-->The "Weights.ManipulatedVariablesRate" property is empty. Assuming default 0.10000.
-->The "Weights.OutputVariables" property is empty. Assuming default 1.00000.
   for output(s) y1 and zero weight for output(s) y2 y3
Set limits and a scale factor for the manipulated variable.
mpcobj.ManipulatedVariables.Min = -2;
mpcobj.ManipulatedVariables.Max = 2;
mpcobj.ManipulatedVariables.ScaleFactor = 4;
Set weights for the quadratic cost function.
mpcobj.Weights.OutputVariables = [10 1 0.1];
mpcobj.Weights.ManipulatedVariablesRate = 0.2;
Generate the Reward Function
Generate the reward function code from the specifications in the mpc object using generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)
For this example, the code is saved in the MATLAB function file myMpcRewardFcn.m. Display the generated reward function.
type myMpcRewardFcn.m
function reward = myMpcRewardFcn(y,refy,mv,refmv,lastmv)
% MYMPCREWARDFCN generates rewards from MPC specifications.
%
% Description of input arguments:
%
% y : Output variable from plant at step k+1
% refy : Reference output variable at step k+1
% mv : Manipulated variable at step k
% refmv : Reference manipulated variable at step k
% lastmv : Manipulated variable at step k-1
%
% Limitations (MPC and NLMPC):
% - Reward computed based on first step in prediction horizon.
%   Therefore, signal previewing and control horizon settings are ignored.
% - Online cost and constraint update is not supported.
% - Custom cost and constraint specifications are not considered.
% - Time varying cost weights and constraints are not supported.
% - Mixed constraint specifications are not considered (for the MPC case).

% Reinforcement Learning Toolbox
% 27-May-2021 14:47:29

%#codegen

%% Specifications from MPC object
% Standard linear bounds as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
ymin = [-Inf -Inf -Inf];
ymax = [Inf Inf Inf];
mvmin = -2;
mvmax = 2;
mvratemin = -Inf;
mvratemax = Inf;

% Scale factors as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
Sy = [1 1 1];
Smv = 4;

% Standard cost weights as specified in 'Weights' property
Qy = [10 1 0.1];
Qmv = 0;
Qmvrate = 0.2;

%% Compute cost
dy = (refy(:)-y(:)) ./ Sy';
dmv = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy = dy' * diag(Qy.^2) * dy;
Jmv = dmv' * diag(Qmv.^2) * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost = Jy + Jmv + Jmvrate;

%% Penalty function weight (specify nonnegative)
Wy = [1 1 1];
Wmv = 10;
Wmvrate = 10;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py = Wy * exteriorPenalty(y,ymin,ymax,'step');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'step');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'step');
Penalty = Py + Pmv + Pmvrate;

%% Compute reward
reward = -(Cost + Penalty);
end
The calculated reward depends only on the current values of the plant input and output signals and their reference values, and it is composed of two parts.
The first is a negative cost that depends on the squared difference between desired and current plant inputs and outputs. This part uses the cost function weights specified in the MPC object. The second part is a penalty that acts as a negative reward whenever the current plant signals violate the constraints.
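Restated from the generated code above (this is a summary of that computation, not an additional specification), with $Q$ the cost weights, $S$ the scale factors, and $W$ the penalty weights:

$$
\mathrm{reward} = -\left( J_y + J_{mv} + J_{\Delta mv} \right) - \left( P_y + P_{mv} + P_{\Delta mv} \right),
\qquad
J_y = \sum_i Q_{y,i}^2 \left( \frac{y_i^{\mathrm{ref}} - y_i}{S_{y,i}} \right)^2
$$

with $J_{mv}$ and $J_{\Delta mv}$ defined analogously for the manipulated variables and their rates, and each penalty term of the form $P = W \cdot \mathrm{exteriorPenalty}(\cdot)$.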
The generated reward function is a starting point for reward design. You can tune the weights or use a different penalty function to define a more appropriate reward for your reinforcement learning agent.
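For instance, a minimal sketch of one such adjustment inside the generated myMpcRewardFcn.m, switching the output penalty from the step method to the quadratic method and raising its weight (the values here are illustrative, not a recommendation):

% Heavier, smoother penalty on output bound violations
Wy = [10 10 10];                                    % illustrative weights
Py = Wy * exteriorPenalty(y,ymin,ymax,'quadratic'); % grows with violation size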
Generate Reward Function from Verification Block
This example shows how to generate a reinforcement learning reward function from a Simulink Design Optimization model verification block.
For this example, open the Simulink model LevelCheckBlock.slx, which contains a Check Step Response Characteristics block named Level Check.
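You can open the model programmatically, assuming LevelCheckBlock.slx is on the MATLAB path:

open_system('LevelCheckBlock')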

![{"String":"Figure Check Step Response Characteristics [1] - Level Check contains an axes object and other objects of type uiflowcontainer, uimenu, uitoolbar. The axes object contains 9 objects of type patch, line.","Tex":[],"LaTex":[]}](https://ww2.mathworks.cn/help/examples/rl/win64/GenerateARewardFunctionFromAVerificationBlockExample_02.png)
Generate the reward function code from the specifications in the Level Check block using generateRewardFunction. The code is displayed in the MATLAB Editor.
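A sketch of the call for this model, assuming the 'FunctionName' option is used to produce the file name shown below:

blk = 'LevelCheckBlock/Level Check';
generateRewardFunction(blk,'FunctionName','myBlockRewardFcn')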
For this example, the code is saved in the MATLAB function file myBlockRewardFcn.m.
Display the generated reward function.
type myBlockRewardFcn.m
function reward = myBlockRewardFcn(x,t)
% MYBLOCKREWARDFCN generates rewards from Simulink block specifications.
%
% x : Input of LevelCheckBlock/Level Check
% t : Simulation time (s)

% Reinforcement Learning Toolbox
% 27-May-2021 16:45:27

%#codegen

%% Specifications from LevelCheckBlock/Level Check
Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;

if t >= Block1_StepTime
    if Block1_InitialValue <= Block1_FinalValue
        Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettling,Block1_MaxSettling];
        Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling,Block1_MinSettling];
    else
        Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling,Block1_MinSettling];
        Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettling,Block1_MaxSettling];
    end

    Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
    for idx = 1:numel(Block1_xmax)
        tseg = Block1_UpperBoundTimes(idx,:);
        xseg = Block1_UpperBoundAmplitudes(idx,:);
        Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmax))
        Block1_xmax = Inf;
    else
        Block1_xmax = max(Block1_xmax,[],'omitnan');
    end

    Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
    for idx = 1:numel(Block1_xmin)
        tseg = Block1_LowerBoundTimes(idx,:);
        xseg = Block1_LowerBoundAmplitudes(idx,:);
        Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmin))
        Block1_xmin = -Inf;
    else
        Block1_xmin = max(Block1_xmin,[],'omitnan');
    end
else
    Block1_xmin = -Inf;
    Block1_xmax = Inf;
end

%% Penalty function weight (specify nonnegative)
Weight = 1;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'step'));

%% Compute reward
reward = -Weight * Penalty;
end
The generated reward function takes as input arguments the current value of the verification block input signals and the simulation time. A negative reward is calculated using a weighted penalty that acts whenever the current block input signals violate the linear bound constraints defined in the verification block.
The generated reward function is a starting point for reward design. You can tune the weights or use a different penalty function to define a more appropriate reward for your reinforcement learning agent.
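As a quick sanity check, you can evaluate the generated function directly; for example, with an illustrative input value and time:

% At t = 3 s the signal value 3 exceeds the upper bound (the maximum
% overshoot level is 2.1), so the function returns a negative reward
r = myBlockRewardFcn(3,3)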
Close the Simulink model.
close_system('LevelCheckBlock')
Input Arguments
mpcobj — Linear or nonlinear MPC object
mpc object | nlmpc object

Linear or nonlinear MPC object, specified as an mpc (Model Predictive Control Toolbox) object or an nlmpc (Model Predictive Control Toolbox) object, respectively.
Note that:

- The generated function calculates rewards using signal values at the current time only. Predicted future values, signal previewing, and control horizon settings are not used in the reward calculation.
- Using time-varying cost weights and constraints, or updating them online, is not supported.
- Only the standard quadratic cost function, as described in Optimization Problem (Model Predictive Control Toolbox), is supported. Therefore, for mpc objects, using mixed constraint specifications is not supported. Similarly, for nlmpc objects, custom cost and constraint specifications are not supported.
Example: mpc(tf([1 1],[1 2 0]),0.1)
blks — Path to model verification blocks
char array | cell array | string array

Path to model verification blocks, specified as a character array, cell array, or string array. Supported Simulink Design Optimization model verification blocks include Check Step Response Characteristics and Check Against Reference.
The generated reward function takes as input arguments the current value of the verification block input signals and the simulation time. A negative reward is calculated using a weighted penalty that acts whenever the current block input signals violate the linear bound constraints defined in the verification block.
Example: "mySimulinkModel02/Check Against Reference"
myFcnName — Function name
string | char vector

Function name, specified as a string or character vector.
Example: "reward03epf_step"
Tips
By default, the exterior bound penalty function exteriorPenalty is used to calculate the penalty. Alternatively, to calculate hyperbolic and barrier penalties, you can use the hyperbolicPenalty or barrierPenalty functions.
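As a quick comparison, you can evaluate these functions directly on a scalar; a minimal sketch (the hyperbolicPenalty and barrierPenalty calls shown use their default shape parameters):

% Penalty for a value outside the bounds [-1, 1]
x = 1.5;
pStep = exteriorPenalty(x,-1,1,'step')      % flat penalty for any violation
pQuad = exteriorPenalty(x,-1,1,'quadratic') % grows with the violation size
pHyp = hyperbolicPenalty(x,-1,1)            % hyperbolic penalty
pBar = barrierPenalty(x,-1,1)               % barrier penalty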
Version History