SAC agents

Soft Actor-Critic (SAC) based agents



 SACBaseAgent (environment_info:ddopai.utils.MDPInfo,
               learning_rate_actor:float=0.0003,
               learning_rate_critic:float|None=None,
               initial_replay_size:int=64, max_replay_size:int=50000,
               batch_size:int=64, warmup_transitions:int=100,
               lr_alpha:float=0.0003, tau:float=0.005,
               log_std_min:float=-20.0, log_std_max:float=2.0,
               use_log_alpha_loss=False, target_entropy:float|None=None,
               drop_prob:float=0.0, batch_norm:bool=False,
               init_method:str='xavier_uniform', optimizer:str='Adam',
               loss:str='MSE', obsprocessors:list|None=None,
               device:str='cpu', agent_name:str|None='SAC',
               network_actor_mu_params:dict=None,
               network_actor_sigma_params:dict=None,
               network_critic_params:dict=None)

Base agent for the Soft Actor-Critic (SAC) algorithm.

| Parameter | Type | Default | Details |
|---|---|---|---|
| environment_info | MDPInfo | | |
| learning_rate_actor | float | 0.0003 | |
| learning_rate_critic | float \| None | None | If none, then it is set to learning_rate_actor |
| initial_replay_size | int | 64 | |
| max_replay_size | int | 50000 | |
| batch_size | int | 64 | |
| warmup_transitions | int | 100 | |
| lr_alpha | float | 0.0003 | |
| tau | float | 0.005 | |
| log_std_min | float | -20.0 | |
| log_std_max | float | 2.0 | |
| use_log_alpha_loss | bool | False | |
| target_entropy | float \| None | None | |
| drop_prob | float | 0.0 | |
| batch_norm | bool | False | |
| init_method | str | xavier_uniform | "xavier_uniform", "xavier_normal", "he_normal", "he_uniform", "normal", "uniform" |
| optimizer | str | Adam | "Adam" or "SGD" or "RMSprop" |
| loss | str | MSE | currently only MSE is supported |
| obsprocessors | list \| None | None | default: [] |
| device | str | cpu | "cuda" or "cpu" |
| agent_name | str \| None | SAC | |
| network_actor_mu_params | dict | None | |
| network_actor_sigma_params | dict | None | |
| network_critic_params | dict | None | |
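
Two of the hyperparameters above have a conventional meaning in SAC that is easy to show in code. The sketch below assumes the usual definitions: `tau` is the coefficient of the soft (Polyak) update of the target critic, and `log_std_min` / `log_std_max` clamp the actor's predicted log standard deviation before an action is sampled and squashed with tanh. The helper names are hypothetical and the code is plain Python for illustration, not ddopai's implementation.

```python
import math
import random


def soft_update(target, online, tau=0.005):
    """Move every target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


def sample_squashed_action(mu, log_std, log_std_min=-20.0, log_std_max=2.0):
    """Clamp log-std, sample from the Gaussian, squash the sample into [-1, 1]."""
    log_std = max(log_std_min, min(log_std_max, log_std))  # clamp to the allowed range
    std = math.exp(log_std)
    raw_action = random.gauss(mu, std)
    return math.tanh(raw_action)


# With tau = 0.005 the target network changes slowly after each update.
target_weights = soft_update(target=[0.0, 1.0], online=[1.0, 1.0])

# A very large predicted log-std is clamped to log_std_max before sampling.
action = sample_squashed_action(mu=0.3, log_std=50.0)
```

A small `tau` keeps the target critic close to its previous value, which is the usual reason for choosing values such as 0.005.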



 SACAgent (environment_info:ddopai.utils.MDPInfo, hidden_layers:List=None,
           activation:str='relu', learning_rate_actor:float=0.0003,
           learning_rate_critic:float|None=None,
           initial_replay_size:int=64, max_replay_size:int=50000,
           batch_size:int=64, warmup_transitions:int=100,
           lr_alpha:float=0.0003, tau:float=0.005,
           log_std_min:float=-20.0, log_std_max:float=2.0,
           use_log_alpha_loss=False, target_entropy:float|None=None,
           drop_prob:float=0.0, batch_norm:bool=False,
           init_method:str='xavier_uniform', optimizer:str='Adam',
           loss:str='MSE', obsprocessors:list|None=None, device:str='cpu',
           agent_name:str|None='SAC', observation_space_shape=None,
           action_space_shape=None)


| Parameter | Type | Default | Details |
|---|---|---|---|
| environment_info | MDPInfo | | |
| hidden_layers | List | None | if None, then default is [64, 64] |
| activation | str | relu | "relu", "sigmoid", "tanh", "leakyrelu", "elu" |
| learning_rate_actor | float | 0.0003 | |
| learning_rate_critic | float \| None | None | If none, then it is set to learning_rate_actor |
| initial_replay_size | int | 64 | |
| max_replay_size | int | 50000 | |
| batch_size | int | 64 | |
| warmup_transitions | int | 100 | |
| lr_alpha | float | 0.0003 | |
| tau | float | 0.005 | |
| log_std_min | float | -20.0 | |
| log_std_max | float | 2.0 | |
| use_log_alpha_loss | bool | False | |
| target_entropy | float \| None | None | |
| drop_prob | float | 0.0 | |
| batch_norm | bool | False | |
| init_method | str | xavier_uniform | "xavier_uniform", "xavier_normal", "he_normal", "he_uniform", "normal", "uniform" |
| optimizer | str | Adam | "Adam" or "SGD" or "RMSprop" |
| loss | str | MSE | currently only MSE is supported |
| obsprocessors | list \| None | None | default: [] |
| device | str | cpu | "cuda" or "cpu" |
| agent_name | str \| None | SAC | |
| observation_space_shape | NoneType | None | optional when it cannot be inferred from environment_info (e.g. for dict spaces) |
| action_space_shape | NoneType | None | optional when it cannot be inferred from environment_info (e.g. for dict spaces) |

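
The usage example for this agent runs on a `NewsvendorEnv` with `underage_cost = 0.42857` and `overage_cost = 1.0`. Assuming the environment charges the textbook newsvendor cost (the underage cost on every unit of unmet demand plus the overage cost on every unit of leftover stock; this is an assumption, since the environment's cost function is not shown here), the cost-minimizing order is the demand quantile at the critical ratio `cu / (cu + co)`, which is roughly 0.3 for these values. The helper names below are illustrative only.

```python
def newsvendor_cost(order, demand, underage_cost=0.42857, overage_cost=1.0):
    """Textbook single-period cost of ordering `order` when `demand` occurs."""
    shortage = max(demand - order, 0.0)   # units of unmet demand
    leftover = max(order - demand, 0.0)   # units of unsold stock
    return underage_cost * shortage + overage_cost * leftover


def critical_ratio(underage_cost=0.42857, overage_cost=1.0):
    """Quantile of the demand distribution that minimizes expected cost."""
    return underage_cost / (underage_cost + overage_cost)


def best_order(demands):
    """Brute force: the observed demand value with the lowest average cost."""
    def average_cost(order):
        return sum(newsvendor_cost(order, d) for d in demands) / len(demands)
    return min(demands, key=average_cost)


demands = [i / 10 for i in range(10)]   # 0.0, 0.1, ..., 0.9
ratio = critical_ratio()
order = best_order(demands)
print(round(ratio, 3), order)
```

Because overage is more expensive than underage here, a well-trained agent should learn to order below the median demand.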

import numpy as np

from ddopai.envs.inventory.single_period import NewsvendorEnv
from ddopai.dataloaders.tabular import XYDataLoader
from ddopai.experiments.experiment_functions import run_experiment, test_agent
# ClipAction and SACAgent must also be in scope; their import paths are not shown in this section.

val_index_start = 8000 #90_000
test_index_start = 9000 #100_000

# Synthetic data: two standard-normal features and a noisy linear target
X = np.random.standard_normal((10000, 2))
Y = np.random.standard_normal((10000, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)
# Note: the next line overwrites the target above, so demand depends on the first feature only
Y = X[:,0].reshape(-1, 1)
# truncate Y at 0:
Y = np.maximum(Y, 0)
# normalize Y max to 1
Y = Y/np.max(Y)

# Clip the agent's actions to the range [0, 1]
clip_action = ClipAction(0., 1.)

dataloader = XYDataLoader(X, Y, val_index_start, test_index_start, lag_window_params =  {'lag_window': 0, 'include_y': False, 'pre_calc': True})

environment = NewsvendorEnv(
    dataloader = dataloader,
    underage_cost = 0.42857,
    overage_cost = 1.0,
    gamma = 0.999,
    horizon_train = 365,
    q_bound_high = 1.0,
    q_bound_low = -0.1,
    postprocessors = [clip_action],
)

agent = SACAgent(environment.mdp_info,
                obsprocessors = None,      # default: []
                device="cpu", # "cuda" or "cpu"
)

# Evaluate the untrained agent
R, J = test_agent(agent, environment)

print(R, J)

# No training step runs in between, so the second evaluation reproduces the first
R, J = test_agent(agent, environment)

print(R, J)
INFO:root:Actor network (mu network):
Layer (type:depth-idx)                   Output Shape              Param #
MLPActor                                 [1, 1]                    --
├─Sequential: 1-1                        [1, 1]                    --
│    └─Linear: 2-1                       [1, 64]                   192
│    └─ReLU: 2-2                         [1, 64]                   --
│    └─Dropout: 2-3                      [1, 64]                   --
│    └─Linear: 2-4                       [1, 64]                   4,160
│    └─ReLU: 2-5                         [1, 64]                   --
│    └─Dropout: 2-6                      [1, 64]                   --
│    └─Linear: 2-7                       [1, 1]                    65
│    └─Identity: 2-8                     [1, 1]                    --
Total params: 4,417
Trainable params: 4,417
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.02
Estimated Total Size (MB): 0.02
INFO:root:Critic network:
Layer (type:depth-idx)                   Output Shape              Param #
MLPStateAction                           --                        --
├─Sequential: 1-1                        [1, 1]                    --
│    └─Linear: 2-1                       [1, 64]                   256
│    └─ReLU: 2-2                         [1, 64]                   --
│    └─Dropout: 2-3                      [1, 64]                   --
│    └─Linear: 2-4                       [1, 64]                   4,160
│    └─ReLU: 2-5                         [1, 64]                   --
│    └─Dropout: 2-6                      [1, 64]                   --
│    └─Linear: 2-7                       [1, 1]                    65
│    └─Identity: 2-8                     [1, 1]                    --
Total params: 4,481
Trainable params: 4,481
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.02
Estimated Total Size (MB): 0.02
-245.3059010258002 -154.16627214771364
-245.3059010258002 -154.16627214771364
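
The parameter counts in the two summaries above can be checked by hand. A fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The input widths used below (2 for the actor, 3 for the critic) are inferred from the first-layer counts of 192 and 256; they are consistent with two observation features and with the critic receiving the one-dimensional action as an extra input, which is an interpretation rather than something the log states.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(n_in, hidden_layers, n_out=1):
    """Total parameters of an MLP with the given hidden layer widths."""
    sizes = [n_in, *hidden_layers, n_out]
    return sum(linear_params(a, b) for a, b in zip(sizes, sizes[1:]))


actor_total = mlp_params(2, [64, 64])    # 192 + 4160 + 65
critic_total = mlp_params(3, [64, 64])   # 256 + 4160 + 65
print(actor_total, critic_total)         # 4417 4481
```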



 SACRNNAgent (environment_info:ddopai.utils.MDPInfo,
              hidden_layers_RNN:int=1, num_hidden_units_RNN:int=64,
              RNN_cell:str='GRU', hidden_layers_MLP:List=None,
              hidden_layers_input_MLP:List=None, activation:str='relu',
              learning_rate_actor:float=0.0003,
              learning_rate_critic:float|None=None,
              initial_replay_size:int=64, max_replay_size:int=50000,
              batch_size:int=64, warmup_transitions:int=100,
              lr_alpha:float=0.0003, tau:float=0.005,
              log_std_min:float=-20.0, log_std_max:float=2.0,
              use_log_alpha_loss=False, target_entropy:float|None=None,
              drop_prob:float=0.0, batch_norm:bool=False,
              init_method:str='xavier_uniform', optimizer:str='Adam',
              loss:str='MSE', obsprocessors:list|None=None,
              device:str='cpu', agent_name:str|None='SAC',
              observation_space_shape=None, action_space_shape=None)


| Parameter | Type | Default | Details |
|---|---|---|---|
| environment_info | MDPInfo | | |
| hidden_layers_RNN | int | 1 | Initial RNN layers |
| num_hidden_units_RNN | int | 64 | Initial number of hidden units in RNN layers |
| RNN_cell | str | GRU | "LSTM", "GRU", "RNN" |
| hidden_layers_MLP | List | None | MLP layers behind RNN: if None, then default is [64, 64] |
| hidden_layers_input_MLP | List | None | MLP layers for non-time features. Default is None |
| activation | str | relu | "relu", "sigmoid", "tanh", "leakyrelu", "elu" |
| learning_rate_actor | float | 0.0003 | |
| learning_rate_critic | float \| None | None | If none, then it is set to learning_rate_actor |
| initial_replay_size | int | 64 | |
| max_replay_size | int | 50000 | |
| batch_size | int | 64 | |
| warmup_transitions | int | 100 | |
| lr_alpha | float | 0.0003 | |
| tau | float | 0.005 | |
| log_std_min | float | -20.0 | |
| log_std_max | float | 2.0 | |
| use_log_alpha_loss | bool | False | |
| target_entropy | float \| None | None | |
| drop_prob | float | 0.0 | |
| batch_norm | bool | False | |
| init_method | str | xavier_uniform | "xavier_uniform", "xavier_normal", "he_normal", "he_uniform", "normal", "uniform" |
| optimizer | str | Adam | "Adam" or "SGD" or "RMSprop" |
| loss | str | MSE | currently only MSE is supported |
| obsprocessors | list \| None | None | default: [] |
| device | str | cpu | "cuda" or "cpu" |
| agent_name | str \| None | SAC | |
| observation_space_shape | NoneType | None | optional when it cannot be inferred from environment_info (e.g. for dict spaces) |
| action_space_shape | NoneType | None | optional when it cannot be inferred from environment_info (e.g. for dict spaces) |

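
The usage example for this agent sets `lag_window` to 5 and `include_y` to True, and the network summary it logs reports sequences of 6 time steps. A plausible reading is that each observation stacks the current step with its 5 predecessors, and that the demand column is appended to the two features, giving 3 values per step. The sketch below illustrates only that shape arithmetic with a hypothetical helper. It is not ddopai's windowing code, and how the demand column is aligned in time is not shown in this section.

```python
def make_windows(rows, lag_window):
    """Stack each row with its `lag_window` predecessors, oldest first."""
    return [rows[t - lag_window: t + 1] for t in range(lag_window, len(rows))]


# 8 time steps with 3 columns each (for example two features plus demand).
rows = [[float(t), float(10 * t), float(100 * t)] for t in range(8)]

windows = make_windows(rows, lag_window=5)
n_windows = len(windows)            # steps that have a full history
n_steps = len(windows[0])           # lag_window + 1
n_columns = len(windows[0][0])
print(n_windows, n_steps, n_columns)  # 3 6 3
```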

import numpy as np

from ddopai.envs.inventory.single_period import NewsvendorEnv
from ddopai.dataloaders.tabular import XYDataLoader
from ddopai.experiments.experiment_functions import run_experiment, test_agent
# ClipAction and SACRNNAgent must also be in scope; their import paths are not shown in this section.

val_index_start = 8000 #90_000
test_index_start = 9000 #100_000

# Synthetic data: two standard-normal features and a noisy linear target
X = np.random.standard_normal((10000, 2))
Y = np.random.standard_normal((10000, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)
# Note: the next line overwrites the target above, so demand depends on the first feature only
Y = X[:,0].reshape(-1, 1)
# truncate Y at 0:
Y = np.maximum(Y, 0)
# normalize Y max to 1
Y = Y/np.max(Y)

# Clip the agent's actions to the range [0, 1]
clip_action = ClipAction(0., 1.)

# Unlike the SACAgent example, this one uses a lag window of 5 and includes past demand
dataloader = XYDataLoader(X, Y, val_index_start, test_index_start, lag_window_params =  {'lag_window': 5, 'include_y': True, 'pre_calc': True})

environment = NewsvendorEnv(
    dataloader = dataloader,
    underage_cost = 0.42857,
    overage_cost = 1.0,
    gamma = 0.999,
    horizon_train = 365,
    q_bound_high = 1.0,
    q_bound_low = -0.1,
    postprocessors = [clip_action],
)

agent = SACRNNAgent(environment.mdp_info,
                obsprocessors = None,      # default: []
                device="cpu", # "cuda" or "cpu"
)

# Evaluate the untrained agent
R, J = test_agent(agent, environment)

print(R, J)

# No training step runs in between, so the second evaluation reproduces the first
R, J = test_agent(agent, environment)

print(R, J)
INFO:root:Actor network (mu network):
Layer (type:depth-idx)                   Output Shape              Param #
RNNActor                                 [1, 1]                    --
├─RNNMLPHybrid: 1-1                      [1, 1]                    --
│    └─Sequential: 2-1                   [1, 6, 64]                --
│    │    └─SpecificRNNWrapper: 3-1      [1, 6, 64]                13,248
│    │    └─ReLU: 3-2                    [1, 6, 64]                --
│    └─Sequential: 2-2                   [1, 1]                    --
│    │    └─Linear: 3-3                  [1, 64]                   4,160
│    │    └─ReLU: 3-4                    [1, 64]                   --
│    │    └─Dropout: 3-5                 [1, 64]                   --
│    │    └─Linear: 3-6                  [1, 64]                   4,160
│    │    └─ReLU: 3-7                    [1, 64]                   --
│    │    └─Dropout: 3-8                 [1, 64]                   --
│    │    └─Linear: 3-9                  [1, 1]                    65
Total params: 21,633
Trainable params: 21,633
Non-trainable params: 0
Total mult-adds (M): 0.09
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.09
Estimated Total Size (MB): 0.09
INFO:root:Critic network:
Layer (type:depth-idx)                   Output Shape              Param #
RNNStateAction                           --                        --
├─RNNMLPHybrid: 1-1                      [1, 1]                    --
│    └─Sequential: 2-1                   [1, 6, 64]                --
│    │    └─SpecificRNNWrapper: 3-1      [1, 6, 64]                13,248
│    │    └─ReLU: 3-2                    [1, 6, 64]                --
│    └─Sequential: 2-2                   [1, 1]                    --
│    │    └─Linear: 3-3                  [1, 64]                   4,224
│    │    └─ReLU: 3-4                    [1, 64]                   --
│    │    └─Dropout: 3-5                 [1, 64]                   --
│    │    └─Linear: 3-6                  [1, 64]                   4,160
│    │    └─ReLU: 3-7                    [1, 64]                   --
│    │    └─Dropout: 3-8                 [1, 64]                   --
│    │    └─Linear: 3-9                  [1, 1]                    65
Total params: 21,697
Trainable params: 21,697
Non-trainable params: 0
Total mult-adds (M): 0.09
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.09
Estimated Total Size (MB): 0.09
-383.1306977574299 -243.60956423506602
-383.1306977574299 -243.60956423506602
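
The parameter counts of the recurrent networks above can be reproduced in the same way. The sketch assumes the PyTorch GRU layout: three gates, each with an input weight matrix, a hidden weight matrix and two bias vectors. An input width of 3 per time step reproduces the logged 13,248 GRU parameters. The critic's first linear layer has 4,224 parameters, which matches 65 inputs, consistent with the action being concatenated to the 64 RNN outputs (an inference from the counts, not something the log states).

```python
def gru_params(n_in, n_hidden, n_layers=1):
    """Parameters of a stacked GRU following the PyTorch layout."""
    total = 0
    for layer in range(n_layers):
        layer_in = n_in if layer == 0 else n_hidden
        # 3 gates x (input weights + hidden weights + 2 bias vectors)
        total += 3 * (layer_in * n_hidden + n_hidden * n_hidden + 2 * n_hidden)
    return total


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


rnn = gru_params(n_in=3, n_hidden=64)    # 13248
head = linear_params(64, 64) + linear_params(64, 1)
actor_total = rnn + linear_params(64, 64) + head
critic_total = rnn + linear_params(65, 64) + head
print(rnn, actor_total, critic_total)    # 13248 21633 21697
```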