SAC agents

Soft Actor-Critic (SAC) based agents


SACBaseAgent

 SACBaseAgent (environment_info:ddopai.utils.MDPInfo,
               learning_rate_actor:float=0.0003,
               learning_rate_critic:float|None=None,
               initial_replay_size:int=64, max_replay_size:int=50000,
               batch_size:int=64, warmup_transitions:int=100,
               lr_alpha:float=0.0003, tau:float=0.005,
               log_std_min:float=-20.0, log_std_max:float=2.0,
               use_log_alpha_loss=False, target_entropy:float|None=None,
               drop_prob:float=0.0, batch_norm:bool=False,
               init_method:str='xavier_uniform', optimizer:str='Adam',
               loss:str='MSE', obsprocessors:list|None=None,
               device:str='cpu', agent_name:str|None='SAC',
               network_actor_mu_params:dict=None,
               network_actor_sigma_params:dict=None,
               network_critic_params:dict=None)

Base agent for the Soft Actor-Critic (SAC) algorithm.

Type Default Details
environment_info MDPInfo
learning_rate_actor float 0.0003
learning_rate_critic float | None None If None, it is set to learning_rate_actor
initial_replay_size int 64
max_replay_size int 50000
batch_size int 64
warmup_transitions int 100
lr_alpha float 0.0003
tau float 0.005
log_std_min float -20.0
log_std_max float 2.0
use_log_alpha_loss bool False
target_entropy float | None None
drop_prob float 0.0
batch_norm bool False
init_method str xavier_uniform “xavier_uniform”, “xavier_normal”, “he_normal”, “he_uniform”, “normal”, “uniform”
optimizer str Adam “Adam” or “SGD” or “RMSprop”
loss str MSE currently only MSE is supported
obsprocessors list | None None default: []
device str cpu “cuda” or “cpu”
agent_name str | None SAC
network_actor_mu_params dict None
network_actor_sigma_params dict None
network_critic_params dict None
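Two groups of parameters in the table, tau and the log_std_min/log_std_max pair, correspond to standard SAC mechanics. The NumPy sketch below illustrates both under stated assumptions: that tau drives a Polyak (soft) update of the target critics, and that the actor outputs a mean and a log standard deviation that is clipped to [log_std_min, log_std_max] before sampling. It is not ddopai's implementation; the function names are hypothetical, and it leaves out any tanh squashing or rescaling to the action bounds that the agent may apply.

```python
import numpy as np


def soft_update(target, online, tau=0.005):
    """Polyak averaging: move the target parameters a fraction tau toward the online ones."""
    return tau * online + (1.0 - tau) * target


def sample_gaussian_action(mu, log_std, rng, log_std_min=-20.0, log_std_max=2.0):
    """Reparameterized sample from a diagonal Gaussian whose log-std is clamped."""
    log_std = np.clip(log_std, log_std_min, log_std_max)
    eps = rng.standard_normal(np.shape(mu))
    action = mu + np.exp(log_std) * eps
    # Log-density of the sample under N(mu, exp(log_std)**2), summed over action dimensions
    log_prob = np.sum(-0.5 * eps**2 - log_std - 0.5 * np.log(2.0 * np.pi), axis=-1)
    return action, log_prob


rng = np.random.default_rng(0)

# After 100 soft updates toward a constant online network, the target has moved
# 1 - (1 - tau)**100 of the way, i.e. roughly 39% for tau = 0.005.
target = np.zeros(3)
for _ in range(100):
    target = soft_update(target, np.ones(3), tau=0.005)

# A very negative log-std is clipped to log_std_min, so the sample stays essentially at mu.
action, log_prob = sample_gaussian_action(np.zeros(1), np.array([-50.0]), rng)
```

A small tau makes the target critics change slowly, which stabilizes the bootstrapped Q-targets; the log-std bounds keep the policy from collapsing to a near-deterministic or exploding distribution.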


SACAgent

 SACAgent (environment_info:ddopai.utils.MDPInfo, hidden_layers:List=None,
           activation:str='relu', learning_rate_actor:float=0.0003,
           learning_rate_critic:float|None=None,
           initial_replay_size:int=64, max_replay_size:int=50000,
           batch_size:int=64, warmup_transitions:int=100,
           lr_alpha:float=0.0003, tau:float=0.005,
           log_std_min:float=-20.0, log_std_max:float=2.0,
           use_log_alpha_loss=False, target_entropy:float|None=None,
           drop_prob:float=0.0, batch_norm:bool=False,
           init_method:str='xavier_uniform', optimizer:str='Adam',
           loss:str='MSE', obsprocessors:list|None=None, device:str='cpu',
           agent_name:str|None='SAC', observation_space_shape=None,
           action_space_shape=None)

SAC agent whose actor and critic are multilayer perceptrons (MLPs).

Type Default Details
environment_info MDPInfo
hidden_layers List None if None, then default is [64, 64]
activation str relu “relu”, “sigmoid”, “tanh”, “leakyrelu”, “elu”
learning_rate_actor float 0.0003
learning_rate_critic float | None None If None, it is set to learning_rate_actor
initial_replay_size int 64
max_replay_size int 50000
batch_size int 64
warmup_transitions int 100
lr_alpha float 0.0003
tau float 0.005
log_std_min float -20.0
log_std_max float 2.0
use_log_alpha_loss bool False
target_entropy float | None None
drop_prob float 0.0
batch_norm bool False
init_method str xavier_uniform “xavier_uniform”, “xavier_normal”, “he_normal”, “he_uniform”, “normal”, “uniform”
optimizer str Adam “Adam” or “SGD” or “RMSprop”
loss str MSE currently only MSE is supported
obsprocessors list | None None default: []
device str cpu “cuda” or “cpu”
agent_name str | None SAC
observation_space_shape NoneType None only needed when it cannot be inferred from environment_info (e.g., for dict spaces)
action_space_shape NoneType None only needed when it cannot be inferred from environment_info (e.g., for dict spaces)
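The usage example that follows configures NewsvendorEnv with underage_cost=0.42857 and overage_cost=1.0. In the textbook newsvendor model, ordering q against demand d costs cu per unit of unmet demand and co per unit left over, and the cost-minimizing order is the cu / (cu + co) quantile of demand, here about 0.3. The sketch below assumes the environment's reward is the negative of this single-period cost; ddopai's exact reward definition is not shown on this page, and the function names are hypothetical.

```python
def critical_ratio(underage_cost: float, overage_cost: float) -> float:
    """Optimal service level of the classic newsvendor: cu / (cu + co)."""
    return underage_cost / (underage_cost + overage_cost)


def newsvendor_cost(order: float, demand: float, underage_cost: float, overage_cost: float) -> float:
    """Single-period cost: cu per unit short, co per unit in excess."""
    return underage_cost * max(demand - order, 0.0) + overage_cost * max(order - demand, 0.0)


ratio = critical_ratio(0.42857, 1.0)  # close to 0.3
over = newsvendor_cost(1.0, 0.5, 0.42857, 1.0)   # 0.5 units left over
under = newsvendor_cost(0.2, 0.5, 0.42857, 1.0)  # 0.3 units short
```

Because overage is penalized more than underage in this configuration, a well-trained agent should order below the median demand.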
import numpy as np
from ddopai.envs.inventory.single_period import NewsvendorEnv
from ddopai.dataloaders.tabular import XYDataLoader
from ddopai.experiments.experiment_functions import run_experiment, test_agent
val_index_start = 8000 #90_000
test_index_start = 9000 #100_000

X = np.random.standard_normal((10000, 2))
Y = np.random.standard_normal((10000, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)
Y = X[:,0].reshape(-1, 1)  # overwrites the two lines above: the target depends only on the first feature
# truncate Y at 0:
Y = np.maximum(Y, 0)
# normalize Y max to 1
Y = Y/np.max(Y)

# print(np.max(Y))
# print(X.shape, Y.shape)

clip_action = ClipAction(0., 1.)

dataloader = XYDataLoader(X, Y, val_index_start, test_index_start, lag_window_params =  {'lag_window': 0, 'include_y': False, 'pre_calc': True})

environment = NewsvendorEnv(
    dataloader = dataloader,
    underage_cost = 0.42857,
    overage_cost = 1.0,
    gamma = 0.999,
    horizon_train = 365,
    q_bound_high = 1.0,
    q_bound_low = -0.1,
    postprocessors = [clip_action],
)

agent = SACAgent(environment.mdp_info,
                 obsprocessors=None,  # default: []
                 device="cpu",  # "cuda" or "cpu"
)

environment.test()
agent.eval()

R, J = test_agent(agent, environment)

print(R, J)

environment.train()
agent.train()
environment.print=False

# run_experiment(agent, environment, n_epochs=50, n_steps=1000, run_id = "test", save_best=True, print_freq=1) # fit agent via run_experiment function

environment.test()
agent.eval()

R, J = test_agent(agent, environment)

print(R, J)
INFO:root:Actor network (mu network):
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
MLPActor                                 [1, 1]                    --
├─Sequential: 1-1                        [1, 1]                    --
│    └─Linear: 2-1                       [1, 64]                   192
│    └─ReLU: 2-2                         [1, 64]                   --
│    └─Dropout: 2-3                      [1, 64]                   --
│    └─Linear: 2-4                       [1, 64]                   4,160
│    └─ReLU: 2-5                         [1, 64]                   --
│    └─Dropout: 2-6                      [1, 64]                   --
│    └─Linear: 2-7                       [1, 1]                    65
│    └─Identity: 2-8                     [1, 1]                    --
==========================================================================================
Total params: 4,417
Trainable params: 4,417
Non-trainable params: 0
Total mult-adds (M): 0.00
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.02
Estimated Total Size (MB): 0.02
==========================================================================================
INFO:root:################################################################################
INFO:root:Critic network:
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
MLPStateAction                           --                        --
├─Sequential: 1-1                        [1, 1]                    --
│    └─Linear: 2-1                       [1, 64]                   256
│    └─ReLU: 2-2                         [1, 64]                   --
│    └─Dropout: 2-3                      [1, 64]                   --
│    └─Linear: 2-4                       [1, 64]                   4,160
│    └─ReLU: 2-5                         [1, 64]                   --
│    └─Dropout: 2-6                      [1, 64]                   --
│    └─Linear: 2-7                       [1, 1]                    65
│    └─Identity: 2-8                     [1, 1]                    --
==========================================================================================
Total params: 4,481
Trainable params: 4,481
Non-trainable params: 0
Total mult-adds (M): 0.00
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.02
Estimated Total Size (MB): 0.02
==========================================================================================
-245.3059010258002 -154.16627214771364
-245.3059010258002 -154.16627214771364
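The parameter counts in the two summaries above follow directly from the default hidden_layers=[64, 64]: each Linear layer has n_in * n_out weights plus n_out biases, while ReLU, Dropout, and Identity add none. The sketch below reproduces them, assuming the actor sees the 2 features produced by lag_window=0 and include_y=False, and the critic sees those features concatenated with the 1-dimensional action; that input layout is inferred from the 192 and 256 parameters of the first layers, not taken from ddopai's source. The helper name is hypothetical.

```python
def mlp_param_count(input_dim: int, hidden_layers: list, output_dim: int = 1) -> int:
    """Weights plus biases of a fully connected network (activations and dropout add none)."""
    sizes = [input_dim, *hidden_layers, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


actor_params = mlp_param_count(2, [64, 64])       # 192 + 4,160 + 65
critic_params = mlp_param_count(2 + 1, [64, 64])  # 256 + 4,160 + 65
```

Both the 4,417 actor parameters and the 4,481 critic parameters match the torchinfo output; the difference of 64 is the extra action input feeding the first critic layer.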


SACRNNAgent

 SACRNNAgent (environment_info:ddopai.utils.MDPInfo,
              hidden_layers_RNN:int=1, num_hidden_units_RNN:int=64,
              RNN_cell:str='GRU', hidden_layers_MLP:List=None,
              hidden_layers_input_MLP:List=None, activation:str='relu',
              learning_rate_actor:float=0.0003,
              learning_rate_critic:float|None=None,
              initial_replay_size:int=64, max_replay_size:int=50000,
              batch_size:int=64, warmup_transitions:int=100,
              lr_alpha:float=0.0003, tau:float=0.005,
              log_std_min:float=-20.0, log_std_max:float=2.0,
              use_log_alpha_loss=False, target_entropy:float|None=None,
              drop_prob:float=0.0, batch_norm:bool=False,
              init_method:str='xavier_uniform', optimizer:str='Adam',
              loss:str='MSE', obsprocessors:list|None=None,
              device:str='cpu', agent_name:str|None='SAC',
              observation_space_shape=None, action_space_shape=None)

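SAC agent whose actor and critic process the lagged observation sequence with a recurrent layer (GRU, LSTM, or plain RNN) followed by an MLP.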

Type Default Details
environment_info MDPInfo
hidden_layers_RNN int 1 Number of RNN layers at the start of the network
num_hidden_units_RNN int 64 Number of hidden units in each RNN layer
RNN_cell str GRU “LSTM”, “GRU”, “RNN”
hidden_layers_MLP List None Hidden layers of the MLP after the RNN; if None, defaults to [64, 64]
hidden_layers_input_MLP List None Hidden layers of an MLP for non-time features; default is None
activation str relu “relu”, “sigmoid”, “tanh”, “leakyrelu”, “elu”
learning_rate_actor float 0.0003
learning_rate_critic float | None None If None, it is set to learning_rate_actor
initial_replay_size int 64
max_replay_size int 50000
batch_size int 64
warmup_transitions int 100
lr_alpha float 0.0003
tau float 0.005
log_std_min float -20.0
log_std_max float 2.0
use_log_alpha_loss bool False
target_entropy float | None None
drop_prob float 0.0
batch_norm bool False
init_method str xavier_uniform “xavier_uniform”, “xavier_normal”, “he_normal”, “he_uniform”, “normal”, “uniform”
optimizer str Adam “Adam” or “SGD” or “RMSprop”
loss str MSE currently only MSE is supported
obsprocessors list | None None default: []
device str cpu “cuda” or “cpu”
agent_name str | None SAC
observation_space_shape NoneType None only needed when it cannot be inferred from environment_info (e.g., for dict spaces)
action_space_shape NoneType None only needed when it cannot be inferred from environment_info (e.g., for dict spaces)
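The recurrent agent consumes a sequence per decision rather than a flat vector. The usage example that follows sets lag_window=5 and include_y=True, so each observation covers 6 time steps of 2 features plus 1 past demand value. The NumPy sketch below shows one way to build such sequences; it is a hypothetical illustration, not how XYDataLoader is implemented, and it assumes that each step pairs X[t] with the previous target Y[t-1] so the demand being predicted never appears in its own input.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def lagged_sequences(X: np.ndarray, Y: np.ndarray, lag_window: int) -> np.ndarray:
    """Return sequences of shape (n_samples, lag_window + 1, n_x + n_y)."""
    # Hypothetical alignment: row t holds X[t] and the previous target Y[t-1]
    features = np.concatenate([X[1:], Y[:-1]], axis=1)
    windows = sliding_window_view(features, window_shape=lag_window + 1, axis=0)
    # sliding_window_view appends the window axis last; move it to the time position
    return np.moveaxis(windows, -1, 1)


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
Y = rng.standard_normal((20, 1))
seq = lagged_sequences(X, Y, lag_window=5)  # 6 time steps of 3 features each
```

The six time steps match the [1, 6, 64] output shape of the RNN block in the network summaries for this example.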
import numpy as np
from ddopai.envs.inventory.single_period import NewsvendorEnv
from ddopai.dataloaders.tabular import XYDataLoader
from ddopai.experiments.experiment_functions import run_experiment, test_agent
val_index_start = 8000 #90_000
test_index_start = 9000 #100_000

X = np.random.standard_normal((10000, 2))
Y = np.random.standard_normal((10000, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)
Y = X[:,0].reshape(-1, 1)  # overwrites the two lines above: the target depends only on the first feature
# truncate Y at 0:
Y = np.maximum(Y, 0)
# normalize Y max to 1
Y = Y/np.max(Y)

clip_action = ClipAction(0., 1.)

dataloader = XYDataLoader(X, Y, val_index_start, test_index_start, lag_window_params =  {'lag_window': 5, 'include_y': True, 'pre_calc': True})

environment = NewsvendorEnv(
    dataloader = dataloader,
    underage_cost = 0.42857,
    overage_cost = 1.0,
    gamma = 0.999,
    horizon_train = 365,
    q_bound_high = 1.0,
    q_bound_low = -0.1,
    postprocessors = [clip_action],
)

agent = SACRNNAgent(environment.mdp_info,
                    obsprocessors=None,  # default: []
                    device="cpu",  # "cuda" or "cpu"
)

environment.test()
agent.eval()

R, J = test_agent(agent, environment)

print(R, J)

environment.train()
agent.train()
environment.print=False

# run_experiment(agent, environment, n_epochs=50, n_steps=1000, run_id = "test", save_best=True, print_freq=1) # fit agent via run_experiment function

environment.test()
agent.eval()

R, J = test_agent(agent, environment)

print(R, J)
INFO:root:Actor network (mu network):
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
RNNActor                                 [1, 1]                    --
├─RNNMLPHybrid: 1-1                      [1, 1]                    --
│    └─Sequential: 2-1                   [1, 6, 64]                --
│    │    └─SpecificRNNWrapper: 3-1      [1, 6, 64]                13,248
│    │    └─ReLU: 3-2                    [1, 6, 64]                --
│    └─Sequential: 2-2                   [1, 1]                    --
│    │    └─Linear: 3-3                  [1, 64]                   4,160
│    │    └─ReLU: 3-4                    [1, 64]                   --
│    │    └─Dropout: 3-5                 [1, 64]                   --
│    │    └─Linear: 3-6                  [1, 64]                   4,160
│    │    └─ReLU: 3-7                    [1, 64]                   --
│    │    └─Dropout: 3-8                 [1, 64]                   --
│    │    └─Linear: 3-9                  [1, 1]                    65
==========================================================================================
Total params: 21,633
Trainable params: 21,633
Non-trainable params: 0
Total mult-adds (M): 0.09
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.09
Estimated Total Size (MB): 0.09
==========================================================================================
INFO:root:################################################################################
INFO:root:Critic network:
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
RNNStateAction                           --                        --
├─RNNMLPHybrid: 1-1                      [1, 1]                    --
│    └─Sequential: 2-1                   [1, 6, 64]                --
│    │    └─SpecificRNNWrapper: 3-1      [1, 6, 64]                13,248
│    │    └─ReLU: 3-2                    [1, 6, 64]                --
│    └─Sequential: 2-2                   [1, 1]                    --
│    │    └─Linear: 3-3                  [1, 64]                   4,224
│    │    └─ReLU: 3-4                    [1, 64]                   --
│    │    └─Dropout: 3-5                 [1, 64]                   --
│    │    └─Linear: 3-6                  [1, 64]                   4,160
│    │    └─ReLU: 3-7                    [1, 64]                   --
│    │    └─Dropout: 3-8                 [1, 64]                   --
│    │    └─Linear: 3-9                  [1, 1]                    65
==========================================================================================
Total params: 21,697
Trainable params: 21,697
Non-trainable params: 0
Total mult-adds (M): 0.09
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.09
Estimated Total Size (MB): 0.09
==========================================================================================
-383.1306977574299 -243.60956423506602
-383.1306977574299 -243.60956423506602
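The RNN summaries above can be reproduced from the standard parameter formula for a single-layer PyTorch RNN with both bias vectors: gates * hidden_size * (input_size + hidden_size + 2), with 1 gate block for a plain RNN, 3 for GRU, and 4 for LSTM. The sketch assumes SpecificRNNWrapper holds a torch.nn.GRU with input size 3 per time step (2 features plus 1 lagged demand), and that the critic appends the 1-dimensional action to the 64-dimensional RNN output before its MLP; both are inferred from the counts, not from ddopai's source. The helper names are hypothetical.

```python
def rnn_param_count(cell: str, input_size: int, hidden_size: int) -> int:
    """Single-layer RNN parameters: weight_ih, weight_hh, bias_ih, and bias_hh."""
    gates = {"RNN": 1, "GRU": 3, "LSTM": 4}[cell]
    return gates * hidden_size * (input_size + hidden_size + 2)


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


rnn = rnn_param_count("GRU", input_size=3, hidden_size=64)
actor_total = rnn + linear(64, 64) + linear(64, 64) + linear(64, 1)
critic_total = rnn + linear(64 + 1, 64) + linear(64, 64) + linear(64, 1)
```

This gives 13,248 parameters for the GRU, 21,633 for the actor, and 21,697 for the critic, matching the summaries; switching RNN_cell to "LSTM" would raise the recurrent block to 17,664 parameters under the same assumptions.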