from ddopai.envs.inventory.single_period import NewsvendorEnv
from ddopai.dataloaders.tabular import XYDataLoader
from ddopai.agents.basic import RandomAgent
Experiment functions
EarlyStoppingHandler
EarlyStoppingHandler (patience:int=50, warmup:int=100, criteria:str='J', direction:str='max')
Class to handle early stopping during experiments. The EarlyStoppingHandler calculates the average score over the last “patience” epochs and compares it to the average score over the previous “patience” epochs. Note that here one epoch is defined as the time between evaluations on the validation set; for supervised learning, one epoch is typically one pass through the training data. For reinforcement learning, fewer than one, exactly one, or many episodes may be played in the training environment between evaluation epochs.
| | Type | Default | Details |
|---|---|---|---|
| patience | int | 50 | Number of epochs to evaluate for stopping |
| warmup | int | 100 | How many initial epochs to wait before evaluating |
| criteria | str | J | Whether to use discounted rewards J or total rewards R as criteria |
| direction | str | max | Whether reward shall be maximized or minimized |
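A rough illustration of the stopping rule described above (not the library’s internal implementation): with direction=“max”, the experiment stops once the mean score over the last `patience` epochs no longer improves on the mean over the `patience` epochs before that, and no stopping decision is made during the first `warmup` epochs.

```python
# Illustration only of the rule described above; the actual EarlyStoppingHandler
# implementation may differ in details.
def should_stop(scores, patience=50, warmup=100, direction="max"):
    if len(scores) < warmup or len(scores) < 2 * patience:
        return False
    recent = sum(scores[-patience:]) / patience                 # mean over the last `patience` epochs
    previous = sum(scores[-2 * patience:-patience]) / patience  # mean over the `patience` epochs before
    return recent <= previous if direction == "max" else recent >= previous
```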
EarlyStoppingHandler.add_result
EarlyStoppingHandler.add_result (J:float, R:float)
Add the result of the last epoch to the history and check if the experiment should be stopped.
| | Type | Details |
|---|---|---|
| J | float | Return (discounted rewards) of the last epoch |
| R | float | Total rewards of the last epoch |
| Returns | bool | |
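A hedged usage sketch: after each validation epoch, pass that epoch’s discounted and total rewards to `add_result()` and stop training once it returns `True`. The J/R values below are dummy placeholders for real validation scores.

```python
# Sketch only: J_val and R_val are placeholder validation scores per epoch.
handler = EarlyStoppingHandler(patience=50, warmup=100, criteria="J", direction="max")

n_epochs = 500
for epoch in range(n_epochs):
    # ... train for one epoch, then evaluate on the validation set ...
    J_val, R_val = -10.0, -12.0   # placeholder validation scores for this epoch
    if handler.add_result(J=J_val, R=R_val):
        break  # stop the experiment early
```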
Helper functions
Some functions that are needed to run an experiment
save_agent
save_agent (agent:ddopai.agents.base.BaseAgent, experiment_dir:str, save_best:bool, R:float, J:float, best_R:float, best_J:float, criteria:str='J', force_save=False)
Save the agent if it has improved either R or J (depending on the criteria argument) compared to the previous epochs.
| | Type | Default | Details |
|---|---|---|---|
| agent | BaseAgent | | Any agent inheriting from BaseAgent |
| experiment_dir | str | | Directory to save the agent |
| save_best | bool | | |
| R | float | | |
| J | float | | |
| best_R | float | | |
| best_J | float | | |
| criteria | str | J | |
| force_save | bool | False | |
update_best
update_best (R:float, J:float, best_R:float, best_J:float)
Update the best total rewards R and the best discounted rewards J.
| | Type | Details |
|---|---|---|
| R | float | |
| J | float | |
| best_R | float | |
| best_J | float | |
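A hedged sketch of how the two helpers might be combined at the end of a validation epoch. All values are placeholders, and it is assumed (as the name and docstring suggest, though not stated explicitly above) that `update_best` returns the updated (best_R, best_J) pair.

```python
# Sketch only: persist the agent if the validation score improved, then update
# the running best values for the next epoch. `agent` stands in for any agent
# inheriting from BaseAgent; the directory and scores are placeholders.
best_R, best_J = float("-inf"), float("-inf")

R_val, J_val = -12.0, -10.0            # placeholder validation scores of the current epoch
save_agent(
    agent=agent,
    experiment_dir="results/my_run",   # placeholder directory
    save_best=True,
    R=R_val, J=J_val,
    best_R=best_R, best_J=best_J,
    criteria="J",                      # compare on discounted rewards
)
best_R, best_J = update_best(R_val, J_val, best_R, best_J)
```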
log_info
log_info (R:float, J:float, n_epochs:int, tracking:Literal['wandb'], mode:Literal['train','val','test'])
Logs the same R, J information repeatedly for n_epochs. E.g., to draw a straight line in wandb for algorithms such as XGB, RF, etc. that can be compared to the learning curves of supervised or reinforcement learning algorithms.
| | Type | Details |
|---|---|---|
| R | float | |
| J | float | |
| n_epochs | int | |
| tracking | Literal | only wandb implemented so far |
| mode | Literal | |
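For example, to draw a flat baseline in wandb for a one-shot model (such as XGBoost) next to the learning curves of iteratively trained agents, one might call (values are placeholders):

```python
# Sketch: repeat the final validation scores of a one-shot model for 100 epochs
# so that wandb shows them as a horizontal baseline.
log_info(R=-12.0, J=-10.0, n_epochs=100, tracking="wandb", mode="val")
```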
calculate_score
calculate_score (dataset:List, env:ddopai.envs.base.BaseEnvironment)
Calculate the total rewards R and the discounted rewards J of a dataset.
| | Type | Details |
|---|---|---|
| dataset | List | |
| env | BaseEnvironment | Any environment inheriting from BaseEnvironment |
| Returns | Tuple | |
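Conceptually, R is the undiscounted sum of the per-step rewards in the dataset and J is the same sum discounted by the environment’s gamma. A rough illustration of what the returned pair represents (not the library’s code, and assuming one reward per step):

```python
# Illustration only: R is the plain sum of rewards, J the gamma-discounted sum.
rewards = [0.5, -1.2, 0.3]   # placeholder per-step rewards of one episode
gamma = 0.999                # typically taken from the environment
R = sum(rewards)
J = sum(gamma**t * r for t, r in enumerate(rewards))
```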
Experiment functions
Functions to run experiments
run_experiment
run_experiment (agent:ddopai.agents.base.BaseAgent, env:ddopai.envs.base.BaseEnvironment, n_epochs:int, n_steps:int=None, early_stopping_handler:Optional[__main__.EarlyStoppingHandler]=None, save_best:bool=True, performance_criterion:str='J', tracking:Optional[str]=None, results_dir:str='results', run_id:Optional[str]=None, print_freq:int=10, eval_step_info=False, return_score=False)
Run an experiment with the given agent and environment for n_epochs. It automatically detects whether the train mode of the agent is direct_fit, epochs_fit, or env_interaction and runs the experiment accordingly.
| | Type | Default | Details |
|---|---|---|---|
| agent | BaseAgent | | |
| env | BaseEnvironment | | |
| n_epochs | int | | |
| n_steps | int | None | Number of steps to interact with the environment per epoch. Will be ignored for direct_fit and epochs_fit agents |
| early_stopping_handler | Optional | None | |
| save_best | bool | True | |
| performance_criterion | str | J | other: “R” |
| tracking | Optional | None | other: “wandb” |
| results_dir | str | results | |
| run_id | Optional | None | |
| print_freq | int | 10 | |
| eval_step_info | bool | False | |
| return_score | bool | False | |
Important notes on running experiments
Training mode:

- Agents have a training mode of either `direct_fit`, `epochs_fit`, or `env_interaction`.
- `direct_fit` means that agents are fitted with a single call to the fit method, providing the full X and Y dataset.
- `epochs_fit` means that agents are trained iteratively over epochs. It is assumed that they then have access to the dataloader.

Train, val, test mode:

- The function always sets the agent and environment to the appropriate dataset mode (and therefore, indirectly, the dataloader via the environment).

Early stopping:

- Can optionally be applied for `epochs_fit` and `env_interaction` agents.

Save best agent:

- The `save_agent()` function, given the `save_best` param is `True`, will save the best agent based on the validation score.
- At test time at a later point, one can then load the best agent and evaluate it on the test set (not done automatically by this function).

Logging:

- By setting `tracking` to `"wandb"`, the function will log J and R to wandb.
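Putting these pieces together, a minimal, hedged sketch of a typical call: `my_agent` stands in for any agent inheriting from BaseAgent, `environment` is assumed to be set up as in the `test_agent()` example below, and all argument values are purely illustrative.

```python
# Sketch only: `my_agent` is a placeholder agent and `environment` is assumed
# to be an already constructed environment (see the test_agent() example below).
early_stopping_handler = EarlyStoppingHandler(patience=50, warmup=100, criteria="J", direction="max")

run_experiment(
    agent=my_agent,
    env=environment,
    n_epochs=100,                                   # number of evaluation epochs
    n_steps=1000,                                   # only relevant for env_interaction agents
    early_stopping_handler=early_stopping_handler,  # optional, for epochs_fit/env_interaction agents
    save_best=True,                                 # checkpoint the best agent based on the validation score
    performance_criterion="J",                      # or "R"
    tracking=None,                                  # set to "wandb" to log J and R
    results_dir="results",
    print_freq=10,
)
```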
test_agent
test_agent (agent:ddopai.agents.base.BaseAgent, env:ddopai.envs.base.BaseEnvironment, return_dataset=False, save_features=False, tracking=None, eval_step_info=False)
Tests the agent on the environment for a single episode
| | Type | Default | Details |
|---|---|---|---|
| agent | BaseAgent | | |
| env | BaseEnvironment | | |
| return_dataset | bool | False | |
| save_features | bool | False | |
| tracking | NoneType | None | other: “wandb” |
| eval_step_info | bool | False | |
run_test_episode
run_test_episode (env:ddopai.envs.base.BaseEnvironment, agent:ddopai.agents.base.BaseAgent, eval_step_info:bool=False, save_features:bool=False)
Runs an episode to test the agent’s performance. It assumes that the agent and environment are initialized, are in test/val mode, and have been reset.
| | Type | Default | Details |
|---|---|---|---|
| env | BaseEnvironment | | Any environment inheriting from BaseEnvironment |
| agent | BaseAgent | | Any agent inheriting from BaseAgent |
| eval_step_info | bool | False | Print step info during evaluation |
| save_features | bool | False | Save features (observations) of the dataset. Can be turned off since they sometimes become very large when many lags are included |
Usage example for `test_agent()`:
import numpy as np  # needed for the random example data below

val_index_start = 80 #90_000
test_index_start = 90 #100_000

X = np.random.rand(100, 2)
Y = np.random.rand(100, 1)

dataloader = XYDataLoader(X, Y, val_index_start, test_index_start)

environment = NewsvendorEnv(
    dataloader = dataloader,
    underage_cost = 0.42857,
    overage_cost = 1.0,
    gamma = 0.999,
    horizon_train = 365,
)

agent = RandomAgent(environment.mdp_info)

environment.test()
R, J = test_agent(agent, environment)
print(f"R: {R}, J: {J}")
R: -7.269816766556392, J: -7.236762453375597