Tabular dataloaders

Dataloaders for tabular data

source

XYDataLoader

 XYDataLoader (X:numpy.ndarray, Y:numpy.ndarray,
               val_index_start:Optional[int]=None,
               test_index_start:Optional[int]=None,
               lag_window_params:dict=None, normalize_features:dict=None)

A class for datasets with the typical X, Y structure. Both X and Y are numpy arrays. X may be of shape (datapoints, features) or (datapoints, sequence_length, features) if lag features are used. The prep_lag_features method can be used to create those lag features. Y is of shape (datapoints, units).

|  | Type | Default | Details |
|---|------|---------|---------|
| X | ndarray |  |  |
| Y | ndarray |  |  |
| val_index_start | Optional | None |  |
| test_index_start | Optional | None |  |
| lag_window_params | dict | None | default: {'lag_window': 0, 'include_y': False, 'pre_calc': False} |
| normalize_features | dict | None | default: {'normalize': True, 'ignore_one_hot': True} |

source

XYDataLoader.prep_lag_features

 XYDataLoader.prep_lag_features (lag_window:int=0, include_y:bool=False,
                                 pre_calc:bool=False)

Create lag features for the dataset. If include_y is true, then a lag-1 of the target variable is added as a feature. If lag_window is > 0, the lag features are added as a middle dimension to X. Note that this means, e.g., that with a lag window of 1, the data will include 2 time steps: the current features including lag-1 demand, and the lag-1 features including lag-2 demand. If pre_calc is true, all these calculations are performed upfront on the entire dataset to reduce computation time later on, at the expense of increased memory usage.

|  | Type | Default | Details |
|---|------|---------|---------|
| lag_window | int | 0 | length of the lag window |
| include_y | bool | False | if lag demand shall be included as a feature |
| pre_calc | bool | False | if all lags are pre-calculated for the entire dataset |
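As a quick shape check, here is a minimal sketch (assuming lag features behave the same without a validation/test split): with `include_y=True` and a lag window of `w`, each sample gains a lagged-demand column and `w` extra time steps, while the first `w + 1` rows are dropped because their lags are undefined.

```python
import numpy as np

X = np.random.standard_normal((10, 2))
Y = np.random.standard_normal((10, 1))

# lag_window=1 and include_y=True, analogous to the lag-feature example further below
dataloader = XYDataLoader(X=X, Y=Y, lag_window_params={'lag_window': 1, 'include_y': True, 'pre_calc': True})

sample_X, sample_Y = dataloader[0]
print(sample_X.shape)   # (2, 3): 2 time steps, 2 features + 1 lagged-demand column
print(len(dataloader))  # presumably 8: the first lag_window + 1 = 2 rows are dropped
```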

source

XYDataLoader.__getitem__

 XYDataLoader.__getitem__ (idx)

Get item by index, depending on the dataset type (train, val, or test).


source

XYDataLoader.get_all_X

 XYDataLoader.get_all_X (dataset_type:str='train')

Returns the entire feature dataset: either the train, val, test, or all data.

|  | Type | Default | Details |
|---|------|---------|---------|
| dataset_type | str | train | can be 'train', 'val', 'test', 'all' |

source

XYDataLoader.get_all_Y

 XYDataLoader.get_all_Y (dataset_type:str='train')

Returns the entire target dataset: either the train, val, test, or all data.

|  | Type | Default | Details |
|---|------|---------|---------|
| dataset_type | str | train | can be 'train', 'val', 'test', 'all' |
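A short usage sketch for both methods (the exact row assignments are assumptions, following the split conventions from the examples below):

```python
import numpy as np

X = np.random.standard_normal((10, 2))
Y = np.random.standard_normal((10, 1))

dataloader = XYDataLoader(X=X, Y=Y, val_index_start=6, test_index_start=8)

X_train = dataloader.get_all_X("train")  # presumably rows 0..5
Y_val = dataloader.get_all_Y("val")      # presumably rows 6..7
X_all = dataloader.get_all_X("all")      # the full feature array

print(X_train.shape, Y_val.shape, X_all.shape)  # expected: (6, 2) (2, 1) (10, 2)
```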

Example usage of [`XYDataLoader`](https://opimwue.github.io/ddopai/10_dataloaders/tabular_dataloaders.html#xydataloader) for a simple dataset:

```python
import numpy as np

X = np.random.standard_normal((100, 2))
Y = np.random.standard_normal((100, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)

dataloader = XYDataLoader(X=X, Y=Y)

sample_X, sample_Y = dataloader[0]
print("sample:", sample_X, sample_Y)
print("sample shape Y:", sample_Y.shape)

print("length:", len(dataloader))
```

```
sample: [0.19586287 1.09162108] [1.040336]
sample shape Y: (1,)
length: 100
```
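Since no `val_index_start` or `test_index_start` is given here, the entire dataset is treated as training data, which is why `len(dataloader)` equals the number of datapoints.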

Example usage of [`XYDataLoader`](https://opimwue.github.io/ddopai/10_dataloaders/tabular_dataloaders.html#xydataloader) on how to handle train, val, and test sets:

```python
X = np.random.standard_normal((10, 2))
Y = np.random.standard_normal((10, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)

dataloader = XYDataLoader(X=X, Y=Y, val_index_start=6, test_index_start=8)

sample_X, sample_Y = dataloader[0]

print("length train:", dataloader.len_train, "length val:", dataloader.len_val, "length test:", dataloader.len_test)

print("")
print("### Data from train set ###")
for i in range(dataloader.len_train):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)

dataloader.val()

print("")
print("### Data from val set ###")
for i in range(dataloader.len_val):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)

dataloader.test()

print("")
print("### Data from test set ###")
for i in range(dataloader.len_test):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)

dataloader.train()

print("")
print("### Data from train set again ###")
for i in range(dataloader.len_train):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)
```

```
length train: 6 length val: 2 length test: 2

### Data from train set ###
idx: 0 data: [ 0.08854902 -1.7602724 ] [-5.34363735]
idx: 1 data: [ 0.99129486 -1.78646157] [-0.9519102]
idx: 2 data: [0.66334628 0.01231061] [0.95274982]
idx: 3 data: [ 0.61796118 -0.54523986] [0.35028762]
idx: 4 data: [1.04676734 1.75569924] [5.92952598]
idx: 5 data: [ 0.21987025 -0.53602459] [0.66207364]

### Data from val set ###
idx: 0 data: [-1.54514703 -0.67784998] [-5.27601525]
idx: 1 data: [ 0.935785   -1.30048604] [-2.66055254]

### Data from test set ###
idx: 0 data: [1.86740017 0.79714291] [4.61669816]
idx: 1 data: [ 0.30325407 -0.62230244] [-2.03026803]

### Data from train set again ###
idx: 0 data: [ 0.08854902 -1.7602724 ] [-5.34363735]
idx: 1 data: [ 0.99129486 -1.78646157] [-0.9519102]
idx: 2 data: [0.66334628 0.01231061] [0.95274982]
idx: 3 data: [ 0.61796118 -0.54523986] [0.35028762]
idx: 4 data: [1.04676734 1.75569924] [5.92952598]
idx: 5 data: [ 0.21987025 -0.53602459] [0.66207364]
```

Example usage of [`XYDataLoader`](https://opimwue.github.io/ddopai/10_dataloaders/tabular_dataloaders.html#xydataloader) on how to include lag features:

```python
X = np.random.standard_normal((10, 2))
Y = np.random.standard_normal((10, 1))
Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)

lag_window_params = {'lag_window': 1, 'include_y': True, 'pre_calc': True}

dataloader = XYDataLoader(X=X, Y=Y, val_index_start=6, test_index_start=8, lag_window_params=lag_window_params)

sample_X, sample_Y = dataloader[0]

print("length train:", dataloader.len_train, "length val:", dataloader.len_val, "length test:", dataloader.len_test)

print("")
print("### Data from train set ###")
for i in range(dataloader.len_train):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)

dataloader.val()

print("")
print("### Data from val set ###")
for i in range(dataloader.len_val):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)

dataloader.test()

print("")
print("### Data from test set ###")
for i in range(dataloader.len_test):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)

dataloader.train()

print("")
print("### Data from train set again ###")
for i in range(dataloader.len_train):
    sample_X, sample_Y = dataloader[i]
    print("idx:", i, "data:", sample_X, sample_Y)
```

```
length train: 4 length val: 2 length test: 2

### Data from train set ###
idx: 0 data: [[ 0.73863651  0.6084497  -0.1193545 ]
 [ 0.35830697 -1.87500947  2.48387723]] [-4.9460667]
idx: 1 data: [[ 0.35830697 -1.87500947  2.48387723]
 [-1.11068046 -0.5626968  -4.9460667 ]] [-1.24390416]
idx: 2 data: [[-1.11068046 -0.5626968  -4.9460667 ]
 [ 0.89828028 -2.19265635 -1.24390416]] [-5.78471176]
idx: 3 data: [[ 0.89828028 -2.19265635 -1.24390416]
 [-0.09191616  0.32758207 -5.78471176]] [0.35156491]

### Data from val set ###
idx: 0 data: [[-0.09191616  0.32758207 -5.78471176]
 [ 1.51172992 -0.25329154  0.35156491]] [2.47560231]
idx: 1 data: [[ 1.51172992 -0.25329154  0.35156491]
 [ 0.17512356  0.93368771  2.47560231]] [1.80751149]

### Data from test set ###
idx: 0 data: [[ 0.17512356  0.93368771  2.47560231]
 [-0.65111828 -0.13138032  1.80751149]] [-1.55867887]
idx: 1 data: [[-0.65111828 -0.13138032  1.80751149]
 [ 0.41587237 -1.40709561 -1.55867887]] [-3.46579185]

### Data from train set again ###
idx: 0 data: [[ 0.73863651  0.6084497  -0.1193545 ]
 [ 0.35830697 -1.87500947  2.48387723]] [-4.9460667]
idx: 1 data: [[ 0.35830697 -1.87500947  2.48387723]
 [-1.11068046 -0.5626968  -4.9460667 ]] [-1.24390416]
idx: 2 data: [[-1.11068046 -0.5626968  -4.9460667 ]
 [ 0.89828028 -2.19265635 -1.24390416]] [-5.78471176]
idx: 3 data: [[ 0.89828028 -2.19265635 -1.24390416]
 [-0.09191616  0.32758207 -5.78471176]] [0.35156491]
```
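Note how each sample now consists of two time steps with three columns each: the two original features plus the lagged demand, i.e., the current features with lag-1 demand and the lag-1 features with lag-2 demand. This is also why the train set shrinks from 6 to 4 samples: the first two rows lack the required lags.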

source

MultiShapeLoader

 MultiShapeLoader (demand:pandas.core.frame.DataFrame,
                   time_features:pandas.core.frame.DataFrame,
                   time_SKU_features:pandas.core.frame.DataFrame,
                   mask:pandas.core.frame.DataFrame=None,
                   SKU_features:pandas.core.frame.DataFrame=None,
                   val_index_start:Optional[int]=None,
                   test_index_start:Optional[int]=None,
                   in_sample_val_test_SKUs:List=None,
                   out_of_sample_val_SKUs:List=None,
                   out_of_sample_test_SKUs:List=None,
                   lag_window_params:dict|None=None,
                   normalize_features:dict|None=None,
                   engineered_SKU_features:dict=None,
                   use_engineered_SKU_features:bool=False,
                   include_non_available:bool=False,
                   train_subset:int=None, train_subset_SKUs:List=None,
                   meta_learn_units:bool=False,
                   lag_demand_normalization:Optional[Literal['minmax','standard','no_normalization']]='standard',
                   demand_normalization:Literal['minmax','standard','no_normalization']='no_normalization',
                   demand_unit_size:float|None=None,
                   provide_additional_target:bool=False,
                   permutate_inputs:bool=False,
                   max_feature_dim:int|None=None)

A class designed for complex datasets with multiple feature types. The class is more memory-efficient than the XYDataLoader, as it separates the storage of SKU-specific features, time-specific features, and time-SKU-specific features. The class works generically as long as those feature classes are provided during pre-processing. The class is designed to handle classic learning, but is also able to work in a meta-learning pipeline where no SKU dimension is present and the model needs to make predictions on the SKU-time level without knowing the specific SKU.

|  | Type | Default | Details |
|---|------|---------|---------|
| demand | DataFrame |  | Demand data of shape time x SKU |
| time_features | DataFrame |  | Features constant over SKUs of shape time x time_features |
| time_SKU_features | DataFrame |  | Features varying over time and SKU of shape time x (time_SKU_features*SKU) with double index |
| mask | DataFrame | None | Mask of shape time x SKU telling which SKUs are available at which time (can be used as mask during training or added to features) |
| SKU_features | DataFrame | None | Features constant over time of shape SKU x SKU_features - only for algorithms learning across SKUs |
| val_index_start | Optional | None | Validation index start on the time dimension |
| test_index_start | Optional | None | Test index start on the time dimension |
| in_sample_val_test_SKUs | List | None | SKUs in the training set to be used for validation and testing, out-of-sample w.r.t. the time dimension |
| out_of_sample_val_SKUs | List | None | SKUs to be held out for validation (can be the same as test if no validation on out-of-sample SKUs is required) |
| out_of_sample_test_SKUs | List | None | SKUs to be held out for testing |
| lag_window_params | dict \| None | None | default: {'lag_window': 0, 'include_y': False, 'pre_calc': True} |
| normalize_features | dict \| None | None | default: {'normalize': True, 'ignore_one_hot': True} |
| engineered_SKU_features | dict | None | default: ["mean_demand", "std_demand", "kurtosis_demand", "skewness_demand", "percentile_10_demand", "percentile_30_demand", "median_demand", "percentile_70_demand", "percentile_90_demand", "inter_quartile_range"] |
| use_engineered_SKU_features | bool | False | if engineered features shall be used |
| include_non_available | bool | False | if timestep/SKU combinations where the SKU was not available for sale shall be included. If included, availability is used as a feature, otherwise as a mask. |
| train_subset | int | None | if only a subset of SKUs is used for training. Will always contain in_sample_val_test_SKUs and then fills the rest with random SKUs |
| train_subset_SKUs | List | None | if train_subset is set, specific SKUs can be provided |
| meta_learn_units | bool | False | if units (SKUs) are trained in the batch dimension to meta-learn across SKUs |
| lag_demand_normalization | Optional | standard | 'minmax', 'standard', 'no_normalization', or None. If None, same as demand_normalization |
| demand_normalization | Literal | no_normalization | 'standard' or 'minmax' |
| demand_unit_size | float \| None | None | uses the same convention as other dataloaders and environments, but here only full decimal values are allowed |
| provide_additional_target | bool | False | follows the ICL convention of providing the actual demand to each token, with the last token receiving 0 |
| permutate_inputs | bool | False | if the inputs shall be permuted during training for meta-learning |
| max_feature_dim | int \| None | None |  |
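Before the Kaggle M5 example below (which requires a local copy of the data), here is a minimal, hypothetical sketch with small synthetic DataFrames. The shapes follow the parameter table above; the exact index conventions, e.g., the orientation of the double column index of `time_SKU_features`, are assumptions:

```python
import numpy as np
import pandas as pd

T, n_skus = 20, 3
skus = [f"SKU_{i}" for i in range(n_skus)]

demand = pd.DataFrame(np.random.poisson(5, size=(T, n_skus)), columns=skus)  # time x SKU
time_features = pd.DataFrame({"weekday": np.arange(T) % 7})                  # time x time_features
time_SKU_features = pd.DataFrame(                                            # time x (time_SKU_features*SKU)
    np.random.standard_normal((T, 2 * n_skus)),
    columns=pd.MultiIndex.from_product([["price", "promo"], skus]),          # double index, orientation assumed
)
mask = pd.DataFrame(np.ones((T, n_skus), dtype=int), columns=skus)           # all SKUs always available

dataloader = MultiShapeLoader(
    demand,
    time_features,
    time_SKU_features,
    mask=mask,
    val_index_start=12,
    test_index_start=16,
    lag_window_params={'lag_window': 2, 'include_y': True, 'pre_calc': False},
)
```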
```python
run_example = False

if run_example:
    from ddopai.datasets.kaggle_m5 import KaggleM5DatasetLoader

    data_path = "/Users/magnus/Documents/02_PhD/Reinforcement_Learning/general_purpose_drl/Newsvendor/kaggle_data" # For testing purposes, please specify the path to the data on your machine
    if data_path is not None:
        loader = KaggleM5DatasetLoader(data_path, overwrite=False, product_as_feature=False)
        demand, SKU_features, time_features, time_SKU_features, mask = loader.load_dataset()

    val_index_start = len(demand)-300
    test_index_start = len(demand)-100

    out_of_sample_val_SKUs = ["HOBBIES_1_002_CA_1", "HOBBIES_1_003_CA_1"]
    out_of_sample_test_SKUs = ["HOBBIES_1_005_CA_1", "FOODS_3_819_WI_3"]

    # keyword arguments used to match the signature above
    dataloader = MultiShapeLoader(
        demand.copy(),
        time_features=time_features.copy(),
        time_SKU_features=time_SKU_features.copy(),
        mask=mask.copy(),
        SKU_features=SKU_features.copy(),
        val_index_start=val_index_start,
        test_index_start=test_index_start,
        # in_sample_val_test_SKUs=["FOODS_3_825_WI_3"],
        out_of_sample_val_SKUs=out_of_sample_val_SKUs,
        out_of_sample_test_SKUs=out_of_sample_test_SKUs,
        lag_window_params={'lag_window': 5, 'include_y': True, 'pre_calc': False},
        # train_subset=300,
        # train_subset_SKUs=["HOBBIES_1_001_CA_1", "HOBBIES_1_012_CA_1"],
        meta_learn_units=True,  # train SKUs in the batch dimension (see parameter table)
        )
# dataloader.__getitem__(49844609) #986 with non-zero lag demand
```