A class for datasets with the typical X, Y structure. Both X and Y are NumPy arrays. X may have shape (datapoints, features) or, if lag features are used, (datapoints, sequence_length, features). `prep_lag_features` can be used to create those lag features. Y has shape (datapoints, units).
Create lag features for the dataset. If `include_y` is true, a lag-1 of the target variable is added as a feature. If `lag_window` is > 0, the lag features are added as a middle dimension to X. Note that this means, for example, that with a lag window of 1 the data will include 2 time steps: the current features including lag-1 demand, and the lag-1 features including lag-2 demand. If `pre_calc` is true, all these calculations are performed on the entire dataset up front, which reduces computation time later on at the expense of increased memory usage.
| | Type | Default | Details |
|---|---|---|---|
| lag_window | int | 0 | Length of the lag window |
| include_y | bool | False | Whether lag demand is included as a feature |
| pre_calc | bool | False | Whether all lags are pre-calculated for the entire dataset |
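The shape change described above can be illustrated with plain NumPy. This is a minimal sketch, not the ddopai implementation: the helper name `make_lag_features` is hypothetical, and it assumes that rows without a full history are dropped and that time steps are ordered oldest first along the middle dimension.

```python
import numpy as np

def make_lag_features(X, Y, lag_window=0, include_y=False):
    """Hypothetical helper illustrating lag features (not the ddopai code)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if include_y:
        # Row t receives Y[t-1] as an extra feature; the first row has no
        # predecessor and is dropped (assumption).
        X = np.hstack([X[1:], Y[:-1]])
        Y = Y[1:]
    if lag_window > 0:
        # Each datapoint becomes lag_window + 1 consecutive rows, oldest first.
        n = X.shape[0] - lag_window
        X = np.stack([X[i:i + lag_window + 1] for i in range(n)])
        Y = Y[lag_window:]
    return X, Y

X = np.arange(20, dtype=float).reshape(10, 2)
Y = np.arange(100, 110, dtype=float).reshape(10, 1)
X_lag, Y_lag = make_lag_features(X, Y, lag_window=1, include_y=True)
print(X_lag.shape, Y_lag.shape)  # (8, 2, 3) (8, 1)
print(X_lag[0])  # [[2. 3. 100.], [4. 5. 101.]]: lag-1 step, then current step
```

With a lag window of 1, each datapoint holds two time steps. The last row of `X_lag[0]` carries the current features plus lag-1 demand (101), and the row before it carries the lag-1 features plus lag-2 demand (100), for a target of 102.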
Example of using [`XYDataLoader`](https://opimwue.github.io/ddopai/10_dataloaders/tabular_dataloaders.html#xydataloader) to handle the train, validation, and test sets:
    X = np.random.standard_normal((10, 2))
    Y = np.random.standard_normal((10, 1))
    Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)

    dataloader = XYDataLoader(X = X, Y = Y, val_index_start=6, test_index_start=8)

    sample_X, sample_Y = dataloader[0]

    print("length train:", dataloader.len_train, "length val:", dataloader.len_val, "length test:", dataloader.len_test)

    print("")
    print("### Data from train set ###")
    for i in range(dataloader.len_train):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)

    dataloader.val()

    print("")
    print("### Data from val set ###")
    for i in range(dataloader.len_val):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)

    dataloader.test()

    print("")
    print("### Data from test set ###")
    for i in range(dataloader.len_test):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)

    dataloader.train()

    print("")
    print("### Data from train set again ###")
    for i in range(dataloader.len_train):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)
length train: 6 length val: 2 length test: 2
### Data from train set ###
idx: 0 data: [ 0.08854902 -1.7602724 ] [-5.34363735]
idx: 1 data: [ 0.99129486 -1.78646157] [-0.9519102]
idx: 2 data: [0.66334628 0.01231061] [0.95274982]
idx: 3 data: [ 0.61796118 -0.54523986] [0.35028762]
idx: 4 data: [1.04676734 1.75569924] [5.92952598]
idx: 5 data: [ 0.21987025 -0.53602459] [0.66207364]
### Data from val set ###
idx: 0 data: [-1.54514703 -0.67784998] [-5.27601525]
idx: 1 data: [ 0.935785 -1.30048604] [-2.66055254]
### Data from test set ###
idx: 0 data: [1.86740017 0.79714291] [4.61669816]
idx: 1 data: [ 0.30325407 -0.62230244] [-2.03026803]
### Data from train set again ###
idx: 0 data: [ 0.08854902 -1.7602724 ] [-5.34363735]
idx: 1 data: [ 0.99129486 -1.78646157] [-0.9519102]
idx: 2 data: [0.66334628 0.01231061] [0.95274982]
idx: 3 data: [ 0.61796118 -0.54523986] [0.35028762]
idx: 4 data: [1.04676734 1.75569924] [5.92952598]
idx: 5 data: [ 0.21987025 -0.53602459] [0.66207364]
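The output above suggests that `val_index_start` and `test_index_start` partition the rows into contiguous blocks, and that indexing is relative to the currently active set. The following is a minimal sketch of that partitioning under those assumptions; the helper `split_positions` is hypothetical and is not part of ddopai.

```python
import numpy as np

def split_positions(n_rows, val_index_start, test_index_start):
    """Hypothetical helper: row positions of each contiguous split."""
    rows = np.arange(n_rows)
    return {
        "train": rows[:val_index_start],
        "val": rows[val_index_start:test_index_start],
        "test": rows[test_index_start:],
    }

splits = split_positions(10, val_index_start=6, test_index_start=8)
print({name: len(rows) for name, rows in splits.items()})
# {'train': 6, 'val': 2, 'test': 2}

# Position 0 of the validation set maps to global row 6 (assumption based on
# the printed output, where idx 0 differs between the train and val sets).
print(splits["val"][0])  # 6
```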
Example of using [`XYDataLoader`](https://opimwue.github.io/ddopai/10_dataloaders/tabular_dataloaders.html#xydataloader) with lag features:
    X = np.random.standard_normal((10, 2))
    Y = np.random.standard_normal((10, 1))
    Y += 2*X[:,0].reshape(-1, 1) + 3*X[:,1].reshape(-1, 1)

    lag_window_params = {'lag_window': 1, 'include_y': True, 'pre_calc': True}

    dataloader = XYDataLoader(X = X, Y = Y, val_index_start=6, test_index_start=8, lag_window_params=lag_window_params)

    sample_X, sample_Y = dataloader[0]

    print("length train:", dataloader.len_train, "length val:", dataloader.len_val, "length test:", dataloader.len_test)

    print("")
    print("### Data from train set ###")
    for i in range(dataloader.len_train):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)

    dataloader.val()

    print("")
    print("### Data from val set ###")
    for i in range(dataloader.len_val):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)

    dataloader.test()

    print("")
    print("### Data from test set ###")
    for i in range(dataloader.len_test):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)

    dataloader.train()

    print("")
    print("### Data from train set again ###")
    for i in range(dataloader.len_train):
        sample_X, sample_Y = dataloader[i]
        print("idx:", i, "data:", sample_X, sample_Y)
A class designed for complex datasets with multiple feature types. The class is more memory-efficient than the XYDataLoader, as it separates the storage of SKU-specific features, time-specific features, and time-SKU-specific features. The class works generically as long as those feature classes are provided during pre-processing. The class is designed to handle classic learning, but it can also work in a meta-learning pipeline where no SKU dimension is present and the model needs to make predictions on the SKU-time level without knowing the specific SKU.
| | Type | Default | Details |
|---|---|---|---|
| demand | DataFrame | | Demand data of shape time x SKU |
| time_features | DataFrame | | Features constant over SKU, of shape time x time_features |
| time_SKU_features | DataFrame | | Features varying over time and SKU, of shape time x (time_SKU_features*SKU) with double index |
| mask | DataFrame | None | Mask of shape time x SKU indicating which SKUs are available at which time (can be used as a mask during training or added to the features) |
| SKU_features | DataFrame | None | Features constant over time, of shape SKU x SKU_features; only for algorithms learning across SKUs |
| val_index_start | Optional | None | Validation index start on the time dimension |
| test_index_start | Optional | None | Test index start on the time dimension |
| in_sample_val_test_SKUs | List | None | SKUs in the training set to be used for validation and testing, out-of-sample w.r.t. the time dimension |
| out_of_sample_val_SKUs | List | None | SKUs to be held out for validation (can be the same as test if no validation on out-of-sample SKUs is required) |
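To make the expected input shapes more concrete, the sketch below builds toy versions of the four feature tables with pandas and assembles the feature vector for one (time, SKU) pair. It is an illustration only: the column names, the helper `assemble_features`, and the reading of "double index" as a two-level column MultiIndex ordered (SKU, feature) are assumptions, not the ddopai implementation.

```python
import numpy as np
import pandas as pd

time_index = pd.date_range("2024-01-01", periods=4, freq="D")
skus = ["sku_a", "sku_b"]

# Demand: time x SKU
demand = pd.DataFrame(np.arange(8).reshape(4, 2), index=time_index, columns=skus)

# Time features: stored once per time step, shared by all SKUs
time_features = pd.DataFrame(
    {"weekday": time_index.dayofweek.to_numpy(), "month": time_index.month.to_numpy()},
    index=time_index,
)

# Time-SKU features: time x (SKU * features), two-level columns (assumed order)
cols = pd.MultiIndex.from_product([skus, ["price", "promo"]], names=["SKU", "feature"])
time_SKU_features = pd.DataFrame(np.arange(16).reshape(4, 4), index=time_index, columns=cols)

# SKU features: stored once per SKU, constant over time
SKU_features = pd.DataFrame({"weight": [1.0, 2.5]}, index=skus)

def assemble_features(t, sku):
    """Hypothetical helper: concatenate all feature types for one (time, SKU)."""
    return np.concatenate([
        time_features.iloc[t].to_numpy(dtype=float),
        time_SKU_features.iloc[t][sku].to_numpy(dtype=float),
        SKU_features.loc[sku].to_numpy(dtype=float),
    ])

print(assemble_features(1, "sku_b"))  # [1.  1.  6.  7.  2.5]
print(demand.iloc[1]["sku_b"])        # 3
```

Storing the time features and SKU features once each, instead of repeating them for every (time, SKU) row of a flat X matrix, is what makes this layout cheaper in memory.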