Dataset loader

Class to load datasets available in GitHub releases of this repository.

Info

We provide a range of synthetic and real-world datasets to enable reproducible research. Each dataset type typically comprises several datasets (e.g., 16 multivariate datasets following an ARMA(10,10) process). The datasets are published in the releases of this repository, and the functions below automate downloading them. Loading a dataset takes three steps:

  • Step 1: Create a DatasetLoader object: datasetloader = DatasetLoader()

  • Step 2: Check available dataset types: datasetloader.show_dataset_types(show_num_datasets_per_type=True)

  • Step 3: Load a dataset: data = datasetloader.load_dataset("arma_10_10", 1), where the first argument (a string) is the name of the dataset type and the second argument (an integer) is the dataset number.

Helper functions to load datasets


load_data_from_directory

 load_data_from_directory (dir)

unzip_file

 unzip_file (zip_file_path, output_dir, delete_zip_file=True)
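
The signature above suggests a helper that extracts an archive and, by default, removes it afterwards. The following is only a minimal sketch of that behaviour using the standard library; the name unzip_file_sketch and the returned path are assumptions made for illustration, not the implementation shipped in this repository.

```python
import zipfile
from pathlib import Path


def unzip_file_sketch(zip_file_path, output_dir, delete_zip_file=True):
    """Hypothetical stand-in: extract a zip archive into output_dir."""
    zip_path = Path(zip_file_path)
    out_dir = Path(output_dir)
    # Create the target directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)

    # Mirror the delete_zip_file flag from the documented signature.
    if delete_zip_file:
        zip_path.unlink()

    return out_dir
```

Deleting the archive by default keeps only one copy of the data on disk; passing delete_zip_file=False keeps the download for later reuse.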

download_file_from_github

 download_file_from_github (url, output_path, token=None)
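
To illustrate how a download helper with an optional token could work, here is a hedged sketch built on urllib from the standard library. The names build_github_headers and download_file_sketch are hypothetical, and the header choices (an octet-stream Accept header and a Bearer token) follow the usual GitHub REST conventions for release assets; the actual function in this repository may differ.

```python
import shutil
import urllib.request


def build_github_headers(token=None):
    """Hypothetical helper: request headers for fetching a release asset."""
    headers = {"Accept": "application/octet-stream"}
    if token:
        # Authenticated requests get a higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


def download_file_sketch(url, output_path, token=None):
    """Hypothetical stand-in: stream the body at url into output_path."""
    request = urllib.request.Request(url, headers=build_github_headers(token))
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(output_path, "wb") as out:
            # Copy in chunks so large archives are not held in memory.
            shutil.copyfileobj(response, out)
    return output_path
```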

get_asset_url

 get_asset_url (dataset_type, dataset_number, version='latest',
                token=None)

get_dataset_url

 get_dataset_url (dataset_type, dataset_number, release_tag, token=None)

get_release_tag

 get_release_tag (dataset_type, version, token=None)

get_all_release_tags

 get_all_release_tags (token=None)
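
As a rough illustration of where release tags come from: the GitHub REST API lists a repository's releases as a JSON array in which each entry carries a tag_name field. The sketch below only builds the endpoint URL and extracts the tags from an already-decoded response; releases_url and tag_names_from_releases are hypothetical names, and the owner, repository and tag values are placeholders rather than this project's real ones.

```python
import json


def releases_url(owner, repo):
    """Hypothetical helper: the "list releases" endpoint for a repository."""
    return f"https://api.github.com/repos/{owner}/{repo}/releases"


def tag_names_from_releases(releases):
    """Hypothetical helper: collect tag names from a decoded releases list."""
    return [release["tag_name"] for release in releases if "tag_name" in release]


# Made-up response body, shaped like the API's JSON array of releases.
sample_body = '[{"tag_name": "example-tag-2"}, {"tag_name": "example-tag-1"}]'
tags = tag_names_from_releases(json.loads(sample_body))
```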

Dataset Loader class


DatasetLoader

 DatasetLoader ()

Class to load datasets from the GitHub repository.


DatasetLoader.show_dataset_types

 DatasetLoader.show_dataset_types (show_num_datasets_per_type=False)

Show an overview of all dataset types available in the repository.

Parameter                   Type  Default  Details
show_num_datasets_per_type  bool  False    Whether to show the number of datasets per type

DatasetLoader.load_dataset

 DatasetLoader.load_dataset (dataset_type:str, dataset_number:int,
                             overwrite:bool=False, version:str='latest',
                             token:str=None)

Load a dataset from the GitHub repository.

Parameter       Type  Default  Details
dataset_type    str            Name of the dataset type, e.g. "arma_10_10"
dataset_number  int            Number of the dataset within that type
overwrite       bool  False    Whether to overwrite the dataset if it already exists
version         str   latest   Which version of the dataset to load: "latest" or a specific version
token           str   None     GitHub token to enable more requests (otherwise limited to 60 requests per hour)
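
Because unauthenticated access is limited to 60 requests per hour, it can be convenient to keep a token outside the code and pass it in through the token parameter. The sketch below reads it from an environment variable; read_github_token and the variable name GITHUB_TOKEN are assumptions chosen for illustration, not something this library defines.

```python
import os


def read_github_token(env_var="GITHUB_TOKEN"):
    """Hypothetical helper: return the token from the environment, or None."""
    value = os.environ.get(env_var, "").strip()
    # An unset or empty variable falls back to unauthenticated access.
    return value or None


token = read_github_token()
# The result can then be passed on, following the documented signature:
# data = datasetloader.load_dataset("arma_10_10", 1, token=token)
```

Returning None when the variable is missing matches the documented default, so the same call works with and without a token.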

Example usage:

datasetloader = DatasetLoader()
datasetloader.show_dataset_types()

Univariate datasets:
bakery

Multivariate datasets:
arma_10_10
arma_2_2
ar_1

download_test = True

if download_test:
    # Any dataset type listed above can be used here, e.g. "arma_10_10".
    data = datasetloader.load_dataset("bakery", 1)
    X = data["data_raw_features"]
    y = data["data_raw_target"]
    print(X.shape, y.shape)

(127575, 13) (127575, 1)

        date        weekday  month  year  is_schoolholiday  is_holiday  is_holiday_next2days  store  product  rain  temperature  promotion_currentweek  promotion_lastweek
0       2016-01-02  FRI      JAN    2016  1                 0           0                     2      101      11.9  2.1          0                      0
1       2016-01-03  SAT      JAN    2016  1                 0           0                     2      101      4.1   2.6          0                      0
2       2016-01-04  SUN      JAN    2016  1                 0           1                     2      101      7.9   3.2          0                      0
3       2016-01-05  MON      JAN    2016  1                 0           1                     2      101      3.5   3.1          0                      0
4       2016-01-06  TUE      JAN    2016  1                 1           0                     2      101      0.1   4.1          0                      0
...     ...         ...      ...    ...   ...               ...         ...                   ...    ...      ...   ...          ...                    ...
127570  2019-04-26  THU      APR    2019  1                 0           0                     71     110      4.9   8.0          0                      0
127571  2019-04-27  FRI      APR    2019  1                 0           0                     71     110      6.1   7.8          0                      0
127572  2019-04-28  SAT      APR    2019  0                 0           0                     71     110      1.0   6.5          0                      0
127573  2019-04-29  SUN      APR    2019  0                 0           1                     71     110      9.1   6.5          0                      0
127574  2019-04-30  MON      APR    2019  0                 0           1                     71     110      0.0   10.3         0                      0

127575 rows × 13 columns