ETL_nodes
General function for simple input/output
input_output_node
input_output_node (*inputs)
*This node covers cases where the raw data can be passed straight through without any processing steps.
It accepts multiple inputs and returns them unpacked. If there is only one input, it returns that input itself.*
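
The behaviour described above can be illustrated with a minimal sketch. The name `passthrough_sketch` is hypothetical and the body is an assumption about how such a node might behave, not the actual implementation of input_output_node.

```python
def passthrough_sketch(*inputs):
    """Hypothetical pass-through node: return the inputs unchanged.

    Assumption: a single input is returned as-is, while several inputs
    come back as a tuple that the caller can unpack.
    """
    if len(inputs) == 1:
        return inputs[0]
    return inputs


# One input: the object itself is returned.
single = passthrough_sketch("raw_sales")

# Several inputs: the result can be unpacked into separate names.
sales, stores = passthrough_sketch("raw_sales", "raw_stores")
```

Returning the bare object for a single input spares the caller from unpacking a one-element tuple.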
Helper nodes for data transformation
convert_hirarchic_to_dict
convert_hirarchic_to_dict (categories:pandas.core.frame.DataFrame, single_leaf_level=True)
*This function converts a strictly hierarchical dataframe into a dictionary. Strictly hierarchical means that each column represents a hierarchy level, and each subcategory belongs to exactly one higher-level category.
The dictionary is the general form that is used by the write_db_node as input.
Requirements:
- IMPORTANT: This function is only for strictly hierarchical categories, i.e., each subcategory belongs to exactly one higher-level category.
- The categories must be in descending order (i.e., the first column is the highest-level category, the second column is the second-highest-level category, etc.).
- The columns can carry a name, if required (e.g., “category”, “department”, etc.).
- The categories themselves will be saved under generic levels (“1”, “2”, etc.), but the specific names will be returned in a separate list for saving.
Inputs:
- categories: A pandas dataframe with the categories. The columns must be in descending order (i.e., the first column is the highest-level category, the second column is the second-highest-level category, etc.).
- single_leaf_level: A boolean that indicates whether the categories dataframe has only one leaf level. If True, the function will return a dictionary with the leaf level as the last level. If False, leaves may be at different levels.
Outputs:
- mappings: A dictionary with the levels as keys and a dictionary as values. Each inner dictionary has the category names as keys and a list of their parents as values. Because a category can list more than one parent, this dictionary is more general than the dataframe. It is the required input for the write_db_node.
- category_level_names: A list of the column names of the categories dataframe.*
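
To make the output shape concrete, here is a dependency-free sketch that builds a comparable mapping from plain rows instead of a dataframe. The helper name `hierarchy_rows_to_mappings` is hypothetical, and several details are assumptions rather than documented behaviour: level keys are the strings "1", "2", and so on; each parent list holds only the direct parent; top-level categories get an empty list; and the single_leaf_level option is not modelled.

```python
def hierarchy_rows_to_mappings(level_names, rows):
    """Hypothetical sketch of the mapping shape described above.

    level_names: column names, highest level first.
    rows: one tuple per dataframe row, highest level first.
    With pandas, these could come from df.columns.tolist() and
    df.itertuples(index=False).
    """
    # Assumption: generic level keys are the strings "1", "2", ...
    mappings = {str(i + 1): {} for i in range(len(level_names))}
    for row in rows:
        for i, category in enumerate(row):
            # Assumption: only the direct parent is stored; level 1 has none.
            parents = [] if i == 0 else [row[i - 1]]
            known = mappings[str(i + 1)].setdefault(category, parents)
            if known != parents:
                # Strict hierarchy: a subcategory may not switch parents.
                raise ValueError(f"{category!r} has more than one parent")
    return mappings, list(level_names)


level_names = ["department", "category"]
rows = [("food", "fruit"), ("food", "dairy"), ("tools", "hammers")]
mappings, category_level_names = hierarchy_rows_to_mappings(level_names, rows)
```

Here `mappings["2"]["fruit"]` is `["food"]`, while `category_level_names` keeps the original column names so they can be stored alongside the generic levels.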