Author(s): Anh Khoa NGO HO

Originally published on Towards AI.

Photo by AltumCode on Unsplash

As a data scientist, I used to struggle with experiments involving the training and fine-tuning of large deep-learning models. For each experiment, I had to carefully document its hyperparameters (usually embedded in the model's name, e.g., model_epochs_50_batch_100), the training dataset used, and the model's performance. Managing and reproducing experiments became highly complex. Then I discovered Kedro, which resolved all of these issues. If you are conducting experiments in machine learning, I believe this article will prove immensely beneficial.

What do we need to know about Kedro?

Kedro is an open-source toolbox that provides an efficient template for conducting experiments in machine learning. It facilitates the creation of various data pipelines, covering tasks such as data transformation, model training, and the storage of all pipeline outputs. Kedro revolves around three crucial concepts:

Data catalog: the data catalog acts as a registry for all sources that the project can use to manage loading and saving data. It includes input datasets, processed datasets, and models. These sources are declared in YAML files as dictionaries, referred to by name (strings). E.g., in the code below, when we call input_dataset, Kedro loads the CSV file stored at filepath.

input_dataset:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset.csv

processed_dataset:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/dataset.csv

Node: a node serves as a wrapper for a pure Python function and defines the inputs and outputs of that function. It represents a small step in the pipeline. For example, in data processing, nodes can be defined for tasks like concatenating dataframes or creating new features. Inputs and outputs are sourced from the data catalog.

from kedro.pipeline import node

def process_data(df1, df2):
    return df1 + df2

node_process_data = node(
    func=process_data,
    inputs=["input_dataset", "input_dataset"],
    outputs="processed_dataset",
)

Pipeline: a pipeline consists of a list of nodes whose inputs and outputs are interconnected.

from kedro.pipeline import pipeline

pipeline_process_data = pipeline([node_process_data])
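To see how these three pieces fit together, here is a minimal sketch (not part of the original example) that runs the one-node pipeline above with an in-memory catalog instead of the YAML-declared CSV files. It assumes Kedro 0.19's kedro.io and kedro.runner APIs, and the toy dataframe is purely illustrative.

import pandas as pd
from kedro.io import DataCatalog, MemoryDataset
from kedro.runner import SequentialRunner

# Illustrative stand-in for the CSV file declared in catalog.yml.
df = pd.DataFrame({"feature": [1, 2, 3]})

# Register the input under the same name the node expects.
catalog = DataCatalog({"input_dataset": MemoryDataset(df)})

# Run the pipeline; outputs that are not registered in the catalog
# are returned in memory as a dictionary keyed by data name.
outputs = SequentialRunner().run(pipeline_process_data, catalog)
print(outputs["processed_dataset"])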
Kedro also offers several valuable plugins: kedro-viz for visualizing data pipelines, kedro-mlflow for facilitating MLflow integration, kedro-sagemaker for running a Kedro pipeline with Amazon SageMaker, etc. (read more).

Kedro-viz

This plugin serves as an interactive development tool for visualizing pipelines built with Kedro. It can display dataframes and charts generated through the pipeline, and it makes it easy to compare different experiment runs with a single click. In this article, we'll explore how to track hyperparameters and performance scores of deep-learning models using kedro-viz.

Now, let's delve into creating a benchmark in three steps.

Step 1: set up our experiment

The initial step is to install the required library, Kedro, along with additional dependencies such as PyTorch (to build a deep-learning model) and kedro-viz, using pip install. For detailed instructions, refer to installing kedro and installing kedro-viz. Note that there are differences between Kedro versions, so we pin the version to 0.19.1. It is also advisable to create a dedicated Python environment (consider using conda with the command conda create --name environment_name).

pip install kedro==0.19.1
pip install kedro-viz
pip3 install torch

Creating a project template with the necessary files and folders is achieved with the following command (read more):

kedro new

After providing the project name and configuring options, the resulting folder structure includes:

conf: this folder keeps the configuration settings for different stages of our data pipeline (e.g., development, testing, and production). By default, Kedro has a base and a local environment. Read more.
– base contains the default settings that are used across our pipelines:
(a) catalog.yml: the default YAML file describing our data catalog. If we want to create more catalog files, their names must follow the format catalog_*.yml.
(b) parameters.yml: the default YAML file describing our default parameters. The general naming format for these files is parameters_*.yml.
– local is used for configuration that is either user-specific (e.g., IDE configuration) or protected (e.g., security keys). These local settings override the settings in the base folder.

data: contains all inputs/outputs of our pipeline (see Data Catalog). Kedro proposes a list of sub-folders to store these inputs/outputs, e.g., raw data in 01_raw, pre-processed data in 02_intermediate, trained models in 06_models, predicted data in 07_model_output, and performance scores and figures in 08_reporting.
– 01_raw
– 02_intermediate
– 03_primary
– 04_feature
– 05_model_input
– 06_models
– 07_model_output
– 08_reporting

docs: this is for documentation. You can use Sphinx to quickly build documentation.

logs: we use this folder to keep all logs.

notebooks: we put our Jupyter notebooks here.

src: houses code, including functions, nodes, and pipelines. To initiate an empty pipeline, use the command kedro pipeline create pipeline_name. A folder named pipeline_name is then generated, and all created pipeline folders are stored in the directory src/project_name/pipelines. Within a pipeline folder pipeline_name, two essential files are present:

nodes.py: contains the node functions that make up the data processing.
pipeline.py: used for constructing the pipeline.

Furthermore, a configuration file, parameters_<pipeline_name>.yml, is created in conf/base/ to store parameters specific to the created pipeline. In the next step, we will customize these files to define our pipeline.

Step 2: define our data pipeline

Using a small dataset (sklearn.datasets.load_digits), we define a simple pipeline that gets the data and stores it locally. In nodes.py, we define a function to get the dataset.

import pandas as pd
from sklearn.datasets import load_digits

def get_dataset() -> pd.DataFrame:
    data = load_digits()
    df_data = pd.DataFrame(data['data'], columns=data['feature_names'])
    df_data['target'] = data['target']
    return df_data
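Because get_dataset is a pure function, it can be sanity-checked outside of Kedro before wiring it into a pipeline. The quick check below is my own illustration, not part of the project template; the expected values follow from the digits dataset itself.

# Quick sanity check of the node function, run outside of Kedro.
df = get_dataset()
print(df.shape)                # expected: (1797, 65) — 64 pixel features plus the target column
print(df['target'].nunique())  # expected: 10 digit classes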
In the file src/project_name/pipelines/pipeline_name/pipeline.py, we build a list of nodes named pipeline_example. Each node comprises func (the function to call, e.g., get_dataset), inputs, and outputs. The inputs and outputs can be a string, a list of strings, or None, depending on the inputs/outputs of the corresponding function. It's important to note that a string denotes a data name, which we can reuse in a subsequent node of our pipeline.

from kedro.pipeline import Pipeline, pipeline, node
from .nodes import get_dataset

def create_pipeline(**kwargs) -> Pipeline:
    pipeline_example = [
        node(
            func=get_dataset,
            inputs=None,
            outputs="raw_data",
        )
    ]
    return pipeline(pipeline_example)

We define data names in conf/base/catalog.yml and specify their storage details (filepath). There are many options for storing data: local file systems, network file systems, cloud object stores, […]
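For instance, a minimal catalog entry for the raw_data output of the pipeline above could look like the sketch below. The filepath is an assumption that follows the data/01_raw convention described in Step 1; your project may use a different location or dataset type.

raw_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset.csv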