Quickstart Tutorial#

In this tutorial, you’ll learn how to build a basic Machine Learning project in Dataiku, from data exploration to model development, working mainly in Jupyter notebooks.

Prerequisites#

  • Have access to a Dataiku 12+ instance.

  • Create a Python>=3.8 code environment named py_quickstart with the following required packages:

mlflow
scikit-learn>=1.0,<1.4
scipy<1.12.0
statsmodels
seaborn

Note

In Dataiku, the equivalent of virtual environments is called a “code environment.” The code environment documentation provides more information and instructions for creating a new Python code environment.
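Once the environment is built, a quick way to check that a notebook is actually using it (you will select the kernel later, in the Instructions section) is a minimal import check like the following sketch. The printed versions will depend on how the package constraints were resolved on your instance.

# Minimal sanity check: confirm the py_quickstart environment is active
# and that the required packages are importable.
import mlflow
import scipy
import seaborn
import sklearn
import statsmodels

print("scikit-learn:", sklearn.__version__)
print("scipy:", scipy.__version__)
print("mlflow:", mlflow.__version__)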

Installation#

Import the project#

On the Dataiku homepage, select + NEW PROJECT > DSS Tutorials. In the Quick Start section, select Developers Quick Start.

Alternatively, you can download the project from this page and then upload it to your Dataiku instance: + NEW PROJECT > Import project.

Set the code environment#

To ensure the code environment is automatically selected for running all the Python scripts in your project, we will change the project settings to use it by default.

  • On the top bar, select … > Settings > Code env selection.

  • In the Default Python code env:

    • Change Mode to Select an environment.

    • In the Environment parameter, select the code environment you’ve just created.

    • Click the Save button or press Ctrl+S.

[Screenshot: selecting the default Python code environment in the project settings]

Set up the project#

This tutorial comes with the following:

  • a README.md file (stored in the project Wiki)

  • an input dataset: the Heart Failure Prediction Dataset

  • three Jupyter Notebooks that you will leverage to build the project

  • a Python repository stored in the project library, with some Python functions that will be used in the different notebooks:

utils/data_processing.py
"""data_processing.py

This file contains data preparation functions to process the heart measures dataset.
"""

import pandas as pd

def transform_heart_categorical_measures(df, chest_pain_colname, resting_ecg_colname, 
                                         exercise_induced_angina_colname, st_slope_colname, sex_colname):
    """
    Transforms each category from the given categorical columns into an integer value, using specific replacement rules for each column.
    
    :param pd.DataFrame df: the input dataset
    :param str chest_pain_colname: the name of the column containing information relative to chest pain type
    :param str resting_ecg_colname: the name of the column containing information relative to the resting electrocardiogram results
    :param str exercise_induced_angina_colname: the name of the column containing information relative to exercise-induced angina
    :param str st_slope_colname: the name of the column containing information relative to the slope of the peak exercise ST segment
    :param str sex_colname: the name of the column containing information relative to the patient gender
    
    :returns: the dataset with transformed categorical columns
    :rtype: pd.DataFrame
    """
    df[chest_pain_colname] = df[chest_pain_colname].replace({'TA': 1, 'ATA': 2, 'NAP': 3, 'ASY': 4})
    df[resting_ecg_colname] = df[resting_ecg_colname].replace({'Normal': 0, 'ST': 1, 'LVH': 2})
    df[exercise_induced_angina_colname] = df[exercise_induced_angina_colname].replace({'N': 0, 'Y': 1})
    df[st_slope_colname] = df[st_slope_colname].replace({'Down': 0, 'Flat': 1, 'Up': 2})
    df[sex_colname] = df[sex_colname].replace({'M': 0, 'F': 1})
    return df
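In a notebook, this helper might be used as in the following minimal sketch. The dataset name heart_measures and the column names (taken from the Heart Failure Prediction Dataset schema) are assumptions to adjust to your project; the import works because Dataiku adds the project library to the Python path of the project’s notebooks.

# Minimal usage sketch. The dataset name "heart_measures" and the column
# names are assumptions; adjust them to match your project.
import dataiku
from utils.data_processing import transform_heart_categorical_measures

# Read the input dataset into a pandas DataFrame, then encode the categorical measures.
df = dataiku.Dataset("heart_measures").get_dataframe()
df = transform_heart_categorical_measures(
    df,
    chest_pain_colname="ChestPainType",
    resting_ecg_colname="RestingECG",
    exercise_induced_angina_colname="ExerciseAngina",
    st_slope_colname="ST_Slope",
    sex_colname="Sex",
)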
utils/model_training.py
"""model_training.py

This file contains ML modeling functions to grid-search the best hyperparameters of a scikit-learn model and to cross-validate a model.
"""

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate

def find_best_parameters(X, y, estimator, params, cv=5):
    """
    Performs a grid search over a set of hyperparameters for the given scikit-learn estimator and returns the best hyperparameters.
    
    :param pd.DataFrame X: The data to fit
    :param pd.Series y: the target variable to predict
    :param sklearn-estimator estimator: The scikit-learn model used to fit the data
    :param dict params: the set of hyperparameters to search over
    :param int cv: the number of folds to use for cross-validation, default is 5
    
    :returns: the best hyperparameters
    :rtype: dict
    """
    grid = GridSearchCV(estimator, params, cv=cv)
    grid.fit(X, y)
    return grid.best_params_

def cross_validate_scores(X, y, estimator, cv=5, scoring=['accuracy']):
    """
    Performs a cross-validation of the scikit-learn model over cv folds.
    
    :param pd.DataFrame X: The data to fit
    :param pd.Series y: the target variable to predict
    :param sklearn-estimator estimator: The scikit-learn model used to fit the data
    :param int cv: the number of folds to use for cross-validation
    :param list scoring: the list of performance metrics to use to evaluate the model
    
    :returns: the average result for each performance metric over the cv folds
    :rtype: dict
    """
    cross_val = cross_validate(estimator, X, y, cv=cv, scoring=scoring)
    metrics_result = {}
    for metric in scoring:
        metrics_result[metric] = np.mean(cross_val['test_'+metric])
    return metrics_result
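Both helpers might then be combined in a notebook roughly as follows, continuing from the DataFrame df prepared in the previous sketch. The LogisticRegression estimator, the parameter grid, and the target column HeartDisease are illustrative assumptions, not the exact choices made in the tutorial notebooks.

# Minimal usage sketch; `df` is the prepared DataFrame from the sketch above.
# The estimator, the grid, and the target column "HeartDisease" are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

from utils.model_training import find_best_parameters, cross_validate_scores

X = df.drop(columns=["HeartDisease"])  # features
y = df["HeartDisease"]                 # binary target: heart failure risk

# Grid-search a small hyperparameter grid, then cross-validate the best model.
best_params = find_best_parameters(
    X, y, LogisticRegression(max_iter=1000), params={"C": [0.1, 1.0, 10.0]}, cv=5
)
scores = cross_validate_scores(
    X, y, LogisticRegression(max_iter=1000, **best_params),
    cv=5, scoring=["accuracy", "f1"],
)
print(best_params, scores)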

The project aims to build a binary classification model that predicts the risk of heart failure from health information. To do so, you’ll go through the standard steps of a Machine Learning project: data exploration, data preparation, modeling with different ML algorithms, and model evaluation.

Instructions#

The project is composed of three notebooks (they can be found in the Notebooks section: </> > Notebooks) that you will run one by one. For each notebook:

  1. Ensure you use the code environment specified in the Prerequisites (py_quickstart). You can change the Python kernel in the Notebook menu under Kernel > Change Kernel.

  2. Run the notebook cell by cell.

  3. For notebooks 1 and 3, follow the instructions in the last section of each notebook to build a new step in the project workflow.

[Screenshot: changing the notebook kernel]

You’ll find the details of these notebooks and the associated outputs in the following sections: