Quickstart Tutorial#
In this tutorial, you’ll learn how to build a basic Machine Learning project in Dataiku, from data exploration to model development, mainly using Jupyter notebooks.
Prerequisites#
Have access to a Dataiku 12+ instance.
Create a Python >= 3.8 code environment named py_quickstart with the following required packages:
mlflow
scikit-learn>=1.0,<1.4
scipy<1.12.0
statsmodels
seaborn
Note
In Dataiku, the equivalent of virtual environments is called a “code environment.” The code environment documentation provides more information and instructions for creating a new Python code environment.
Installation#
Import the project#
On the Dataiku homepage, select + NEW PROJECT > DSS Tutorials. In the Quick Start section, select Developers Quick Start.
Alternatively, you can download the project from this page and then upload it to your Dataiku instance: + NEW PROJECT > Import project.
Set the code environment#
To ensure the code environment is automatically selected for running all the Python scripts in your project, we will change the project settings to use it by default.
On the top bar, select … > Settings > Code env selection.
In the Default Python code env:
Change Mode to Select an environment.
In the Environment parameter, select the code environment you’ve just created.
Click the Save button or press Ctrl+S.
Set up the project#
This tutorial comes with the following:
a README.md file (stored in the project Wiki)
an input dataset: the Heart Failure Prediction Dataset
three Jupyter Notebooks that you will leverage to build the project
a Python repository stored in the project library, with some Python functions that will be used in the different notebooks:
utils/data_processing.py
"""data_processing.py
This file contains data preparation functions to process the heart measures dataset.
"""
import pandas as pd
def transform_heart_categorical_measures(df, chest_pain_colname, resting_ecg_colname,
exercise_induced_angina_colname, st_slope_colname, sex_colname):
"""
Transforms each category from the given categorical columns into int value, using specific replacement rules for each column.
:param pd.DataFrame df: the input dataset
:param str chest_pain_colname: the name of the column containing information relative to chest pain type
:param str resting_ecg_colname: the name of the column containing information relative to the resting electrocardiogram results
:param str exercise_induced_angina_colname: the name of the column containing information relative to exercise-induced angina
:param str st_slope_colname: the name of the column containing information relative to the slope of the peak exercise ST segment
:param str sex_colname: the name of the column containing information relative to the patient gender
:returns: the dataset with transform categorical columns
:rtype: pd.DataFrame
"""
df[chest_pain_colname].replace({'TA':1, 'ATA':2, 'NAP': 3, 'ASY': 4}, inplace=True)
df[resting_ecg_colname].replace({'Normal':0, 'ST':1, 'LVH':2}, inplace=True)
df[exercise_induced_angina_colname].replace({'N':0, 'Y':1}, inplace=True)
df[st_slope_colname].replace({'Down':0, 'Flat':1, 'Up':2}, inplace=True)
df[sex_colname].replace({'M': 0, 'F': 1}, inplace=True)
return df
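For example, here is how this function might be called on the raw data (a minimal sketch: the sample DataFrame below mimics the Heart Failure Prediction dataset’s schema, and the import assumes the project library’s utils package is on the Python path, as it is in Dataiku notebooks):

import pandas as pd

from utils.data_processing import transform_heart_categorical_measures

# Illustrative sample with the same categorical columns as the Heart Failure Prediction dataset
heart_measures = pd.DataFrame({
    'ChestPainType': ['ATA', 'NAP', 'ASY'],
    'RestingECG': ['Normal', 'ST', 'LVH'],
    'ExerciseAngina': ['N', 'Y', 'N'],
    'ST_Slope': ['Up', 'Flat', 'Down'],
    'Sex': ['M', 'F', 'M'],
})

heart_measures = transform_heart_categorical_measures(
    heart_measures,
    chest_pain_colname='ChestPainType',
    resting_ecg_colname='RestingECG',
    exercise_induced_angina_colname='ExerciseAngina',
    st_slope_colname='ST_Slope',
    sex_colname='Sex',
)
print(heart_measures)  # all five columns are now integer-coded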
utils/model_training.py
"""model_training.py
This file contains ML modeling functions to grid search best hyper parameters of a Scikit-Learn model and cross evaluate a model.
"""
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
def find_best_parameters(X, y, estimator, params, cv=5):
"""
Performs a grid search on the sklearn estimator over a set of hyper parameters and return the best hyper parameters.
:param pd.DataFrame X: The data to fit
:param pd.Series y: the target variable to predict
:param sklearn-estimator estimator: The scikit-learn model used to fit the data
:param dict params: the set of hyper parameters to search on
:param int cv: the number of folds to use for cross validation, default is 5
:returns: the best hyper parameters
:rtype: dict
"""
grid = GridSearchCV(estimator, params, cv=cv)
grid.fit(X, y)
return grid.best_params_
def cross_validate_scores(X, y, estimator, cv=5, scoring=['accuracy']):
"""
Performs a cross evaluation of the scikit learn model over n folds.
:param pd.DataFrame X: The data to fit
:param pd.Series y: the target variable to predict
:param sklearn-estimator estimator: The scikit-learn model used to fit the data
:param int cv: the number of folds to use for cross validation
:param list scoring: the list of performance metrics to use to evaluate the model
:returns: the average result for each performance metrics over the n folds
:rtype: dict
"""
cross_val = cross_validate(estimator, X, y, cv=cv, scoring=scoring)
metrics_result = {}
for metric in scoring:
metrics_result[metric] = np.mean(cross_val['test_'+metric])
return metrics_result
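As an illustration, the two helpers chain naturally: tune first, then cross-validate with the best hyperparameters. The sketch below uses a synthetic dataset and a RandomForestClassifier as stand-ins; in the notebooks, X and y come from the prepared heart measures dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

from utils.model_training import find_best_parameters, cross_validate_scores

# Synthetic binary classification data standing in for the prepared heart dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Grid search a small, illustrative hyperparameter grid
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5]}
best_params = find_best_parameters(X, y, RandomForestClassifier(random_state=42), param_grid)

# Cross-validate the tuned model on two metrics
model = RandomForestClassifier(random_state=42, **best_params)
print(cross_validate_scores(X, y, model, cv=5, scoring=['accuracy', 'roc_auc']))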
The project aims to build a binary classification model that predicts the risk of heart failure based on health information. To do so, you’ll go through the standard steps of a Machine Learning project: data exploration, data preparation, modeling with different ML algorithms, and model evaluation.
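Put together, the end-to-end flow that the notebooks implement looks roughly like this (a sketch, not the notebooks’ exact code: the dataset name heart_measures and the logistic regression are illustrative assumptions, while the column names are those of the Kaggle dataset):

import dataiku
from sklearn.linear_model import LogisticRegression

from utils.data_processing import transform_heart_categorical_measures
from utils.model_training import find_best_parameters, cross_validate_scores

# Load the input dataset ("heart_measures" is an assumed dataset name)
df = dataiku.Dataset('heart_measures').get_dataframe()

# Data preparation: integer-code the categorical measures
df = transform_heart_categorical_measures(
    df, 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope', 'Sex')

# Split features and target
X = df.drop(columns=['HeartDisease'])
y = df['HeartDisease']

# Modeling and evaluation (logistic regression is an illustrative choice)
best_params = find_best_parameters(X, y, LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]})
print(cross_validate_scores(X, y, LogisticRegression(max_iter=1000, **best_params)))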
Instructions#
The project is composed of three notebooks (they can be found in the Notebooks
section: </> > Notebooks) that you will run one by one. For each notebook:
Ensure you use the code environment specified in the Prerequisites (py_quickstart). You can change the Python kernel in the Notebook menu under Kernel > Change Kernel.
Run the notebook cell by cell.
For notebooks 1 and 3, follow the instructions in the last section of each notebook to build a new step in the project workflow.
You’ll find the details of these notebooks and the associated outputs in the following sections: