Quickstart Tutorial

In this tutorial, you’ll learn how to build a basic Machine Learning project in Dataiku, from data exploration to machine learning model development, using mainly Jupyter Notebooks.

Prerequisites

  • Have access to a Dataiku 11+ instance

  • Create a Python>=3.6 code environment named heart-attack-project with the following required packages:

    • seaborn==0.11.2

    • mlflow==1.23.1

    • mlflow[extras]

    • scikit-learn==0.24.2

    • protobuf==3.20.*

    Note

    In Dataiku, the equivalent of a virtual environment is called a “code environment.” The code environment documentation provides more information, including instructions for creating a new Python code environment.
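Once the code environment is built, it can be useful to confirm from a notebook that the pinned packages are the ones actually installed. The following is a minimal sketch, not part of the tutorial project: the helper names `matches` and `check_environment` are hypothetical, and it assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata  # standard library on Python 3.8+

# Pins copied from the prerequisites list above. mlflow[extras] has no pin
# of its own, so it is represented here by the mlflow entry.
REQUIRED = {
    "seaborn": "0.11.2",
    "mlflow": "1.23.1",
    "scikit-learn": "0.24.2",
    "protobuf": "3.20.*",
}


def matches(installed, pin):
    """Return True if an installed version satisfies an exact or wildcard pin."""
    if pin.endswith(".*"):
        prefix = pin[:-1]  # "3.20.*" -> "3.20."
        return installed.startswith(prefix) or installed == pin[:-2]
    return installed == pin


def check_environment(required=REQUIRED, lookup=metadata.version):
    """Map each package name to 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, pin in required.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if matches(installed, pin):
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {pin}"
    return report


if __name__ == "__main__":
    # Print one status line per required package.
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

Running this in a notebook that uses the heart-attack-project code environment should report every package as ok. A "missing" or mismatch line suggests the notebook is running in a different environment or that the package list needs to be updated.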

Installation

1. Import the project

On the Dataiku homepage, select + NEW PROJECT > DSS Tutorials. In the Quick Start section, select Developers Quick Start.

Alternatively, you can download the project from this page and then upload the project on your Dataiku instance: + NEW PROJECT > Import project.

2. Set the code environment

To ensure the code environment is automatically selected for running all the Python scripts in your project, we will change the project settings to use it by default.

  • On the top bar, select … > Settings > Code env selection.

  • In the Default Python code env:

    • Change Mode to Select an environment.

    • In the Environment parameter, select the code environment you’ve just created.

    • Click the Save button or press Ctrl+S.

[Screenshot: code environment settings]

Set up the project

This tutorial comes with the following:

  • a README.md file (stored in the project Wiki)

  • an input dataset: the Heart Failure Prediction Dataset

  • three Jupyter Notebooks that you will leverage to build the project

  • a Python repository stored in the project library, with some Python functions that will be used in the different notebooks

The project aims to build a binary Machine Learning model that predicts the risk of heart failure based on health information. To do so, you’ll go through the standard steps of a Machine Learning project: data exploration, data preparation, machine learning modeling using different ML models, and model evaluation.

Instructions

The project is composed of three notebooks (they can be found in the Notebooks section: </> > Notebooks) that you will run one by one. For each notebook:

  1. Ensure you’re using the heart-attack-project code environment (see prerequisites above).

  2. Run the notebook cell by cell.

  3. For notebooks 1 and 3, follow the instructions in the last section of each notebook to build a new step in the project workflow.

You’ll find the details of these notebooks and the associated outputs in the following sections: