Running unit tests on project libraries#

Dataiku’s project libraries allow you to centralize your code and easily call it from various places such as recipes or notebooks. However, as your project’s code base grows, you may want to assess its robustness by testing it.

In this tutorial, you will create and run simple unit tests on transformations applied to a dataset.

Prerequisites#

  • Dataiku >= 11.0

  • Access to a project with the following permissions:

    • Read project content

    • Run scenarios

  • Access to a code environment with the following packages:

    • pytest==7.2.2

Preparing the code and the data#

In the Dataiku web interface, create a dataset from the UCI Bike Sharing dataset available online. The downloadable archive contains two files: use only the hour.csv file to create a dataset in Dataiku, and name it BikeSharingData.

Then, in your project library, create a new directory called bike_sharing under lib/python/ and add two files to it:

  • __init__.py, which you can leave empty

  • prepare.py

The prepare.py file will contain the logic of our data transformation packaged into functions. There will be two of them:

  • with_temp_fahrenheit() will de-normalize the temperature data from the “temp” column and then convert it from Celsius to Fahrenheit degrees.

  • with_datetime() will combine the date and hour data from the “dteday” and “hr” columns into a single “datetime” column in ISO 8601 format.

Here is the code for those functions:

prepare.py#
import pandas as pd
from datetime import datetime

# Useful to de-normalize temperature.
# Check https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset for more details
TEMP_MIN_C = -8.0
TEMP_MAX_C = 39.0

def with_temp_fahrenheit(df: pd.DataFrame, temp_col: str) -> pd.DataFrame:
    """
    Denormalize temperature then convert it to Fahrenheit degrees.

    Args:
        df (pd.DataFrame): Input pandas DataFrame to chain on.
        temp_col (str): DataFrame column name for normalized temperature.
    
    Returns:
        A pandas DataFrame with a new column called "temp_F" containing
        de-normalized temperature in Fahrenheit degrees.
    """
    df["temp_F"] = 1.8 * (TEMP_MIN_C+df[temp_col].astype(float) * (TEMP_MAX_C-TEMP_MIN_C)) + 32.0
    return df

def with_datetime(df: pd.DataFrame,
                  date_col: str,
                  hour_col: str) -> pd.DataFrame:
    """
    Create a proper datetime column.

    Args:
        df (pd.DataFrame): Input pandas DataFrame to chain on.
        date_col (str): Column name in df containing the date value (e.g. 2023-04-29).
        hour_col (str): Column name in df containing the hour value (e.g. 21).
    
    Returns:
        A pandas DataFrame with a new column called "datetime" containing
        date + hour information in ISO8601 format.
    """

    df["datetime"] = (df[date_col] + ' ' + df[hour_col].astype(str)) \
        .apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H")) \
        .apply(datetime.isoformat)
    return df

You can now create a Python recipe taking BikeSharingData as input and writing to a new output dataset called BikeSharingData_prepared. In that recipe, apply the transformations previously mentioned using pandas’ useful pipe() function:

recipe.py#
import dataiku
from bike_sharing import prepare

cols = ["dteday",
        "hr",
        "temp",
        "casual",
        "registered",
        "cnt"]

df = dataiku.Dataset("BikeSharingData") \
    .get_dataframe() \
    [cols] \
    .pipe(prepare.with_temp_fahrenheit, temp_col="temp") \
    .pipe(prepare.with_datetime, date_col="dteday", hour_col="hr")

output_dataset = dataiku.Dataset("BikeSharingData_prepared")
output_dataset.write_with_schema(df)

If you inspect the output dataset, you should see the newly created temp_F and datetime columns.

Writing the tests#

Your Flow is now operational, but what if you wanted to ensure that your code behaves as expected and that errors are caught earlier in the development process? To address those issues, you are going to write unit tests for the with_temp_fahrenheit() and with_datetime() functions.

In a nutshell, unit tests are assertions verifying that atomic parts of your source code operate correctly. In our case, we’ll focus on testing data transformations: submit sample input values for which the expected outcome is known, then compare the result of the function with that outcome.
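
As a minimal, generic illustration (not part of the tutorial’s code), pytest discovers functions whose names start with test_ and runs the assert statements they contain:

def increment(x):
    return x + 1

def test_increment():
    # Known input, known expected outcome.
    assert increment(41) == 42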

To run these tests, you will rely on the pytest package, a popular testing framework. You’ll first need to configure it so that deprecation warnings, which can sometimes occur but should not be blocking, don’t make the tests fail. For that, go back to your project library and, inside the bike_sharing directory, create a new file called pytest.ini with the following content:

pytest.ini#
[pytest]
filterwarnings = 
    ignore::DeprecationWarning

Next, still in the bike_sharing directory, create the file containing the code for your tests. Following the pytest conventions, it should be prefixed with test_, so call it test_prepare.py.

You’ll need to define sample data to run your tests on; the easiest way is to define a 1-record pandas DataFrame following the same schema as the BikeSharingData dataset:

test_prepare.py#
dummy_df = pd.DataFrame({
        "instant": "1",
        "dteday": "2023-01-01",
        "season": "1",
        "yr": "0",
        "mnth": "1",
        "hr": "13",
        "holiday": "0",
        "weekday": "1",
        "workingday": "0",
        "weathersit": "1",
        "temp": "1.0",
        "atemp": "0.5",
        "hum": "0.8",
        "windspeed": "0",
        "casual": "3",
        "registered": "10",
        "cnt": "13"
    }, index=[0])

Now let’s think a bit about how to test our functions.

Temperature conversion#

For with_temp_fahrenheit(), suppose we start with a normalized temperature of 1.0, which should de-normalize to 39.0, the upper boundary given in the dataset documentation. The Celsius-to-Fahrenheit conversion formula is simple: multiply by 1.8, then add 32.0, so 39.0 degrees Celsius equal \(1.8 \times 39.0 + 32.0 = 102.2\) degrees Fahrenheit.
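
You can sanity-check that expected value by hand. Here is a quick sketch of the two steps, reusing the TEMP_MIN_C and TEMP_MAX_C constants defined in prepare.py:

from bike_sharing.prepare import TEMP_MIN_C, TEMP_MAX_C

temp_norm = 1.0
# De-normalize: -8.0 + 1.0 * (39.0 - (-8.0)) = 39.0 degrees Celsius.
temp_c = TEMP_MIN_C + temp_norm * (TEMP_MAX_C - TEMP_MIN_C)
# Convert: 1.8 * 39.0 + 32.0, i.e. approximately 102.2 degrees Fahrenheit.
# Floating-point rounding is why the test below uses pytest.approx().
temp_f = 1.8 * temp_c + 32.0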

In practice, it translates into the following code:

test_prepare.py#
def test_with_temp_fahrenheit():
    out = dummy_df.pipe(with_temp_fahrenheit,
                        temp_col="temp")
    assert out["temp_F"][0] == pytest.approx(102.2)

Date formatting#

For with_datetime(), we’ll refer to the ISO 8601 norm: if the date is 2023-01-01 and the hour is 13, then the resulting ISO 8601 date should be 2023-01-01T13:00:00.

Note that the time is assumed to be local to keep things simple, so there is no time zone designator in the formatted date.
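
If you want to check that format interactively, here is a quick sketch of what with_datetime() does for a single value:

from datetime import datetime

# Parse "date hour" then serialize back to ISO 8601.
dt = datetime.strptime("2023-01-01 13", "%Y-%m-%d %H")
print(dt.isoformat())  # 2023-01-01T13:00:00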

In practice, it translates into the following code:

test_prepare.py#
def test_with_datetime():
    out = dummy_df.pipe(with_datetime, date_col="dteday", hour_col="hr")
    expected_dt = datetime.datetime(2023, 1, 1, 13, 0)
    out_dt = datetime.datetime.fromisoformat(out["datetime"][0])
    assert out_dt == expected_dt

The entire content of your test_prepare.py file should now look like this:

test_prepare.py#
from bike_sharing.prepare import with_temp_fahrenheit
from bike_sharing.prepare import with_datetime
import datetime

import pandas as pd
import pytest

dummy_df = pd.DataFrame({
        "instant": "1",
        "dteday": "2023-01-01",
        "season": "1",
        "yr": "0",
        "mnth": "1",
        "hr": "13",
        "holiday": "0",
        "weekday": "1",
        "workingday": "0",
        "weathersit": "1",
        "temp": "1.0",
        "atemp": "0.5",
        "hum": "0.8",
        "windspeed": "0",
        "casual": "3",
        "registered": "10",
        "cnt": "13"
    }, index=[0])

def test_with_temp_fahrenheit():
    out = dummy_df.pipe(with_temp_fahrenheit,
                        temp_col="temp")
    assert out["temp_F"][0] == pytest.approx(102.2)

def test_with_datetime():
    out = dummy_df.pipe(with_datetime, date_col="dteday", hour_col="hr")
    expected_dt = datetime.datetime(2023, 1, 1, 13, 0)
    out_dt = datetime.datetime.fromisoformat(out["datetime"][0])
    assert out_dt == expected_dt

Our tests are now ready! The only thing that is missing is scheduling their execution.

Running tests#

There are several ways to schedule the execution of your tests, from a purely manual approach to running a full-fledged CI pipeline. In this tutorial, you will take a simple approach by grouping the execution of the tests and the build of the BikeSharingData_prepared dataset into a Dataiku scenario. Concretely, that scenario will build the dataset only if all tests pass; this way, you can effectively guard against unintended behavior in your data transformations.

Go to Scenario > New Scenario > Sequence of steps, call your scenario “Test and build” then click on Create. Then go to Steps > Add step > Execute Python code, and in the “Script” field enter the following code:

step.py#
import pytest
import bike_sharing

from pathlib import Path

lib_path = str(Path(bike_sharing.__file__).parent)

ret = pytest.main(["-x", lib_path])
if ret != 0:
    raise Exception("Tests failed!")

If you are already familiar with pytest, you probably run your tests directly in a terminal with the pytest command. Calling the pytest.main() function within your Python code achieves the same result; the "-x" flag tells pytest to stop at the first failing test. Note that pytest.main() requires the absolute path of your project library directory to retrieve the test code and configuration files, which is why that path is stored in the lib_path variable.
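
As a side note, pytest.main() accepts the same flags as the command-line tool, so you can tune the reporting that ends up in the step log. For example, a variant of the call above with more verbose output (these are standard pytest options, not Dataiku-specific ones):

# "-v" lists each test as it runs; "-rA" appends a summary of all
# test outcomes at the end of the run.
ret = pytest.main(["-x", "-v", "-rA", lib_path])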

Don’t forget that you will need to run this code with the code environment that contains the pytest package! For that, select it in the Environment dropdown list.

The only thing missing now is to build your dataset: go to Add step > Build/Train, then click on Add dataset to build and select BikeSharingData_prepared.

Your scenario is now complete! You can check that it works properly by clicking on Run: it will launch a manual run that should complete successfully. If you want to inspect the pytest output, go to Last runs, select the desired run, then click on “View step log” next to the “Custom Python” step.

Wrapping up#

Congratulations, you have written your first unit tests in Dataiku! In this tutorial, you have defined functions and unit tests for data processing, embedded them into your project library, then automated test execution using a scenario.

You can now extend that logic and write tests to check the data itself; this tutorial has a few examples of how to implement data quality checks.