Experiment tracking with CatBoost#

In this tutorial, you will train a model using the CatBoost framework and use the experiment tracking capabilities of Dataiku to log training runs (parameters, performance metrics).

Prerequisites#

  • Access to a Project with a Dataset that contains the UCI Bank Marketing data

  • A Code Environment containing the mlflow and catboost packages
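
For reference, a minimal package list for that Code Environment could look like the following (unpinned versions are an assumption; pin them to match your Dataiku setup):

mlflow
catboost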

Performing the experiment#

The following code snippet provides a reusable example for training a simple gradient-boosting model, with these main steps:

(1): Select the features and the target variable.

(2): Define the hyperparameters for the training. Set the number of boosting rounds (iterations) to 100 and, to check whether overfitting is occurring during cross-validation, set early_stopping_rounds to 5. To cap the boosting rounds, set use_best_model to True so that training is limited to the iteration with the best score.

(3): Perform the experiment run: log the hyperparameters, the performance metrics (here, the AUC), and the trained model.

import dataiku
from catboost import CatBoostClassifier, Pool, cv

# !! - Replace these values with your own - !!
USER_PROJECT_KEY = ""
USER_XPTRACKING_FOLDER_ID = ""
USER_EXPERIMENT_NAME = ""
USER_TRAINING_DATASET = ""
USER_MLFLOW_CODE_ENV_NAME = ""

client = dataiku.api_client()
project = client.get_project(USER_PROJECT_KEY)

# (1)
ds = dataiku.Dataset(USER_TRAINING_DATASET)
df = ds.get_dataframe()

# Categorical feature columns of the UCI Bank Marketing dataset
cat_features = ["job", "marital", "education", "default",
                "housing", "loan", "month", "contact", "poutcome"]

target = "y"

X = df.drop(target, axis=1)
y = df[target]

# (2)
params = {
    'iterations': 100,
    'learning_rate': 0.1, 
    'depth': 10,
    'cat_features': cat_features,
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'early_stopping_rounds': 5,
    'use_best_model': True,
    'random_seed': 42,
}

# (3)
mf = project.get_managed_folder(USER_XPTRACKING_FOLDER_ID)
mlflow_extension = project.get_mlflow_extension()

with project.setup_mlflow(mf) as mlflow:
    mlflow.set_experiment(experiment_name=USER_EXPERIMENT_NAME)
    with mlflow.start_run() as run:
        run_id = run.info.run_id
        
        # Build a single Pool so that cross-validation handles the categorical features
        cv_dataset = Pool(data=X, label=y, cat_features=cat_features)

        # catboost.cv returns a DataFrame with one row per boosting iteration;
        # with eval_metric='AUC' it contains 'test-AUC-mean' and 'test-AUC-std' columns
        scores = cv(cv_dataset,
                    params,
                    fold_count=5,
                    seed=42,
                    plot=False)

        # Log the cross-validated AUC (mean and standard deviation) at each iteration
        for i in range(len(scores.index)):
            mlflow.log_metric(key='mean_AUC', value=scores['test-AUC-mean'][i], step=i)
            mlflow.log_metric(key='sd_AUC', value=scores['test-AUC-std'][i], step=i)

        mlflow.log_params(params=params)
        
        if params['early_stopping_rounds']:
            # Record how many boosting rounds the cross-validation actually ran
            mlflow.log_metric(key='best_iteration', value=len(scores.index))

        if params['use_best_model']:
            # use_best_model requires an evaluation set, which the final fit on the
            # full data does not have: cap iterations at the cross-validated count instead
            params['iterations'] = len(scores.index)
            params['use_best_model'] = False

        # Train the final model on the full training data
        model = CatBoostClassifier(**params)
        cb_model = model.fit(X, y)
        
        mlflow.catboost.log_model(cb_model, artifact_path="model")
    
        mlflow_extension.set_run_inference_info(run_id=run_id,
            prediction_type="BINARY_CLASSIFICATION",
            classes=['no', 'yes'],
            code_env_name=USER_MLFLOW_CODE_ENV_NAME,
            target=target)
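
The final call to set_run_inference_info attaches prediction metadata (prediction type, class labels, target column, and the Code Environment to use) to the run; Dataiku uses this information later, for instance when deploying the run's logged model as a Saved Model version.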

After these steps, your Experiment Run’s data should be available both in the Dataiku UI and programmatically via the DSSMLflowExtension object of the Python API client.
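
As an illustration, here is a minimal sketch of how you could read the logged runs back with the standard MLflow search API exposed by the setup_mlflow() handle. It reuses the USER_* placeholder values defined above, and the selected metric column assumes the best_iteration metric logged in this tutorial.

import dataiku

client = dataiku.api_client()
project = client.get_project(USER_PROJECT_KEY)
mf = project.get_managed_folder(USER_XPTRACKING_FOLDER_ID)

with project.setup_mlflow(mf) as mlflow:
    # Standard MLflow calls: look up the experiment, then list its runs
    # as a pandas DataFrame (metrics appear as 'metrics.<name>' columns)
    experiment = mlflow.get_experiment_by_name(USER_EXPERIMENT_NAME)
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
    print(runs[["run_id", "status", "metrics.best_iteration"]])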