# Experiment tracking with CatBoost
In this tutorial, you will train a model using the CatBoost framework and use the experiment tracking abilities of Dataiku to log training runs (parameters, performance metrics).
## Prerequisites
- Access to a Project with a Dataset that contains the UCI Bank Marketing data
- A Code Environment containing the `mlflow` and `catboost` packages
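To confirm that the Code Environment is set up correctly, a quick check from a notebook using that environment might look like this (a minimal sketch, not part of the tutorial itself):

```python
# Sanity check: both packages should import from the Code Environment.
import catboost
import mlflow

print("catboost", catboost.__version__)
print("mlflow", mlflow.__version__)
```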
## Performing the experiment
The following code snippet provides a reusable example to train a simple gradient boosting model, with these main steps:

(1): Select the features and the target variable.

(2): Define the hyperparameters to run the training on. Set the number of boosting rounds to 100 and, to check whether overfitting is occurring during cross-validation, set `early_stopping_rounds` to 5. To cap the number of boosting rounds, limit the training to the iteration that has the best score by setting `use_best_model` to `True` (see the standalone sketch after this list).

(3): Perform the experiment run: log the hyperparameters, the performance metrics (here we use the AUC), and the trained model.
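Before the full snippet, here is a minimal standalone sketch of the early-stopping behavior described in step (2). It reuses the `X`, `y`, and `cat_features` objects built in step (1) of the snippet below; the train/validation split and the use of scikit-learn are illustrative assumptions, not part of the tutorial snippet.

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical 80/20 split, only to provide an eval set for early stopping.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

clf = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=10,
                         cat_features=cat_features, eval_metric='AUC',
                         early_stopping_rounds=5, use_best_model=True,
                         random_seed=42)
clf.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# With use_best_model=True, the fitted model is truncated to the
# iteration that scored best on the eval set.
print(clf.get_best_iteration(), clf.tree_count_)
```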
```python
import dataiku
from catboost import CatBoostClassifier, Pool, cv

# !! - Replace these values by your own - !!
USER_PROJECT_KEY = ""
USER_XPTRACKING_FOLDER_ID = ""
USER_EXPERIMENT_NAME = ""
USER_TRAINING_DATASET = ""
USER_MLFLOW_CODE_ENV_NAME = ""

client = dataiku.api_client()
project = client.get_project(USER_PROJECT_KEY)

# (1) Select the features and the target variable.
ds = dataiku.Dataset(USER_TRAINING_DATASET)
df = ds.get_dataframe()
cat_features = ["job", "marital", "education", "default",
                "housing", "loan", "month", "contact", "poutcome"]
target = "y"
X = df.drop(target, axis=1)
y = df[target]

# (2) Define the hyperparameters to run the training on.
params = {
    'iterations': 100,
    'learning_rate': 0.1,
    'depth': 10,
    'cat_features': cat_features,
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'early_stopping_rounds': 5,
    'use_best_model': True,
    'random_seed': 42,
}

# (3) Perform the experiment run.
mf = project.get_managed_folder(USER_XPTRACKING_FOLDER_ID)
mlflow_extension = project.get_mlflow_extension()

with project.setup_mlflow(mf) as mlflow:
    mlflow.set_experiment(experiment_name=USER_EXPERIMENT_NAME)
    with mlflow.start_run() as run:
        run_id = run.info.run_id
        # Cross-validate to measure performance and detect overfitting.
        cv_dataset = Pool(data=X, label=y, cat_features=cat_features)
        scores = cv(cv_dataset,
                    params,
                    fold_count=5,
                    seed=42,
                    plot=False)
        # Log the per-iteration mean and standard deviation of the AUC.
        for x in range(len(scores.index)):
            mlflow.log_metric(key='mean_AUC', value=scores['test-AUC-mean'][x], step=x)
            mlflow.log_metric(key='sd_AUC', value=scores['test-AUC-std'][x], step=x)
        mlflow.log_params(params=params)
        if params['early_stopping_rounds']:
            # cv() stops early, so the number of rows equals the best iteration.
            mlflow.log_metric(key='best_iteration', value=len(scores.index))
        if params['use_best_model']:
            # use_best_model requires an eval set, which the final fit on the
            # full data does not have: cap the iteration count instead.
            params['iterations'] = len(scores.index)
            params['use_best_model'] = False
        model = CatBoostClassifier(**params)
        cb_model = model.fit(X, y)
        # Log the trained model as an MLflow artifact.
        mlflow.catboost.log_model(cb_model, artifact_path="model")
        mlflow_extension.set_run_inference_info(run_id=run_id,
                                                prediction_type="BINARY_CLASSIFICATION",
                                                classes=['no', 'yes'],
                                                code_env_name=USER_MLFLOW_CODE_ENV_NAME,
                                                target=target)
```
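As an optional sanity check (not part of the tutorial snippet), you can reload the logged model from the run and score a few rows. This sketch assumes the `project`, `mf`, `run_id`, and `X` variables from the snippet above:

```python
# Reload the model logged under the run above and inspect its predictions.
with project.setup_mlflow(mf) as mlflow:
    loaded_model = mlflow.catboost.load_model(f"runs:/{run_id}/model")
    print(loaded_model.predict_proba(X.head(5)))
```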
After these steps, you should have your Experiment Run's data available both in the Dataiku UI and programmatically via the `DSSMLflowExtension` object of the Python API client.
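For example, the standard MLflow query API works through the same handle; a minimal sketch, reusing `project`, `mf`, and `USER_EXPERIMENT_NAME` from above:

```python
# List this experiment's runs and one of the metrics logged earlier.
with project.setup_mlflow(mf) as mlflow:
    mlflow.set_experiment(experiment_name=USER_EXPERIMENT_NAME)
    runs_df = mlflow.search_runs()  # pandas DataFrame, one row per run
    print(runs_df[["run_id", "metrics.best_iteration"]])
```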