Experiment Tracking

For usage information and examples, see Experiment Tracking.

Experiment Tracking in DSS uses the MLflow Tracking API.

This section focuses on Dataiku-specific extensions to the MLflow API.

dataikuapi.dss.project.DSSProject.get_mlflow_extension()

Gets a handle to interact with the MLflow extension provided by DSS.
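
For example, a minimal sketch of obtaining the handle (the host, API key and project key below are placeholders):

import dataikuapi

# Connect to DSS (placeholder host and API key)
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")

# Get the MLflow extension handle for this project
mlflow_ext = project.get_mlflow_extension()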

class dataikuapi.dss.mlflow.DSSMLflowExtension(client, project_key)

A handle to interact with specific endpoints of the DSS MLflow integration.

Do not create this directly; use dataikuapi.dss.project.DSSProject.get_mlflow_extension().

list_models(run_id)

Returns the list of models of the given run.

Parameters:

run_id (str) – run_id for which to return a list of models
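
A minimal sketch, assuming the mlflow_ext handle from above and a placeholder run id:

# List the models logged under the given run
models = mlflow_ext.list_models("abc123def456")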

list_experiments(view_type='ACTIVE_ONLY', max_results=1000)

Returns the list of experiments in the DSS project for which MLflow integration is set up.

Parameters:
  • view_type (str) – ACTIVE_ONLY, DELETED_ONLY or ALL

  • max_results (int) – maximum number of results

Return type:

dict
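
For example, assuming the returned dict follows the MLflow REST layout with an "experiments" entry:

# List up to 100 active experiments of the project
response = mlflow_ext.list_experiments(view_type="ACTIVE_ONLY", max_results=100)
for experiment in response.get("experiments", []):
    print(experiment)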

rename_experiment(experiment_id, new_name)

Renames an experiment.

Parameters:
  • experiment_id (str) – experiment id

  • new_name (str) – new name
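
A one-line sketch with a placeholder experiment id:

# Give experiment "12" a more descriptive name
mlflow_ext.rename_experiment("12", "churn-model-experiments")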

restore_experiment(experiment_id)

Restores a deleted experiment.

Parameters:

experiment_id (str) – experiment id

restore_run(run_id)

Restores a deleted run.

Parameters:

run_id (str) – run id

garbage_collect()

Permanently deletes the experiments and runs marked as “Deleted”.
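
A sketch of the delete/restore lifecycle, with placeholder experiment and run ids:

# Bring back an experiment and a run that were previously deleted
mlflow_ext.restore_experiment("12")
mlflow_ext.restore_run("abc123def456")

# Permanently delete everything still marked as "Deleted"
mlflow_ext.garbage_collect()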

create_experiment_tracking_dataset(dataset_name, experiment_ids=[], view_type='ACTIVE_ONLY', filter_expr='', order_by=[], format='LONG')

Creates a virtual dataset exposing experiment tracking data.

Parameters:
  • dataset_name (str) – name of the dataset

  • experiment_ids (list(str)) – list of ids of experiments to filter on. No filtering if empty

  • view_type (str) – one of ACTIVE_ONLY, DELETED_ONLY and ALL. Default is ACTIVE_ONLY

  • filter_expr (str) – MLflow search expression

  • order_by (list(str)) – list of order by clauses. Default is ordered by start_time, then runId

  • format (str) – LONG or JSON. Default is LONG
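
For example, a sketch exposing two experiments as a dataset (the ids, dataset name and filter expression are placeholders):

# Create a virtual dataset over experiments "1" and "2",
# keeping only runs with an AUC above 0.8
mlflow_ext.create_experiment_tracking_dataset(
    "experiment_tracking_data",
    experiment_ids=["1", "2"],
    filter_expr="metrics.auc > 0.8",
    order_by=["start_time DESC"],
    format="LONG",
)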

clean_experiment_tracking_db()

Cleans the experiments, runs, params, metrics, tags, etc. for this project.

This call requires an API key with admin rights.
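
A one-line sketch; note that this irreversibly clears the tracking data of the project:

# Wipe all experiment tracking data for this project (admin API key required)
mlflow_ext.clean_experiment_tracking_db()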

set_run_inference_info(run_id, prediction_type, classes=None, code_env_name=None, target=None)

Sets the type of the model, and optionally other information useful to deploy or evaluate it.

prediction_type must be one of:

  • REGRESSION

  • BINARY_CLASSIFICATION

  • MULTICLASS

  • OTHER

Classes must be specified if and only if the model is a BINARY_CLASSIFICATION or MULTICLASS model.

This information is leveraged to filter saved models on their prediction type, and to prefill the classes when deploying an MLflow model as a version of a DSS Saved Model (using the GUI or deploy_run_model()).

Parameters:
  • run_id (str) – run_id for which to set the inference info

  • prediction_type (str) – prediction type (see the list above)

  • classes (list) – ordered list of classes (only for classification models, see above). Every class will be converted by calling str(). The classes must be specified in the same order as learned by the model. Some flavors such as scikit-learn may allow you to build this list from the model itself.

  • code_env_name (str) – name of an adequate DSS python code environment

  • target (str) – name of the target
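
A sketch for a binary classification run (run id, classes, code env name and target are placeholders):

# Record inference info so DSS can prefill deployment settings
mlflow_ext.set_run_inference_info(
    run_id="abc123def456",
    prediction_type="BINARY_CLASSIFICATION",
    classes=["negative", "positive"],  # same order as learned by the model
    code_env_name="py39-mlflow",
    target="churn",
)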

deploy_run_model(run_id, sm_id, version_id=None, use_inference_info=True, code_env_name=None, evaluation_dataset=None, target_column_name=None, class_labels=None, model_sub_folder=None, selection=None, activate=True, binary_classification_threshold=0.5, use_optimal_threshold=True, skip_expensive_reports=False)

Deploys a model from an experiment run, with lineage.

Simple usage:

mlflow_ext.set_run_inference_info(run_id, "BINARY_CLASSIFICATION", list_of_classes, code_env_name, target_column_name)
sm_id = project.create_mlflow_pyfunc_model("model_name", "BINARY_CLASSIFICATION").id
mlflow_ext.deploy_run_model(run_id, sm_id, evaluation_dataset=evaluation_dataset)

If the optional evaluation_dataset is not set, the model is deployed but not evaluated; this also makes target_column_name optional in set_run_inference_info().

Parameters:
  • run_id (str) – The id of the run to deploy

  • sm_id (str) – The id of the saved model to deploy the run to

  • version_id (str) – [optional] Unique identifier of a Saved Model version. If it already exists, the existing version is overwritten. Whitespace and dashes are not allowed. If not set, a timestamp is used as the version_id.

  • use_inference_info (bool) – [optional] Defaults to True. If set, uses the inference info previously set on the run with set_run_inference_info() to retrieve the prediction type of the model, its code environment, classes and target.

  • evaluation_dataset (str) – [optional] The dataset to evaluate the deployed model on, if an evaluation is desired.

  • target_column_name (str) – [optional] The target column of the evaluation dataset. Can be set with set_run_inference_info().

  • class_labels (list(str)) – [optional] The class labels of the target. Can be set with set_run_inference_info().

  • code_env_name (str) – [optional] The code environment to be used. Must contain a supported version of the mlflow package and the ML libraries used to train the model. Can be set with set_run_inference_info().

  • model_sub_folder (str) – [optional] The name of the subfolder containing the model. Required only if the run contains several models. Existing values can be retrieved with project.get_mlflow_extension().list_models(run_id).

  • selection (DSSDatasetSelectionBuilder or dict) – [optional] Sampling parameter for the evaluation. Defaults to HEAD_SEQUENTIAL with a maxRecords of 10_000. For example:

    • Example 1: DSSDatasetSelectionBuilder().with_head_sampling(100)

    • Example 2: {"samplingMethod": "HEAD_SEQUENTIAL", "maxRecords": 100}

  • activate (bool) – [optional] True by default. Whether to activate the version after deployment.

  • binary_classification_threshold (float) – [optional] Threshold (or cut-off) value to use if the model is a binary classification model.

  • use_optimal_threshold (bool) – [optional] Whether to use the optimal threshold for the saved model metric computed at evaluation.

  • skip_expensive_reports (bool) – [optional] Don’t compute expensive report screens (e.g. feature importance).

Returns:

a handle to interact with the new MLflow model version

Return type:

dataikuapi.dss.savedmodel.ExternalModelVersionHandler

import_analyses_models_into_experiment(model_ids, experiment_id)

Imports models from a visual ML analysis into an existing experiment.

Usage example:

import dataiku

# Connect to the DSS instance (assumes this code runs inside DSS)
client = dataiku.api_client()

# Retrieve all the trained model ids of the first task of the first analysis of a project
project = client.get_project("YOUR_PROJECT_ID")
first_analysis_id = project.list_analyses()[0]['analysisId']
first_analysis = project.get_analysis(first_analysis_id)
first_task_id = first_analysis.list_ml_tasks()['mlTasks'][0]['mlTaskId']
first_task = first_analysis.get_ml_task(first_task_id)
full_model_ids = first_task.get_trained_models_ids()

# Create a new experiment
with project.setup_mlflow(project.create_managed_folder("mlflow")) as mlflow:
    experiment_id = mlflow.create_experiment("Sample export of DSS visual analysis models")

# Import the retrieved model ids into the created experiment
project.get_mlflow_extension().import_analyses_models_into_experiment(full_model_ids, experiment_id)

Parameters:
  • model_ids (list of str) – IDs of models from a Visual Analysis.

  • experiment_id (str) – ID of the experiment into which the visual analysis models will be imported.