Experiment Tracking#
For usage information and examples, see Experiment Tracking.
Experiment Tracking in DSS uses the MLflow Tracking API.
This section focuses on Dataiku-specific extensions to the MLflow API.
Get a handle to interact with the extension of MLflow provided by DSS
- class dataikuapi.dss.mlflow.DSSMLflowExtension(client, project_key)#
A handle to interact with specific endpoints of the DSS MLflow integration.
Do not create this directly; use
dataikuapi.dss.project.DSSProject.get_mlflow_extension()
- list_models(run_id)#
Returns the list of models of the given run
- Parameters:
run_id (str) – run_id for which to return a list of models
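A minimal usage sketch (the DSS URL, API key, project key and run id below are placeholders; this assumes a reachable DSS instance):

```python
import dataikuapi

# Connect to DSS (URL and API key are placeholders)
client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")
mlflow_ext = project.get_mlflow_extension()

# List the models logged under a given run
models = mlflow_ext.list_models("YOUR_RUN_ID")
print(models)
```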
- list_experiments(view_type='ACTIVE_ONLY', max_results=1000)#
Returns the list of experiments in the DSS project for which MLflow integration is set up
- Parameters:
view_type (str) – ACTIVE_ONLY, DELETED_ONLY or ALL
max_results (int) – maximum number of results to return
- Return type:
dict
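For instance, to list all experiments including deleted ones (assuming `project` is an existing `DSSProject` handle, and that the returned dict follows the MLflow REST layout with an `experiments` key):

```python
mlflow_ext = project.get_mlflow_extension()

experiments = mlflow_ext.list_experiments(view_type="ALL", max_results=500)
# Assumption: the returned dict mirrors the MLflow REST response layout
for experiment in experiments.get("experiments", []):
    print(experiment.get("experiment_id"), experiment.get("name"))
```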
- rename_experiment(experiment_id, new_name)#
Renames an experiment
- Parameters:
experiment_id (str) – experiment id
new_name (str) – new name
- restore_experiment(experiment_id)#
Restores a deleted experiment
- Parameters:
experiment_id (str) – experiment id
- restore_run(run_id)#
Restores a deleted run
- Parameters:
run_id (str) – run id
- garbage_collect()#
Permanently deletes the experiments and runs marked as “Deleted”
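The lifecycle calls above can be chained; a sketch with placeholder ids (note that garbage collection is irreversible):

```python
mlflow_ext = project.get_mlflow_extension()  # project: a DSSProject handle

# Rename an experiment, then bring back items marked as "Deleted"
mlflow_ext.rename_experiment("EXPERIMENT_ID", "new-experiment-name")
mlflow_ext.restore_experiment("EXPERIMENT_ID")
mlflow_ext.restore_run("RUN_ID")

# Permanently remove anything still marked as "Deleted"
mlflow_ext.garbage_collect()
```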
- create_experiment_tracking_dataset(dataset_name, experiment_ids=[], view_type='ACTIVE_ONLY', filter_expr='', order_by=[], format='LONG')#
Creates a virtual dataset exposing experiment tracking data.
- Parameters:
dataset_name (str) – name of the dataset
experiment_ids (list(str)) – list of ids of experiments to filter on. No filtering if empty
view_type (str) – one of ACTIVE_ONLY, DELETED_ONLY and ALL. Default is ACTIVE_ONLY
filter_expr (str) – MLflow search expression
order_by (list(str)) – list of order by clauses. Default is ordered by start_time, then runId
format (str) – LONG or JSON. Default is LONG
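For example, a dataset exposing the active runs of two experiments, filtered with the standard MLflow search syntax (the dataset name, experiment ids and metric name are placeholders):

```python
mlflow_ext.create_experiment_tracking_dataset(
    dataset_name="experiments_tracking",
    experiment_ids=["1", "2"],             # empty list = no filtering
    view_type="ACTIVE_ONLY",
    filter_expr="metrics.accuracy > 0.8",  # MLflow search expression
    order_by=["start_time DESC"],
    format="LONG",
)
```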
- clean_experiment_tracking_db()#
Cleans the experiments, runs, params, metrics, tags, etc. for this project
This call requires an API key with admin rights
- set_run_inference_info(run_id, prediction_type, classes=None, code_env_name=None, target=None)#
Sets the type of the model, and optionally other information useful to deploy or evaluate it.
prediction_type must be one of:
REGRESSION
BINARY_CLASSIFICATION
MULTICLASS
OTHER
classes must be specified if and only if the model is a BINARY_CLASSIFICATION or MULTICLASS model.
This information is leveraged to filter saved models on their prediction type and to prefill the classes when deploying an MLflow model as a version of a DSS Saved Model (using the GUI or deploy_run_model()).
- Parameters:
prediction_type (str) – prediction type (see doc)
run_id (str) – run_id for which to set the classes
classes (list) – ordered list of classes (not for all prediction types, see doc). Every class will be converted by calling str(). The classes must be specified in the same order as learned by the model. Some flavors such as scikit-learn may allow you to build this list from the model itself.
code_env_name (str) – name of an adequate DSS python code environment
target (str) – name of the target
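For a binary classification run, the call could look like this (the run id, code environment and column names are placeholders; the class order must match what the model learned):

```python
mlflow_ext.set_run_inference_info(
    run_id="YOUR_RUN_ID",
    prediction_type="BINARY_CLASSIFICATION",
    classes=["negative", "positive"],  # same order as learned by the model
    code_env_name="py39_mlflow",       # hypothetical DSS code env name
    target="label",
)
```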
- deploy_run_model(run_id, sm_id, version_id=None, use_inference_info=True, code_env_name=None, evaluation_dataset=None, target_column_name=None, class_labels=None, model_sub_folder=None, selection=None, activate=True, binary_classification_threshold=0.5, use_optimal_threshold=True, skip_expensive_reports=False)#
Deploys a model from an experiment run, with lineage.
Simple usage:
mlflow_ext.set_run_inference_info(run_id, "BINARY_CLASSIFICATION", list_of_classes, code_env_name, target_column_name)
sm_id = project.create_mlflow_pyfunc_model("model_name", "BINARY_CLASSIFICATION").id
mlflow_ext.deploy_run_model(run_id, sm_id, evaluation_dataset=evaluation_dataset)
If the optional evaluation_dataset is not set, the model will be deployed but not evaluated; this also makes target_column_name optional in set_run_inference_info()
- Parameters:
run_id (str) – The id of the run to deploy
sm_id (str) – The id of the saved model to deploy the run to
version_id (str) – [optional] Unique identifier of a Saved Model Version. If it already exists, existing version is overwritten. Whitespaces or dashes are not allowed. If not set, a timestamp will be used as version_id.
use_inference_info (bool) – [optional] defaults to True. If set, uses the set_run_inference_info() previously done on the run to retrieve the prediction type of the model, its code environment, classes and target.
evaluation_dataset (str) – [optional] The evaluation dataset, if the deployment of the model implies an evaluation.
target_column_name (str) – [optional] The target column of the evaluation dataset. Can be set by set_run_inference_info().
class_labels (list(str)) – [optional] The class labels of the target. Can be set by set_run_inference_info().
code_env_name (str) – [optional] The code environment to be used. Must contain a supported version of the mlflow package and the ML libs used to train the model. Can be set by set_run_inference_info().
model_sub_folder (str) – [optional] The name of the subfolder containing the model. Optional if it is unique. Existing values can be retrieved with project.get_mlflow_extension().list_models(run_id)
selection (DSSDatasetSelectionBuilder or dict) – [optional] sampling parameter for the evaluation. Defaults to HEAD_SEQUENTIAL with a maxRecords of 10_000.
Example 1:
DSSDatasetSelectionBuilder().with_head_sampling(100)
Example 2:
{"samplingMethod": "HEAD_SEQUENTIAL", "maxRecords": 100}
activate (bool) – [optional] True by default. Whether to activate the version after deployment
binary_classification_threshold (float) – [optional] Threshold (cut-off) value to use if the model is a binary classification model
use_optimal_threshold (bool) – [optional] Whether to use the optimal threshold for the saved model metric computed at evaluation
skip_expensive_reports (bool) – [optional] Skip computing expensive report screens (e.g. feature importance)
- Returns:
a handle to interact with the new MLflow model version
- Return type:
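Putting the parameters together, a deployment with evaluation and sampling might look like this (the run id, saved model id and dataset name are placeholders; DSSDatasetSelectionBuilder is assumed to be importable from dataikuapi.dss.utils):

```python
from dataikuapi.dss.utils import DSSDatasetSelectionBuilder

version = mlflow_ext.deploy_run_model(
    run_id="YOUR_RUN_ID",
    sm_id=sm_id,  # id of a saved model created with create_mlflow_pyfunc_model()
    evaluation_dataset="eval_dataset",
    selection=DSSDatasetSelectionBuilder().with_head_sampling(1000),
    activate=True,
)
```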
- import_analyses_models_into_experiment(model_ids, experiment_id)#
Import models from a visual ML analysis into an existing experiment.
Usage example:
import dataiku

# Retrieve all the trained model ids of the first task of the first analysis of a project
client = dataiku.api_client()
project = client.get_project("YOUR_PROJECT_ID")
first_analysis_id = project.list_analyses()[0]['analysisId']
first_analysis = project.get_analysis(first_analysis_id)
first_task_id = first_analysis.list_ml_tasks()['mlTasks'][0]['mlTaskId']
first_task = first_analysis.get_ml_task(first_task_id)
full_model_ids = first_task.get_trained_models_ids()

# Create a new experiment
with project.setup_mlflow(project.create_managed_folder("mlflow")) as mlflow:
    experiment_id = mlflow.create_experiment("Sample export of DSS visual analysis models")

# Export the retrieved model ids to the created experiment
project.get_mlflow_extension().import_analyses_models_into_experiment(full_model_ids, experiment_id)
- Parameters:
model_ids (list of str) – IDs of models from a Visual Analysis.
experiment_id (str) – ID of the experiment into which the visual analysis models will be imported.