Machine learning#

For usage information and examples, see Visual Machine learning

API Reference#

Interaction with a ML Task#

class dataikuapi.dss.ml.DSSMLTask(client, project_key, analysis_id, mltask_id)#

A handle to interact with a ML Task for prediction or clustering in a DSS visual analysis.

Important

To create a new ML Task, use one of the following methods: dataikuapi.dss.project.DSSProject.create_prediction_ml_task(), dataikuapi.dss.project.DSSProject.create_clustering_ml_task() or dataikuapi.dss.project.DSSProject.create_timeseries_forecasting_ml_task().

static from_full_model_id(client, fmi, project_key=None)#

Static method returning a DSSMLTask object representing a pre-existing ML Task

Parameters:

client (DSSClient) – An instantiated DSSClient
fmi (str) – The Full Model ID of the ML Task
project_key (str, optional) – The project key of the project containing the ML Task (defaults to None)

Returns:

The DSSMLTask

Return type:

DSSMLTask

delete()#: Deletes the ML task

wait_guess_complete()#

Waits for the ML Task guessing to be complete.

This should be called immediately after the creation of a new ML Task if the ML Task was created with wait_guess_complete = False, before calling get_settings() or train().

get_status()#

Gets the status of this ML Task

Returns:: A dictionary containing the ML Task status
Return type:: dict

get_settings()#

Gets the settings of this ML Task.

This should be used whenever you need to modify the settings of an existing ML Task.

Returns:: A DSSMLTaskSettings object.
Return type:: dataikuapi.dss.ml.DSSMLTaskSettings

train(session_name=None, session_description=None, run_queue=False)#

Trains models for this ML Task.

This method waits for training to complete. If you instead want to train asynchronously, use start_train() and wait_train_complete().

This method returns a list of trained model identifiers. These refer to models that have been trained during this specific training session, rather than all of the trained models available on this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().

These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow().

Parameters:

session_name (str, optional) – Optional name for the session (defaults to None)
session_description (str, optional) – Optional description for the session (defaults to None)
run_queue (bool) – Whether to run any queued sessions after the training completes (defaults to False)

Returns:

A list of model identifiers

Return type:

list[str]

ensemble(model_ids, method)#

Creates an ensemble model from a set of models.

This method waits for the ensemble training to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete().

This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().

This returned identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow().

Parameters:

model_ids (list[str]) – A list of model identifiers to ensemble (must not be empty)
method (str) – The ensembling method. Must be one of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL

Returns:

The model identifier of the resulting ensemble model

Return type:

str

start_train(session_name=None, session_description=None, run_queue=False)#

Starts asynchronously a new training session for this ML Task.

This returns immediately, before training is complete. To wait for training to complete, use wait_train_complete().

Parameters:

session_name (str, optional) – Optional name for the session (defaults to None)
session_description (str, optional) – Optional description for the session (defaults to None)
run_queue (bool) – Whether to run any queued sessions after the training completes (defaults to False)

start_ensembling(model_ids, method)#

Creates asynchronously an ensemble model from a set of models

This returns immediately, before training is complete. To wait for training to complete, use wait_train_complete()

Parameters:

model_ids (list[str]) – A list of model identifiers to ensemble (must not be empty)
method (str) – The ensembling method. Must be one of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL

Returns:

The model identifier of the ensemble

Return type:

str

wait_train_complete()#

Waits for training to be completed

To be used following any asynchronous training started with start_train() or start_ensembling()

get_trained_models_ids(session_id=None, algorithm=None)#

Gets the list of trained model identifiers for this ML task.

These identifiers can be used for get_trained_model_snippet() and deploy_to_flow().

The two optional filter params can be used together.

Parameters:

session_id (str, optional) – Optional filter to return only IDs of models from a specific session.
algorithm (str, optional) – Optional filter to return only IDs of models with a specific algorithm.

Returns:

A list of model identifiers

Return type:

list[str]

get_trained_model_snippet(id=None, ids=None)#

Gets a quick summary of a trained model, as a dict.

This method can either be given a single model id, via the id param, or a list of model ids, via the ids param.

For complete model information and a structured object, use get_trained_model_details().

Parameters:

id (str, optional) – A model id (defaults to None)
ids (list[str]) – A list of model ids (defaults to None)

Returns:

Either a quick summary of one trained model as a dict, or a list of model summary dicts

Return type:

Union[dict, list[dict]]

get_trained_model_details(id)#

Gets details for a trained model.

Parameters:: id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
Returns:: A DSSTrainedPredictionModelDetails or DSSTrainedClusteringModelDetails representing the details of this trained model.
Return type:: Union[DSSTrainedPredictionModelDetails, DSSTrainedClusteringModelDetails, DSSTrainedTimeseriesForecastingModelDetails]

delete_trained_model(model_id)#

Deletes a trained model

Parameters:: model_id (str) – Model identifier, as returned by get_trained_models_ids().

train_queue()#

Trains each session in this ML Task’s queue, or until the queue is paused.

Returns:: A dict including the next sessionID to be trained in the queue
Return type:: dict

deploy_to_flow(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)#

Deploys a trained model from this ML Task to the flow.

Creates a new saved model and its parent training recipe in the Flow.

Parameters:

model_id (str) – Model identifier, as returned by get_trained_models_ids()
model_name (str) – Name of the saved model when deployed to the Flow
train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
test_dataset (str, optional) – Name of the dataset to use as test set. If None (default), the train/test split will be applied over the train set. Only for PREDICTION tasks. May either be a short name or a PROJECT.name long name (when using a shared dataset).
redo_optimization (bool) – Whether to redo the hyperparameter optimization phase (defaults to True). Only for PREDICTION tasks.

Returns:

A dict containing: “savedModelId” and “trainRecipeName” - Both can be used to obtain further handles

Return type:

dict

redeploy_to_flow(model_id, recipe_name=None, saved_model_id=None, activate=True, redo_optimization=False, redo_threshold_optimization=True, fixed_threshold=None)#

Redeploys a trained model from this ML Task to an existing saved model and training recipe in the flow.

Either the training recipe recipe_name or the saved_model_id needs to be specified.

Parameters:

model_id (str) – Model identifier, as returned by get_trained_models_ids()
recipe_name (str, optional) – Name of the training recipe to update (defaults to None)
saved_model_id (str, optional) – Name of the saved model to update (defaults to None)
activate (bool) – If True (default), make the newly deployed model version become the active version
redo_optimization (bool) – Whether to re-run the model optimization (hyperparameter search) on every train
redo_threshold_optimization (bool) – Whether to redo the model threshold Optimization on every train (for binary classification models)
fixed_threshold (float, optional) – Value to use as fixed threshold. Must be set if redoThresholdOptimization is False (for binary classification models)

Returns:

A dict containing: “impactsDownstream” - whether the active saved mode version changed and downstream recipes are impacted

Return type:

dict

remove_unused_splits()#

Deletes all stored split data that is no longer in use for this ML Task.

You should generally not need to call this method.

remove_all_splits()#

Deletes all stored split data for this ML Task.

This operation saves disk space.

After performing this operation, it will not be possible anymore to:

Ensemble already trained models
View the “predicted data” or “charts” for already trained models
Resume training of models for which optimization had been previously interrupted

Training new models remains possible

guess(prediction_type=None, reguess_level=None, target_variable=None, timeseries_identifiers=None, time_variable=None, full_reguess=None)#

Reguess the settings of the ML Task.

When no optional parameters are given, this will reguess all the settings of the ML Task.

For prediction ML tasks only, a new target variable or prediction type can be passed, and this will subsequently reguess the impacted settings.

Parameters:

prediction_type (str, optional) – The desired prediction type. Only valid for prediction tasks of either BINARY_CLASSIFICATION, MULTICLASS or REGRESSION type, ignored otherwise. Cannot be set if either target_variable, time_variable, or timeseries_identifiers is also specified. (defaults to None)
target_variable (str, optional) – The desired target variable. Only valid for prediction tasks, ignored for clustering. Cannot be set if either prediction_type, time_variable, or timeseries_identifiers is also specified. (defaults to None)
timeseries_identifiers (list[str], optional) – Only valid for time series forecasting tasks. List of columns to be used as time series identifiers. Cannot be set if either prediction_type, target_variable, or time_variable is also specified. (defaults to None)
time_variable (str, optional) – The desired time variable column. Only valid for time series forecasting tasks. Cannot be set if either prediction_type, target_variable, or timeseries_identifiers is also specified. (defaults to None)
full_reguess (bool, optional) – Scope of the reguess process: whether it should reguess all the settings after changing a core parameter, or only reguess impacted settings (e.g. target remapping when changing the target, metrics when changing the prediction type…). Ignored if no core parameter is given. Only valid for prediction tasks and therefore also ignored for clustering. (defaults to True)
reguess_level (str, optional) –
Deprecated, use full_reguess instead. Only valid for prediction tasks. Can be one of the following values:
- TARGET_CHANGE: Change the target if target_variable is specified, reguess the target remapping, and clear the model’s assertions if any. Equivalent to full_reguess=False (recommended usage)
- FULL_REGUESS: All the settings of the ML task are reguessed. Equivalent to full_reguess=True (recommended usage)

Manipulation of settings#

class dataikuapi.dss.ml.HyperparameterSearchSettings(raw_settings)#

Object to read and modify hyperparameter search settings.

This is available for all non-clustering ML Tasks.

Important

Do not create this class directly, use AbstractTabularPredictionMLTaskSettings.get_hyperparameter_search_settings()

property strategy#

Returns:: The hyperparameter search strategy. Will be one of “GRID” | “RANDOM” | “BAYESIAN”.
Return type:: str

set_grid_search(shuffle=True, seed=1337)#

Sets the search strategy to “GRID”, to perform a grid-search over the hyperparameters.

Parameters:

shuffle (bool) – if True (default), iterate over a shuffled grid as opposed to lexicographical iteration over the cartesian product of the hyperparameters
seed (int) – Seed value used to ensure reproducible results (defaults to 1337)

set_random_search(seed=1337)#

Sets the search strategy to “RANDOM”, to perform a random search over the hyperparameters.

Parameters:: seed (int) – Seed value used to ensure reproducible results (defaults to 1337)

set_bayesian_search(seed=1337)#

Sets the search strategy to “BAYESIAN”, to perform a Bayesian search over the hyperparameters.

Parameters:: seed (int) – Seed value used to ensure reproducible results (defaults to 1337)

property validation_mode#

Returns:: The cross-validation strategy. Will be one of “KFOLD” | “SHUFFLE” | “TIME_SERIES_KFOLD” | “TIME_SERIES_SINGLE_SPLIT” | “CUSTOM”.
Return type:: str

property fold_offset#

Returns:: Whether there is an offset between validation sets, to avoid overlap between cross-test sets (model evaluation) and cross-validation sets (hyperparameter search), if both are using k-fold. Only relevant for time series forecasting
Return type:: bool

property equal_duration_folds#

Returns:: Whether every fold in cross-test and cross-validation should be of equal duration when using k-fold. Only relevant for time series forecasting.
Return type:: bool

property cv_seed#

Returns:: cross-validation seed for splitting the data during hyperparameter search
Return type:: int

set_kfold_validation(n_folds=5, stratified=True, cv_seed=1337)#

Sets the validation mode to k-fold cross-validation.

The mode will be set to either “KFOLD” or “TIME_SERIES_KFOLD”, depending on whether time-based ordering is enabled.

Parameters:

n_folds (int) – The number of folds used for the hyperparameter search (defaults to 5)
stratified (bool) – If True, keep the same proportion of each target classes in all folds (defaults to True)
cv_seed (int) – Seed for cross-validation (defaults to 1337)

set_single_split_validation(split_ratio=0.8, stratified=True, cv_seed=1337)#

Sets the validation mode to single split.

The mode will be set to either “SHUFFLE” or “TIME_SERIES_SINGLE_SPLIT”, depending on whether time-based ordering is enabled.

Parameters:

split_ratio (float) – The ratio of the data used for training during hyperparameter search (defaults to 0.8)
stratified (bool) – If True, keep the same proportion of each target classes in both splits (defaults to True)
cv_seed (int) – Seed for cross-validation (defaults to 1337)

set_custom_validation(code=None)#

Sets the validation mode to “CUSTOM”, and sets the custom validation code.

Your code must create a ‘cv’ variable. This ‘cv’ must be compatible with the scikit-learn ‘CV splitter class family.

Example splitter classes can be found here: https://scikit-learn.org/stable/modules/classes.html#splitter-classes

This example code uses the ‘repeated K-fold’ splitter of scikit-learn:

from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=3, n_repeats=5)

Parameters:: code (str) – definition of the validation

set_search_distribution(distributed=False, n_containers=4)#

Sets the distribution parameters for the hyperparameter search execution.

Parameters:

distributed (bool) – if True, distribute search in the Kubernetes cluster selected in the runtime environment’s containerized execution configuration (defaults to False)
n_containers (int) – number of containers to use for the distributed search (defaults to 4)

property distributed#

Returns:: Whether the search is set to distributed
Return type:: bool

property timeout#

Returns:: The search timeout
Return type:: int

property n_iter#

Returns:: The number of search iterations
Return type:: int

property parallelism#

Returns:: The number of threads used for the search
Return type:: int

class dataikuapi.dss.ml.DSSMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#

Object to read and modify the settings of an existing ML task.

Important

Do not create this class directly, use DSSMLTask.get_settings() instead

Usage example:

project_key = 'DKU_CHURN'
fmi = 'A-DKU_CHURN-RADgquHe-5nJtl88L-s1-pp1-m1'

client = dataiku.api_client()
task = dataikuapi.dss.ml.DSSMLTask.from_full_model_id(client, fmi, project_key)
task_settings = task.get_settings()
task_settings.set_diagnostics_enabled(False)
task_settings.save()

get_raw()#

Gets the raw settings of this ML Task.

This returns a reference to the raw settings, rather than a copy, so any changes made to the returned object will be reflected when saving.

Returns:: The raw settings of this ML Task
Return type:: dict

get_feature_preprocessing(feature_name)#

Gets the feature preprocessing parameters for a particular feature.

This returns a reference to the selected features’ settings, rather than a copy,: so any changes made to the returned object will be reflected when saving.

Parameters:: feature_name (str) – Name of the feature whose parameters will be returned
Returns:: A dict of the preprocessing settings for a feature
Return type:: dict

foreach_feature(fn, only_of_type=None)#

Applies a function to all features, including REJECTED features, except for the target feature

Parameters:

fn (function) – Function handle of the form fn(feature_name, feature_params) -> dict, where feature_name is the feature name as a str, and feature_params is a dict containing the specific feature params. The function should return a dict of edited parameters for the feature.
only_of_type (Union[str, None], optional) – If set, only applies the function to features matching the given type. Must be one of CATEGORY, NUMERIC, TEXT or VECTOR.

reject_feature(feature_name)#

Marks a feature as ‘rejected’, disabling it from being used as an input when training. This reverses the effect of the use_feature() method.

Parameters:: feature_name (str) – Name of the feature to reject

use_feature(feature_name)#

Marks a feature to be used (enabled) as an input when training. This reverses the effect of the reject_feature() method.

Parameters:: feature_name (str) – Name of the feature to use/enable

get_algorithm_settings(algorithm_name)#: Caution

Not Implemented, throws NotImplementedError

get_diagnostics_settings()#

Gets the ML Tasks diagnostics’ settings.

This returns a reference to the diagnostics’ settings, rather than a copy, so changes made to the returned object will be reflected when saving.

This method returns a dictionary of the settings with:

enabled (boolean): Indicates if the diagnostics are enabled globally, if False, all diagnostics will be disabled
settings (List[dict]): A list of dicts.
Each dict will contain the following:
- type (str): The diagnostic type name, in uppercase
- enabled (boolean): Indicates if the diagnostic type is enabled. If False, all diagnostics of that type will be disabled

Please refer to the documentation for details on available diagnostics.

Returns:: A dict of diagnostics settings
Return type:: dict

set_diagnostics_enabled(enabled)#

Globally enables or disables the calculation of all diagnostics

Parameters:: enabled (bool) – True if the diagnostics should be enabled, False otherwise

set_diagnostic_type_enabled(diagnostic_type, enabled)#

Enables or disables the calculation of a set of diagnostics given their type.

Attention

This is overridden by whether diagnostics are enabled globally; If diagnostics are disabled globally, nothing will be calculated.

Diagnostics can be enabled/disabled globally via the set_diagnostics_enabled() method.

Usage example:

mltask_settings = task.get_settings()
mltask_settings.set_diagnostics_enabled(True)
mltask_settings.set_diagnostic_type_enabled("ML_DIAGNOSTICS_DATASET_SANITY_CHECKS", False)
mltask_settings.set_diagnostic_type_enabled("ML_DIAGNOSTICS_LEAKAGE_DETECTION", False)
mltask_settings.save()

Please refer to the documentation for details on available diagnostics.

Parameters:

diagnostic_type (str) – Name of the diagnostic type, in uppercase.
enabled (bool) – True if the diagnostic should be enabled, False otherwise

set_algorithm_enabled(algorithm_name, enabled)#

Enables or disables an algorithm given its name.

Exact algorithm names can be found using the get_all_possible_algorithm_names() method.

Please refer to the documentation for further information on available algorithms.

Parameters:

algorithm_name (str) – Name of the algorithm, in uppercase.
enabled (bool) – True if the algorithm should be enabled, False otherwise

disable_all_algorithms()#: Disables all algorithms

get_all_possible_algorithm_names()#

Gets the list of possible algorithm names

This can be used to find the list of valid identifiers for the set_algorithm_enabled() and get_algorithm_settings() methods.

This includes all possible algorithms, regardless of the prediction kind (regression/classification etc) or engine, so some algorithms may be irrelevant to the current task.

Returns:: The list of algorithm names as a list of strings
Return type:: list[str]

get_enabled_algorithm_names()#

Gets the list of enabled algorithm names

Returns:: The list of enabled algorithm names
Return type:: list[str]

get_enabled_algorithm_settings()#

Gets the settings for each enabled algorithm

Returns a dictionary where:

Each key is the name of an enabled algorithm
Each value is the result of calling get_algorithm_settings() with the key as the parameter

Returns:: The dict of enabled algorithm names with their settings
Return type:: dict

set_metric(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False, custom_metric_name=None)#

Sets the score metric to optimize for a prediction ML Task

When using a custom optimisation metric, the metric parameter must be kept as None, and a string containing the metric code should be passed to the custom_metric parameter.

Parameters:

metric (str, optional) – Name of the metric to use. Must be left empty to use a custom metric (defaults to None).
custom_metric (str, optional) – Code for the custom optimisation metric (defaults to None)
custom_metric_greater_is_better (bool, optional) – Whether the custom metric function returns a score (True, default) or a loss (False). Score functions return higher values as the model improves, whereas loss functions return lower values.
custom_metric_use_probas (bool, optional) – If True, will use the classes’ probas or the predicted value (for classification) (defaults to False)
custom_metric_name (str, optional) – Name of your custom metric. If not set, it will generate one.

add_custom_python_model(name='Custom Python Model', code='')#

Adds a new custom python model and enables it.

Your code must create a ‘clf’ variable. This clf must be a scikit-learn compatible estimator, ie, it should:

have at least fit(X,y) and predict(X) methods
inherit sklearn.base.BaseEstimator
handle the attributes in the __init__ function
have a classes_ attribute (for classification tasks)
have a predict_proba method (optional)

Example:

mltask_settings = task.get_settings()
code = """
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=20)
"""
mltask_settings.add_custom_python_model(name="sklearn adaboost custom", code=code)
mltask_settings.save()

See: https://doc.dataiku.com/dss/latest/machine-learning/custom-models.html

Parameters:

name (str) – The name of the custom model (defaults to “Custom Python Model”)
code (str) – The code for the custom model (defaults to “”)

add_custom_mllib_model(name='Custom MLlib Model', code='')#

Adds a new custom MLlib model and enables it

This example has sample code that uses a standard MLlib algorithm, the RandomForestClassifier:

mltask_settings = task.get_settings()

code = """
// import the Estimator from spark.ml
import org.apache.spark.ml.classification.RandomForestClassifier

// instantiate the Estimator
new RandomForestClassifier()
  .setLabelCol("Survived")  // Must be the target column
  .setFeaturesCol("__dku_features")  // Must always be __dku_features
  .setPredictionCol("prediction")    // Must always be prediction
  .setNumTrees(50)
  .setMaxDepth(8)
"""
mltask_settings.add_custom_mllib_model(name="spark random forest custom", code=code)
mltask_settings.save()

Parameters:

name (str) – The name of the custom model (defaults to “Custom MLlib Model”)
code (str) – The code for the custom model (defaults to “”)

save()#: Saves the settings back to the ML Task

class dataikuapi.dss.ml.DSSPredictionMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#

class PredictionTypes#

Possible prediction types

BINARY = 'BINARY_CLASSIFICATION'#

REGRESSION = 'REGRESSION'#

MULTICLASS = 'MULTICLASS'#

OTHER = 'OTHER'#

get_all_possible_algorithm_names()#

Returns the list of possible algorithm names.

This includes the names of algorithms from installed plugins.

This can be used as the list of valid identifiers for set_algorithm_enabled() and get_algorithm_settings().

This includes all possible algorithms, regardless of the prediction kind (regression/classification) or the engine, so some algorithms may be irrelevant to the current task.

Returns:: The list of algorithm names
Return type:: list[str]

get_enabled_algorithm_names()#

Gets the list of enabled algorithm names

Returns:: The list of enabled algorithm names
Return type:: list[str]

get_algorithm_settings(algorithm_name)#

Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.

This method returns the settings for this algorithm as an PredictionAlgorithmSettings (extended dict). All algorithm dicts have at least an “enabled” property/key in the settings. The “enabled” property/key indicates whether this algorithm will be trained.

Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise properties/keys for each algorithm are not all documented. You can print the returned AlgorithmSettings to learn more about the settings of each particular algorithm.

Please refer to the documentation for details on available algorithms.

Parameters:: algorithm_name (str) – Name of the algorithm, in uppercase.
Returns:: A PredictionAlgorithmSettings (extended dict) for one of the built-in prediction algorithms
Return type:: PredictionAlgorithmSettings

split_ordered_by(feature_name, ascending=True)#: Deprecated. Use split_params.set_time_ordering()

remove_ordered_split()#: Deprecated. Use split_params.unset_time_ordering()

use_sample_weighting(feature_name)#: Deprecated. use set_weighting()

set_weighting(method, feature_name=None)#

Sets the method for weighting samples.

If there was a WEIGHT feature declared previously, it will be set back as an INPUT feature first.

Parameters:

method (str) – Weighting nethod to use. One of NO_WEIGHTING, SAMPLE_WEIGHT (requires a feature name), CLASS_WEIGHT or CLASS_AND_SAMPLE_WEIGHT (requires a feature name)
feature_name (str, optional) – Name of the feature to use as sample weight

remove_sample_weighting()#: Deprecated. Use set_weighting(method=”NO_WEIGHTING”) instead

get_assertions_params()#

Retrieves the ML Task assertion parameters

Returns:: The assertions parameters for this ML task
Return type:: DSSMLAssertionsParams

get_hyperparameter_search_settings()#

Gets the hyperparameter search parameters of the current DSSPredictionMLTaskSettings instance as a HyperparameterSearchSettings object. This object can be used to both get and set properties relevant to hyperparameter search, such as search strategy, cross-validation method, execution limits and parallelism.

Returns:: A HyperparameterSearchSettings
Return type:: HyperparameterSearchSettings

get_prediction_type()#

get_split_params()#

Gets a handle to modify train/test splitting params.

Return type:: PredictionSplitParamsHandler

property split_params#: Deprecated. Use get_split_params()

class dataikuapi.dss.ml.DSSClusteringMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#

get_algorithm_settings(algorithm_name)#

Gets the training settings of a particular algorithm.

This returns a reference to the algorithm’s settings, rather than a copy, so any changes made to the returned object will be reflected when saving.

This method returns a dictionary of the settings for this algorithm. All algorithm dicts contain an “enabled” key, which indicates whether this algorithm will be trained

Other settings are algorithm-dependent and include the various hyperparameters of the algorithm. The precise keys for each algorithm are not all documented. You can print the returned dictionary to learn more about the settings of each particular algorithm.

Please refer to the documentation for details on available algorithms.

Parameters:: algorithm_name (str) – Name of the algorithm, in uppercase.
Returns:: A dict containing the settings for the algorithm
Return type:: dict

class dataikuapi.dss.ml.DSSTimeseriesForecastingMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#

class PredictionTypes#

Possible prediction types

TIMESERIES_FORECAST = 'TIMESERIES_FORECAST'#

get_time_step_params()#

Gets the time step parameters for the time series forecasting task.

This returns a reference to the time step parameters, rather than a copy, so any changes made to the returned object will be reflected when saving.

Returns:: A dict of the time step parameters
Return type:: dict

set_time_step(time_unit=None, n_time_units=None, end_of_week_day=None, reguess=True, update_algorithm_settings=True, unit_alignment=None, monthly_alignment=None)#

Sets the time step parameters for the time series forecasting task.

Parameters:

time_unit (str, optional) – time unit for forecasting step. Valid values are: MILLISECOND, SECOND, MINUTE, HOUR, DAY, BUSINESS_DAY, WEEK, MONTH, QUARTER, HALF_YEAR, YEAR (defaults to None, i.e. don’t change)
n_time_units (int, optional) – number of time units within a time step (defaults to None, i.e. don’t change)
end_of_week_day (int, optional) – only useful for the WEEK time unit. Valid values are: 1 (Sunday), 2 (Monday), …, 7 (Saturday) (defaults to None, i.e. don’t change)
reguess (bool) – Whether to reguess the ML task settings after changing the time step params (defaults to True)
update_algorithm_settings (bool) – Whether the algorithm settings should also be reguessed if reguessing the ML Task (defaults to True)
unit_alignment (int, optional) – for each step, month index (between 1 and 3) when time_unit is QUARTER, month index (between 1 and 6) when time_unit is HALF_YEAR, month number (between 1 and 12) when time_unit is YEAR (defaults to None, i.e. don’t change)
monthly_alignment (int, optional) – for each step, day of the month (between 1 and 31) when time_unit is MONTH, QUARTER, HALF_YEAR or YEAR (defaults to None, i.e. don’t change)

get_resampling_params()#

Gets the time series resampling parameters for the time series forecasting task.

This returns a reference to the time series resampling parameters, rather than a copy, so any changes made to the returned object will be reflected when saving.

Returns:: A dict of the resampling parameters
Return type:: dict

set_numerical_interpolation(method=None, constant=None)#

Sets the time series resampling numerical interpolation parameters

Parameters:

method (str, optional) – Interpolation method. Valid values are: NEAREST, PREVIOUS, NEXT, LINEAR, QUADRATIC, CUBIC, CONSTANT, STAIRCASE (defaults to None, i.e. don’t change)
constant (float, optional) – Value for the CONSTANT interpolation method (defaults to None, i.e. don’t change)

set_numerical_extrapolation(method=None, constant=None)#

Sets the time series resampling numerical extrapolation parameters

Parameters:

method (str, optional) – Extrapolation method. Valid values are: PREVIOUS_NEXT, NO_EXTRAPOLATION, CONSTANT, LINEAR, QUADRATIC, CUBIC (defaults to None, i.e. don’t change)
constant (float, optional) – Value for the CONSTANT extrapolation method (defaults to None)

set_resampling_dates(start_date_mode='AUTO', custom_start_date=None, end_date_mode='AUTO', custom_end_date=None)#

Sets dates to use for resampling the time series

Parameters:

start_date_mode (string, optional) – either “AUTO” to use the oldest known timestamp or “CUSTOM” to use the date set with custom_start_date (defaults to AUTO)
custom_start_date (datetime.date, optional) – start date to use for resampling (defaults to None)
end_date_mode (string, optional) – either “AUTO” to use the newest known timestamp or “CUSTOM” to use the date set with custom_end_date (defaults to AUTO)
custom_end_date (datetime.date, optional) – end date to use for resampling (defaults to None)

set_categorical_imputation(method=None, constant=None)#

Sets the time series resampling categorical imputation parameters

Parameters:

method (str, optional) – Imputation method. Valid values are: MOST_COMMON, NULL, CONSTANT, PREVIOUS_NEXT, PREVIOUS, NEXT (defaults to None, i.e. don’t change)
constant (str, optional) – Value for the CONSTANT imputation method (defaults to None, i.e. don’t change)

set_duplicate_timestamp_handling(method)#

Sets the time series duplicate timestamp handling method

Parameters:: method (str) – Duplicate timestamp handling method. Valid values are: FAIL_IF_CONFLICTING, DROP_IF_CONFLICTING, MEAN_MODE.

property forecast_horizon#

Returns:: Number of time steps to be forecast
Return type:: int

set_forecast_horizon(forecast_horizon, reguess=True, update_algorithm_settings=True, validation_horizons=None)#

Sets the time series forecast horizon

Parameters:

forecast_horizon (int) – Number of time steps to be forecast
reguess (bool) – Whether to reguess the ML task settings after changing the forecast horizon (defaults to True)
update_algorithm_settings (bool) – Whether the algorithm settings should be reguessed after the forecast horizon (defaults to True)
validation_horizons (int|None) – The number of validation horizons to be set. If omitted, retains the previous ratio.

property evaluation_gap#

Returns:: Number of skipped time steps for evaluation
Return type:: int

property time_variable#

Returns:: Feature used as time variable (read-only)
Return type:: str

property timeseries_identifiers#

Returns:: Features used as time series identifiers (read-only copy)
Return type:: list

property quantiles_to_forecast#

Returns:: List of quantiles to forecast
Return type:: list

property skip_too_short_timeseries_for_training#

Returns:: Whether we skip too short time series during training, or fail the whole training when only one time series is too short.
Return type:: bool

get_algorithm_settings(algorithm_name)#

Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.

Please refer to the documentation for details on available algorithms.

Parameters:: algorithm_name (str) – Name of the algorithm, in uppercase.
Returns:: A PredictionAlgorithmSettings (extended dict) for one of the built-in prediction algorithms
Return type:: PredictionAlgorithmSettings

get_assertions_params()#

Retrieves the assertions parameters for this ml task

Return type:: DSSMLAssertionsParams

get_hyperparameter_search_settings()#

Returns:: A HyperparameterSearchSettings
Return type:: HyperparameterSearchSettings

get_prediction_type()#

get_split_params()#

Gets a handle to modify train/test splitting params.

Return type:: PredictionSplitParamsHandler

property split_params#: Deprecated. Use get_split_params()

class dataikuapi.dss.ml.PredictionSplitParamsHandler(mltask_settings)#

Object to modify the train/test dataset splitting params.

Important

Do not create this class directly, use DSSMLTaskSettings.get_split_params()

SPLIT_PARAMS_KEY = 'splitParams'#

get_raw()#

Gets the raw settings of the prediction split configuration.

This returns a reference to the raw settings, rather than a copy, so any changes made to the returned object will be reflected when saving.

Returns:: The raw prediction split parameter settings
Return type:: dict

set_split_random(train_ratio=0.8, selection=None, dataset_name=None)#

Sets the train/test split mode to random splitting over an extract from a single dataset

Parameters:

train_ratio (float) – Ratio of rows to use for the train set. Must be between 0 and 1 (defaults to 0.8)
selection (Union[dataikuapi.dss.utils.DSSDatasetSelectionBuilder, dict], optional) – Optional builder or dict defining the settings of the extract from the dataset (defaults to None). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
dataset_name (str, optional) – Name of the dataset to split on. If None (default), uses the main dataset used to create the visual analysis

set_split_kfold(n_folds=5, selection=None, dataset_name=None)#

Sets the train/test split mode to k-fold splitting over an extract from a single dataset

Parameters:

n_folds (int) – number of folds. Must be greater than 0 (defaults to 5)
selection (Union[DSSDatasetSelectionBuilder, dict], optional) – Optional builder or dict defining the settings of the extract from the dataset (defaults to None) A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
dataset_name (str, optional) – Name of the dataset to split on. If None (default), uses the main dataset used to create the visual analysis

set_split_explicit(train_selection, test_selection, dataset_name=None, test_dataset_name=None, train_filter=None, test_filter=None)#

Sets the train/test split to an explicit extract from one or two dataset(s)

Parameters:

train_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the train dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
test_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the test dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
dataset_name (str, optional) – Name of the dataset to split on. If None (default), uses the main dataset used to create the visual analysis
test_dataset_name (str, optional) – Optional name of a second dataset to use for the test data extract. If None (default), both extracts are done from dataset_name
train_filter (Union[DSSFilterBuilder, dict], optional) – Builder or dict defining the settings of the filter for the train dataset. Defaults to None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSFilterBuilder.build()
test_filter (Union[DSSFilterBuilder, dict], optional) – Builder or dict defining the settings of the filter for the test dataset. Defaults to None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSFilterBuilder.build()

set_time_ordering(feature_name, ascending=True)#

Enables time based ordering and sets the feature upon which to sort the train/test split and hyperparameter optimization data by time.

Parameters:

feature_name (str) – The name of the feature column to use. This feature must be present in the output of the preparation steps of the analysis. When there are no preparation steps, it means this feature must be present in the analyzed dataset.
ascending (bool) – True (default) means the test set is expected to have larger time values than the train set

unset_time_ordering()#: Disables time-based ordering for train/test split and hyperparameter optimization

has_time_ordering()#

Returns:: True if the split uses time-based ordering
Return type:: bool

get_time_ordering_variable()#

Returns:: If enabled, the name of the ordering variable for time based ordering (the feature name). Returns None if time based ordering is not enabled.
Return type:: Union[str, None]

is_time_ordering_ascending()#

Returns:: True if the time-ordering is set to sort in ascending order. Returns None if time based ordering is not enabled.
Return type:: Union[bool, None]

Exploration of results#

class dataikuapi.dss.ml.DSSTrainedPredictionModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)#

Object to read details of a trained prediction model

Important

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

get_roc_curve_data()#

Gets the data used to plot the ROC curve for the model, if it exists.

Returns:: A dictionary containing ROC curve data

get_performance_metrics()#

Returns all performance metrics for this model.

For binary classification model, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.

To get access to the per-threshold values, use the following:

# Returns a list of tested threshold values
details.get_performance()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance()["perCutData"]["f1"]
# Both lists have the same length

If K-Fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation accross the K cross-tested folds. For example, “auc” will be accompanied with “aucstd”

Returns:: a dict of performance metrics values
Return type:: dict

get_assertions_metrics()#

Retrieves assertions metrics computed for this trained model

Returns:: an object representing assertion metrics
Return type:: DSSMLAssertionsMetrics

get_hyperparameter_search_points()#

Gets the list of points in the hyperparameter search space that have been tested.

Returns a list of dict. Each entry in the list represents a point.

For each point, the dict contains at least:

“score”: the average value of the optimization metric over all the folds at this point
“params”: a dict of the parameters at this point. This dict has the same structure as the params of the best parameters

get_preprocessing_settings()#

Gets the preprocessing settings that were used to train this model

Return type:: dict

get_modeling_settings()#

Gets the modeling (algorithms) settings that were used to train this model.

Note

The structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms).

Return type:: dict

get_actual_modeling_params()#

Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.

Returns:: A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
Return type:: dict

get_trees()#

Gets the trees in the model (for tree-based models)

Returns:: a DSSTreeSet object to interact with the trees
Return type:: dataikuapi.dss.ml.DSSTreeSet

get_coefficient_paths()#

Gets the coefficient paths for Lasso models

Returns:: a DSSCoefficientPaths object to interact with the coefficient paths
Return type:: dataikuapi.dss.ml.DSSCoefficientPaths

get_scoring_jar_stream(model_class='model.Model', include_libs=False)#

Returns a stream of a scoring jar for this trained model.

This works provided that you have the license to do so and that the model is compatible with optimized scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.

Parameters:

model_class (str) – fully-qualified class name, e.g. “com.company.project.Model”
include_libs (bool) – if True, also packs the required dependencies; if False, runtime will require the scoring libs given by DSSClient.scoring_libs()

Returns:

a jar file, as a stream

Return type:

file-like

get_scoring_pmml_stream()#

Returns a stream of a scoring PMML for this trained model.

This works provided that you have the license to do so and that the model is compatible with PMML scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.

Returns:: a PMML file, as a stream
Return type:: file-like

get_scoring_python_stream()#

Returns a stream of a zip file containing the required data to use this trained model in external python code.

See: https://doc.dataiku.com/dss/latest/python-api/ml.html

This works provided that you have the license to do so and that the model is compatible with Python scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.

Returns:: an archive file, as a stream
Return type:: file-like

get_scoring_python(filename)#

Downloads a zip file containing the required data to use this trained model in external python code.

See: https://doc.dataiku.com/dss/latest/python-api/ml.html

This works provided that you have the license to do so and that the model is compatible with Python scoring.

Parameters:: filename (str) – filename of the resulting downloaded file

get_scoring_mlflow_stream(use_original_model=False)#

Returns a stream of a zip containing this trained model using the MLflow Model format.

This works provided that you have the license to do so and that the model is compatible with MLflow scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.

Parameters:: use_original_model (bool, optional) – Works only if the model was originally imported from MLflow. Set to True if you want to get the original MLflow model and to False if you want DSS to regenerate a new MLflow model. Defaults to False.
Returns:: an archive file, as a stream
Return type:: file-like

get_scoring_mlflow(filename, use_original_model=False)#

Downloads a zip containing data for this trained model, using the MLflow Model format.

This works provided that you have the license to do so and that the model is compatible with MLflow scoring.

Parameters:

filename (str) – filename to the resulting MLflow Model zip
use_original_model (bool, optional) – Works only if the model was originally imported from MLflow. Set to True if you want to get the original MLflow model and to False if you want DSS to regenerate a new MLflow model. Defaults to False.

export_to_snowflake_function(connection_name, function_name, wait=True)#: Exports the model to a Snowflake function. Only works for Saved Model Versions. :param connection_name: Snowflake connection to use :param function_name: Name of the function to create :param wait: a flag to wait for the operation to complete (defaults to True) :return: None if wait is True, else a future

export_to_databricks_registry(connection_name, use_unity_catalog, model_name, experiment_name, wait=True)#

Exports the model as a version of a Registered Model of a Databricks Registry. To do so, the model is exported to the MLflow format, then logged: in a run of an experiment, and finally registered in the selected registry.

Parameters:

connection_name – Databricks Model Deployment Infrastructure connection to use
use_unity_catalog – exports to a model in the Databricks Workspace registry or in the Databricks Unity Catalog
model_name – name of the model to add a version to. Restrictions apply on possible name ; please refer to Databricks documentation. The model will be created if needed.
experiment_name – name of the experiment to use. The experiment will be created if needed.
wait – a flag to wait for the operation to complete (defaults to True)

Returns:

dict if wait is True, else a future

compute_shapley_feature_importance()#

Launches computation of Shapley feature importance for this trained model

Returns:: A future for the computation task
Return type:: dataikuapi.dss.future.DSSFuture

compute_subpopulation_analyses(split_by, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)#

Launch computation of Subpopulation analyses for this trained model.

Parameters:

split_by (list|str) – column(s) on which subpopulation analyses are to be computed (one analysis per column)
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training.
debug_mode (bool) – if True, output all logs (slower)

Returns:

if wait is True, an object containing the Subpopulation analyses, else a future to wait on the result

Return type:

Union[dataikuapi.dss.ml.DSSSubpopulationAnalyses, dataikuapi.dss.future.DSSFuture]

get_subpopulation_analyses()#

Retrieve all subpopulation analyses computed for this trained model

Returns:: The subpopulation analyses
Return type:: dataikuapi.dss.ml.DSSSubpopulationAnalyses

compute_partial_dependencies(features, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)#

Launch computation of Partial dependencies for this trained model.

Parameters:

features (list|str) – feature(s) on which partial dependencies are to be computed
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training.
debug_mode (bool) – if True, output all logs (slower)

Returns:

if wait is True, an object containing the Partial dependencies, else a future to wait on the result

Return type:

Union[dataikuapi.dss.ml.DSSPartialDependencies, dataikuapi.dss.future.DSSFuture]

get_partial_dependencies()#

Retrieve all partial dependencies computed for this trained model

Returns:: The partial dependencies
Return type:: dataikuapi.dss.ml.DSSPartialDependencies

download_documentation_stream(export_id)#

Download a model documentation, as a binary stream.

Warning: this stream will monopolize the DSSClient until closed.

Parameters:: export_id – the id of the generated model documentation returned as the result of the future
Returns:: A DSSFuture representing the model document generation process

download_documentation_to_file(export_id, path)#

Download a model documentation into the given output file.

Parameters:

export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation

Returns:

None

property full_id#

generate_documentation(folder_id=None, path=None)#

Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.

Parameters:

folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder

Returns:

A DSSFuture representing the model document generation process

generate_documentation_from_custom_template(fp)#

Start the model document generation from a docx template (as a file object).

Parameters:: fp (object) – A file-like object pointing to a template docx file
Returns:: A DSSFuture representing the model document generation process

get_diagnostics()#

Retrieves diagnostics computed for this trained model

Returns:: list of diagnostics
Return type:: list of type dataikuapi.dss.ml.DSSMLDiagnostic

get_origin_analysis_trained_model()#

Fetch details about the model in an analysis, this model has been exported from. Returns None if the deployed trained model does not have an origin analysis trained model.

Return type:: DSSTrainedModelDetails | None

get_raw()#: Gets the raw dictionary of trained model details

get_raw_snippet()#: Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.

get_train_info()#

Returns various information about the training process (size of the train set, quick description, timing information)

Return type:: dict

get_user_meta()#: Gets the user-accessible metadata (name, description, cluster labels, classification threshold) Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta()

save_user_meta()#

class dataikuapi.dss.ml.DSSTrainedClusteringModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)#

Object to read details of a trained clustering model

Important

Do not create this class directly, use DSSMLTask.get_trained_model_details()

get_raw()#

Gets the raw dictionary of trained model details

Returns:: A dictionary containing the trained model details
Return type:: dict

get_train_info()#

Gets various information about the training process.

This includes information such as the size of the train set, the quick description and timing information etc.

Returns:: A dictionary containing the models training information
Return type:: dict

get_facts()#

Gets the ‘cluster facts’ data.

The cluster facts data is the structure behind the screen “for cluster X, average of Y is Z times higher than average”.

Returns:: The clustering facts data
Return type:: DSSClustersFacts

get_performance_metrics()#

Returns all performance metrics for this clustering model.

Returns:: A dict of performance metrics values
Return type:: dict

download_documentation_stream(export_id)#

Download a model documentation, as a binary stream.

Warning: this stream will monopolize the DSSClient until closed.

Parameters:: export_id – the id of the generated model documentation returned as the result of the future
Returns:: A DSSFuture representing the model document generation process

download_documentation_to_file(export_id, path)#

Download a model documentation into the given output file.

Parameters:

export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation

Returns:

None

property full_id#

generate_documentation(folder_id=None, path=None)#

Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.

Parameters:

folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder

Returns:

A DSSFuture representing the model document generation process

generate_documentation_from_custom_template(fp)#

Start the model document generation from a docx template (as a file object).

Parameters:: fp (object) – A file-like object pointing to a template docx file
Returns:: A DSSFuture representing the model document generation process

get_diagnostics()#

Retrieves diagnostics computed for this trained model

Returns:: list of diagnostics
Return type:: list of type dataikuapi.dss.ml.DSSMLDiagnostic

get_origin_analysis_trained_model()#

Fetch details about the model in an analysis, this model has been exported from. Returns None if the deployed trained model does not have an origin analysis trained model.

Return type:: DSSTrainedModelDetails | None

get_preprocessing_settings()#

Gets the preprocessing settings that were used to train this model

Returns:: The model preprocessing settings
Return type:: dict

get_raw_snippet()#: Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.

get_user_meta()#: Gets the user-accessible metadata (name, description, cluster labels, classification threshold) Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta()

save_user_meta()#

get_modeling_settings()#

Gets the modeling (algorithms) settings that were used to train this model.

Note

The structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms).

Returns:: The model modeling settings
Return type:: dict

get_actual_modeling_params()#

Gets the actual / resolved parameters that were used to train this model.

Returns:: A dictionary, which contains at least a “resolved” key
Return type:: dict

get_scatter_plots()#

Gets the cluster scatter plot data

Returns:: a DSSScatterPlots object to interact with the scatter plots
Return type:: dataikuapi.dss.ml.DSSScatterPlots

class dataikuapi.dss.ml.DSSTrainedTimeseriesForecastingModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)#

Object to read details of a timeseries forecasting model, for instance the per time series metrics

Important

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

compute_residuals(wait=True)#

Launch computation of residuals for this trained timeseries model.

Parameters:: wait – a flag to wait for the operation to complete (defaults to True)
Returns:: if wait is True, a dictionary containing the residuals per-timeseries, else a future to wait on the result
Return type:: Union[dict, dataikuapi.dss.future.DSSFuture]

get_residuals()#

Retrieve a list of residuals for this trained time-series models

Returns:: A dictionary, which contains a residuals object per-timeseries
Return type:: dict

get_per_timeseries_metrics()#

Returns per timeseries performance metrics for this model.

Returns:: a dict of performance metrics values
Return type:: dict

get_per_timeseries_evaluation_forecasts()#

Returns per timeseries evaluation forecasts for this model.

Returns:: a dict of evaluation forecasts per timeseries identifier
Return type:: dict

Saved models#

class dataikuapi.dss.savedmodel.DSSSavedModel(client, project_key, sm_id)#

Handle to interact with a saved model on the DSS instance.

Important

Do not create this class directly, instead use dataikuapi.dss.DSSProject.get_saved_model()

Parameters:

client (dataikuapi.dssclient.DSSClient) – an api client to connect to the DSS backend
project_key (str) – identifier of the project containing the model
sm_id (str) – identifier of the saved model

property id#

Returns the identifier of the saved model

Return type:: str

get_settings()#

Returns the settings of this saved model.

Returns:: settings of this saved model
Return type:: dataikuapi.dss.savedmodel.DSSSavedModelSettings

list_versions()#

Gets the versions of this saved model.

This returns each version as a dict of object. Each object contains at least an “id” parameter, which can be passed to get_metric_values(), get_version_details() and set_active_version().

Returns:: The list of the versions
Return type:: list[dict]

get_active_version()#

Gets the active version of this saved model.

The returned dict contains at least an “id” parameter, which can be passed to get_metric_values(), get_version_details() and set_active_version().

Returns:: A dict representing the active version or None if no version is active.
Return type:: Union[dict, None]

get_version_details(version_id)#

Gets details for a version of a saved model

Parameters:: version_id (str) – identifier of the version, as returned by list_versions()
Returns:: details of this trained model
Return type:: dataikuapi.dss.ml.DSSTrainedPredictionModelDetails

set_active_version(version_id)#

Sets a particular version of the saved model as the active one.

Parameters:: version_id (str) – Identifier of the version, as returned by list_versions()

delete_versions(versions, remove_intermediate=True)#

Deletes version(s) of the saved model.

Parameters:

versions (list[str]) – list of versions to delete
remove_intermediate (bool) – If True, also removes intermediate versions. In the case of a partitioned model, an intermediate version is created every time a partition has finished training. (defaults to True)

get_origin_ml_task()#

Fetches the last ML task that has been exported to this saved model.

Returns:: origin ML task or None if the saved model does not have an origin ml task
Return type:: Union[dataikuapi.dss.ml.DSSMLTask, None]

import_mlflow_version_from_path(version_id, path, code_env_name='LOCAL-CODE-ENV', container_exec_config_name='LOCAL-CONFIG', set_active=True, binary_classification_threshold=0.5)#

Creates a new version for this saved model from a path containing a MLFlow model.

Important

Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model().

Parameters:

version_id (str) – identifier of the version, as returned by list_versions()
path (str) – absolute path on the local filesystem - must be a folder, and must contain a MLFlow model
code_env_name (str) –
Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks. * If value is “LOCAL-CODE-ENV”, the active code env will be used. If no env is found, fallback to “INHERIT”. * If value is “INHERIT”, the default active code env of the project will be used.

(defaults to LOCAL-CODE-ENV)
container_exec_config_name (str) –
Name of the containerized execution configuration to use for reading the metadata of the model * If value is “LOCAL-CONFIG”, set the variable to the active containerized configuration name. If running locally, set it to “NONE”. If the config name can’t be determined, fallback to “INHERIT”. * If value is “INHERIT”, the container execution configuration of the project will be used. * If value is “NONE”, local execution will be used (no container)

(defaults to LOCAL-CONFIG)
set_active (bool) – sets this new version as the active version of the saved model (defaults to True)
binary_classification_threshold (float) – for binary classification, defines the actual threshold for the imported version (defaults to 0.5)

Returns:

external model version handler in order to interact with the new MLFlow model version

Return type:

dataikuapi.dss.savedmodel.ExternalModelVersionHandler

import_mlflow_version_from_managed_folder(version_id, managed_folder, path, code_env_name='LOCAL-CODE-ENV', container_exec_config_name='LOCAL-CONFIG', set_active=True, binary_classification_threshold=0.5)#

Creates a new version for this saved model from a managed folder.

Important

Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model().

Parameters:

version_id (str) – identifier of the version, as returned by list_versions()
managed_folder (dataikuapi.dss.managedfolder.DSSManagedFolder or str) – managed folder, or identifier of the managed folder
path (str) – path of the MLflow folder in the managed folder
code_env_name (str) –
Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks. * If value is “LOCAL-CODE-ENV”, the active code env will be used. If no env is found, fallback to “INHERIT”. * If value is “INHERIT”, the default active code env of the project will be used.

(defaults to LOCAL-CODE-ENV)
container_exec_config_name (str) –
Name of the containerized execution configuration to use for reading the metadata of the model * If value is “LOCAL-CONFIG”, set the variable to the active containerized configuration name. If running locally, set it to “NONE”. If the config name can’t be determined, fallback to “INHERIT”. * If value is “INHERIT”, the container execution configuration of the project will be used. * If value is “NONE”, local execution will be used (no container)

(defaults to LOCAL-CONFIG)
set_active (bool) – sets this new version as the active version of the saved model (defaults to True)
binary_classification_threshold (float) – for binary classification, defines the actual threshold for the imported version (defaults to 0.5)

Returns:

external model version handler in order to interact with the new MLFlow model version

Return type:

dataikuapi.dss.savedmodel.ExternalModelVersionHandler

import_mlflow_version_from_databricks(version_id, connection_name, use_unity_catalog, model_name, model_version, code_env_name='LOCAL-CODE-ENV', container_exec_config_name='LOCAL-CONFIG', set_active=True, binary_classification_threshold=0.5)#

create_external_model_version(version_id, configuration, target_column_name=None, class_labels=None, set_active=True, binary_classification_threshold=0.5, input_dataset=None, selection=None, use_optimal_threshold=True, skip_expensive_reports=True, features_list=None, container_exec_config_name='LOCAL-CONFIG', input_format='GUESS', output_format='GUESS', evaluate=True)#

Creates a new version of an external model.

Important

Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_external_model().

Parameters:

version_id (str) – Identifier of the version, as returned by list_versions()

configuration (dict) –

A dictionary containing the desired saved model version configuration.

For SageMaker, syntax is:

configuration = {
    "protocol": "sagemaker",
    "endpoint_name": "<endpoint-name>"
}

For AzureML, syntax is:

configuration = {
    "protocol": "azure-ml",
    "endpoint_name": "<endpoint-name>"
}

For Vertex AI, syntax is:

configuration = {
    "protocol": "vertex-ai",
    "endpoint_id": "<endpoint-id>"
}

For Databricks, syntax is:

configuration = {
    "protocol": "databricks",
    "endpointName": "<endpoint-id>"
}

target_column_name (str) – Name of the target column. Mandatory if model performance will be evaluated
class_labels (list or None) – List of strings, ordered class labels. Mandatory for evaluation of classification models
set_active (bool) – (optional) Sets this new version as the active version of the saved model (defaults to True)
binary_classification_threshold (float) – (optional) For binary classification, defines the actual threshold for the imported version (defaults to 0.5). Overwritten during evaluation if an evaluation dataset is specified and use_optimal_threshold is True
input_dataset (str or dataikuapi.dss.dataset.DSSDataset or dataiku.Dataset) – (mandatory if either evaluate=True, input_format=GUESS, output_format=GUESS, features_list is None) Dataset to use to infer the features names and types (if features_list is not set), evaluate the model, populate interpretation tabs, and guess input/output formats (if input_format=GUESS or output_format=GUESS).
selection (dict or DSSDatasetSelectionBuilder or None) –
(optional) Sampling parameter for input_dataset during evaluation.
- Example 1: head 100 lines DSSDatasetSelectionBuilder().with_head_sampling(100)
- Example 2: random 500 lines DSSDatasetSelectionBuilder().with_random_fixed_nb_sampling(500)
- Example 3: head 100 lines {"samplingMethod": "HEAD_SEQUENTIAL", "maxRecords": 100}
Defaults to head 100 lines
use_optimal_threshold (bool) – (optional) Set as threshold for this model version the threshold that has been computed during evaluation according to the metric set on the saved model setting (i.e. prediction_metrics_settings['thresholdOptimizationMetric'])
skip_expensive_reports (bool) – (optional) Skip computation of expensive/slow reports (e.g. feature importance).
features_list (list of {"name": "feature_name", "type": "feature_type"} or None) – (optional) List of features, in JSON. Used if input_dataset is not defined
container_exec_config_name (str) – (optional) name of the containerized execution configuration to use for running the evaluation process. * If value is “LOCAL-CONFIG”, set the variable to the active containerized configuration name. If running locally, set it to “NONE”. If the config name can’t be determined, fallback to “INHERIT”. * If value is “INHERIT”, the container execution configuration of the project will be used. * If value is “NONE”, local execution will be used (no container) (defaults to LOCAL-CONFIG)
input_format (str) –
(optional) Input format to use when querying the underlying endpoint. For the ‘azure-ml’ and ‘sagemaker’ protocols, this option must be set if input_dataset is not set. Supported values are:
- For all protocols:
  - GUESS (default): Guess the input format by cycling through supported input formats and making requests using data from input_dataset.
- For Amazon SageMaker:
  - INPUT_SAGEMAKER_CSV
  - INPUT_SAGEMAKER_JSON
  - INPUT_SAGEMAKER_JSON_EXTENDED
  - INPUT_SAGEMAKER_JSONLINES
  - INPUT_DEPLOY_ANYWHERE_ROW_ORIENTED_JSON
- For Vertex AI:
  - INPUT_VERTEX_DEFAULT
- For Azure Machine Learning:
  - INPUT_AZUREML_JSON_INPUTDATA
  - INPUT_AZUREML_JSON_WRITER
  - INPUT_AZUREML_JSON_INPUTDATA_DATA
  - INPUT_DEPLOY_ANYWHERE_ROW_ORIENTED_JSON
- For Databricks:
  - INPUT_RECORD_ORIENTED_JSON
  - INPUT_SPLIT_ORIENTED_JSON
  - INPUT_TF_INPUTS_JSON
  - INPUT_TF_INSTANCES_JSON
  - INPUT_DATABRICKS_CSV
output_format (str) –
(optional) Output format to use to parse the underlying endpoint’s response. For the ‘azure-ml’ and ‘sagemaker’ protocols, this option must be set if input_dataset is not set. Supported values are:
- For all protocols:
  - GUESS (default): Guess the output format by cycling through supported output formats and making requests using data from input_dataset.
- For Amazon SageMaker:
  - OUTPUT_SAGEMAKER_CSV
  - OUTPUT_SAGEMAKER_ARRAY_AS_STRING
  - OUTPUT_SAGEMAKER_JSON
  - OUTPUT_DEPLOY_ANYWHERE_JSON
- For Vertex AI:
  - OUTPUT_VERTEX_DEFAULT
- For Azure Machine Learning:
  - OUTPUT_AZUREML_JSON_OBJECT
  - OUTPUT_AZUREML_JSON_ARRAY
  - OUTPUT_DEPLOY_ANYWHERE_JSON
- For Databricks:
  - OUTPUT_DATABRICKS_JSON
evaluate (bool) – (optional) True (default) if this model should be evaluated using input_dataset, False to disable evaluation.

Example: create a SageMaker Saved Model and add an endpoint as a version, evaluated on a dataset:

import dataiku
client = dataiku.api_client()
project = client.get_default_project()
# create a SageMaker saved model, whose endpoints are hosted in region eu-west-1
sm = project.create_external_model("SaveMaker External Model", "BINARY_CLASSIFICATION", {"protocol": "sagemaker", "region": "eu-west-1"})

# configuration to add endpoint
configuration = {
  "protocol": "sagemaker",
  "endpoint_name": "titanic-survived-endpoint"
}
smv = sm.create_external_model_version("v0",
                                  configuration,
                                  target_column_name="Survived",
                                  class_labels=["0", "1"],
                                  input_dataset="evaluation_dataset")

A dataset named “evaluation_dataset” must exist in the current project. Its schema and content should match the endpoint expectations. Depending on the way the model deployed on the endpoint was created, it may require a certain schema and not accept extra columns, it may not deal with missing features, etc.

Example: create a Vertex AI Saved Model and add an endpoint as a version, without evaluating it:

import dataiku
client = dataiku.api_client()
project = client.get_default_project()
# create a VertexAI saved model, whose endpoints are hosted in region europe-west-1
sm = project.create_external_model("Vertex AI Proxy Model", "BINARY_CLASSIFICATION", {"protocol":"vertex-ai", "region":"europe-west1"})
configuration = {
    "protocol":"vertex-ai",
    "project_id": "my-project",
    "endpoint_id": "123456789012345678"
}

smv = sm.create_external_model_version("v1",
                                    configuration,
                                    target_column_name="Survived",
                                    class_labels=["0", "1"],
                                    input_dataset="titanic")

A dataset named “my_dataset” must exist in the current project. It will be used to infer the schema of the data to submit to the endpoint. As there is no evaluation dataset specified, the interpretation tabs of this model version will be for the most empty. But the model still can be used to score datasets. It can also be evaluated on a dataset by an Evaluation Recipe.

Example: create an AzureML Saved Model

import dataiku
client = dataiku.api_client()
project = client.get_default_project()
# create an Azure ML saved model. No region specified, as this notion does not exist for Azure ML
sm = project.create_external_model("Azure ML Proxy Model", "BINARY_CLASSIFICATION", {"protocol": "azure-ml"})
configuration = {
    "protocol": "azure-ml",
    "subscription_id": "<subscription-id>>",
    "resource_group": "<your.resource.group-rg>",
    "workspace": "<your-workspace>",
    "endpoint_name": "<endpoint-name>"
}

features_list = [{'name': 'Pclass', 'type': 'bigint'},
                 {'name': 'Age', 'type': 'double'},
                 {'name': 'SibSp', 'type': 'bigint'},
                 {'name': 'Parch', 'type': 'bigint'},
                 {'name': 'Fare', 'type': 'double'}]


smv = sm.create_external_model_version("20230324-in-prod",
                                    configuration,
                                    target_column_name="Survived",
                                    class_labels=["0", "1"],
                                    features_list=features_list)

Example: minimalistic creation of a VertexAI model binary classification model

import dataiku
client = dataiku.api_client()
project = client.get_default_project()

sm = project.create_external_model("Raw Vertex AI Proxy Model", "BINARY_CLASSIFICATION", {"protocol": "vertex-ai", "region": "europe-west1"})
configuration = {
    "protocol": "vertex-ai",
    "project_id": "my-project",
    "endpoint_id": "123456789012345678"
}

smv = sm.create_external_model_version("legacy-model",
                                    configuration,
                                    class_labels=["0", "1"])

This model will have empty interpretation tabs and can not be evaluated later by an Evaluation Recipe, as its target is not defined, but it can be scored.

Example: create a Databricks Saved Model

import dataiku
client = dataiku.api_client()
project = client.get_default_project()

sm = project.create_external_model("Databricks External Model", "BINARY_CLASSIFICATION", {"protocol": "databricks","connection": "db"})

smv = sm.create_external_model_version("vX",
                  {"protocol": "databricks", "endpointName": "<endpoint-name>"},
                  target_column_name="Survived",
                  class_labels=["0", "1"],
                  input_dataset="train_titanic_prepared")

get_external_model_version_handler(version_id)#

Returns a handler to interact with an external model version (MLflow or Proxy model)

Parameters:: version_id (str) – identifier of the version, as returned by list_versions()
Returns:: external model version handler
Return type:: dataikuapi.dss.savedmodel.ExternalModelVersionHandler

get_metric_values(version_id)#

Gets the values of the metrics on the specified version of this saved model

Parameters:: version_id (str) – identifier of the version, as returned by list_versions()
Returns:: a list of metric objects and their value
Return type:: list

get_zone()#

Gets the flow zone of this saved model

Returns:: the saved model’s flow zone
Return type:: dataikuapi.dss.flow.DSSFlowZone

move_to_zone(zone)#

Moves this object to a flow zone

Parameters:: zone (dataikuapi.dss.flow.DSSFlowZone) – flow zone where the object should be moved

share_to_zone(zone)#

Shares this object to a flow zone

Parameters:: zone (dataikuapi.dss.flow.DSSFlowZone) – flow zone where the object should be shared

unshare_from_zone(zone)#

Unshares this object from a flow zone

Parameters:: zone (dataikuapi.dss.flow.DSSFlowZone) – flow zone from which the object shouldn’t be shared

get_usages()#

Gets the recipes referencing this model

Returns:: a list of usages
Return type:: list

get_object_discussions()#

Gets a handle to manage discussions on the saved model

Returns:: the handle to manage discussions
Return type:: dataikuapi.dss.discussion.DSSObjectDiscussions

delete()#: Deletes the saved model

class dataikuapi.dss.savedmodel.DSSSavedModelSettings(saved_model, settings)#

Handle on the settings of a saved model.

Important

Do not create this class directly, instead use dataikuapi.dss.DSSSavedModel.get_settings()

Parameters:

saved_model (dataikuapi.dss.savedmodel.DSSSavedModel) – the saved model object
settings (dict) – the settings of the saved model

get_raw()#

Returns the raw settings of the saved model

Returns:: the raw settings of the saved model
Return type:: dict

property prediction_metrics_settings#

Returns the metrics-related settings

Return type:: dict

property prediction_type#

Returns the type of prediction-related settings

Return type:: str

save()#: Saves the settings of this saved model

class dataiku.core.saved_model.SavedModelVersionMetrics(metrics, version_id)#

Handle to the metrics of a version of a saved model

get_performance_values()#

Retrieve the metrics as a dict.

Return type:: dict

get_computed()#

Get the underlying metrics object.

Return type:: dataiku.core.metrics.ComputedMetrics

class dataiku.Model(lookup, project_key=None, ignore_flow=False)#

Handle to interact with a saved model.

Note

This class is also available as dataiku.Model

Parameters:

lookup (string) – name or identifier of the saved model
project_key (string) – project key of the saved model, if it is not in the current project. (defaults to None, i.e. current project)
ignore_flow (boolean) – if True, create the handle regardless of whether the saved model is an input or output of the recipe (defaults to False)

static list_models(project_key=None)#

Retrieves the list of saved models of the given project.

Parameters:

project_key (str) – key of the project from which to list models. (defaults to None, i.e. current project)

Returns:

a list of the saved models of the project, as dict. Each dict contains at least the following fields:

id: identifier of the saved model
name: name of the saved model
type: type of saved model (CLUSTERING / PREDICTION)
backendType: backend type of the saved model (PY_MEMORY / KERAS / MLLIB / H2O / DEEP_HUB)
versionsCount: number of versions in the saved model

Return type:

list[dict]

get_info()#

Gets the model information.

Returns:

the model information. Fields are:

id : identifier of the saved model
projectKey : project key of the saved model
name : name of the saved model
type: type of saved model (CLUSTERING / PREDICTION)

Return type:

dict

get_id()#

Gets the identifier of the model.

Return type:: str

get_name()#

Gets the name of the model

Return type:: str

get_type()#

Gets the type of the model.

Returns:: the model type (PREDICTION / CLUSTERING)
Return type:: str

get_definition()#

Gets the model definition.

Return type:: dict

list_versions()#

Lists the model versions.

Note

The versionId field can be used to call the activate_version() method.

Returns:

Information about versions of the saved model, as a list of dict. Fields are:

versionId: identifier of the model version
active: whether this version is active or not
snippet: detailed dict containing version information

Return type:

list[dict]

activate_version(version_id)#

Activates the given version of the model.

Parameters:: version_id (str) – the identifier of the version to activate

get_version_metrics(version_id)#

Gets the training metrics of a given version of the model.

Parameters:: version_id (str) – the identifier of the version from which to retrieve metrics
Return type:: dataiku.core.saved_model.SavedModelVersionMetrics

get_version_checks(version_id)#

Gets the training checks of the given version of the model.

Parameters:: version_id (str) – the identifier of the version from which to retrieve checks
Return type:: dataiku.core.metrics.ComputedChecks

save_external_check_values(values_dict, version_id)#

Saves checks on the model, the checks are saved with the type “external”.

Parameters:

values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names
version_id (str) – the identifier of the version for which checks should be saved

Return type:

dict

get_predictor(version_id=None, optimize='BATCH')#

Returns a Predictor for the given version of the model.

Note

This predictor can then be used to preprocess and make predictions on a dataframe.

Parameters:

optimize – If set to LATENCY, attempts (if the model is compatible) to build a predictor optimized for latency (leveraging the dataikuscoring package). In this case, beware that the returned Predictor only supports the predictor.predict(df, with_input_cols=False, with_prediction=True, with_probas=False) method, with no other arguments passed.
version_id (str) – the identifier of the version from which to build the predictor (defaults to None, current active version)

Returns:

The predictor built from the given version of this model

Return type:

Union[dataiku.core.saved_model.BasePredictor, dataiku.core.saved_model.DkuScoringPredictor]

create_finetuned_llm_version(connection_name, quantization=None, set_active=True)#

Creates a new fine-tuned LLM version, using a context manager (experimental)

Upon exit of the context manager, the new model version is made available with the content of the working directory. The model weights must use the safetensors format. This model will be loaded at inference time with trust_remote_code=False.

Simple usage:

with saved_model.create_finetuned_llm_version("MyLocalHuggingfaceConnection") as finetuned_llm_version:
    # write model files to finetuned_llm_version.working_directory
# the new version is now available

Parameters:

connection_name (str) – name of the connection to link this version
quantization (str) – quantization mode, must be one of [None, “Q_4BIT”, “Q_8BIT”] (default: None)
set_active (bool) – if True, set the new version as active for this saved model (default: True)

Returns:

yields a FinetunedLLMVersionTrainingParameters object

MLflow models#

class dataikuapi.dss.savedmodel.ExternalModelVersionHandler(saved_model, version_id)#

Handler to interact with an External model version (MLflow import of Proxy model).

Important

Do not create this class directly, instead use dataikuapi.dss.savedmodel.DSSSavedModel.get_external_model_version_handler()

Parameters:

saved_model (dataikuapi.dss.savedmodel.DSSSavedModel) – the saved model object
version_id (str) – identifier of the version, as returned by dataikuapi.dss.savedmodel.DSSSavedModel.list_versions()

get_settings()#

Returns the settings of the MLFlow model version

Returns:: settings of the MLFlow model version
Return type:: dataikuapi.dss.savedmodel.MLFlowVersionSettings

set_core_metadata(target_column_name, class_labels=None, get_features_from_dataset=None, features_list=None, container_exec_config_name='LOCAL-CONFIG')#

Sets metadata for this MLFlow model version

In addition to target_column_name, one of get_features_from_dataset or features_list must be passed in order to be able to evaluate performance

Parameters:

target_column_name (str) – name of the target column. Mandatory in order to be able to evaluate performance
class_labels (list or None) – List of strings, ordered class labels. Mandatory in order to be able to evaluate performance on classification models
get_features_from_dataset (str or None) – name of a dataset to get feature names from
features_list (list or None) – list of {"name": "feature_name", "type": "feature_type"}
container_exec_config_name (str) –
name of the containerized execution configuration to use for running the evaluation process. * If value is “LOCAL-CONFIG”, set the variable to the active containerized configuration name. If running locally, set it to “NONE”. If the config name can’t be determined, fallback to “INHERIT”. * If value is “INHERIT”, the container execution configuration of the project will be used. * If value is “NONE”, local execution will be used (no container)

(defaults to LOCAL-CONFIG)

evaluate(dataset_ref, container_exec_config_name='LOCAL-CONFIG', selection=None, use_optimal_threshold=True, skip_expensive_reports=True)#

Evaluates the performance of this model version on a particular dataset. After calling this, the “result screens” of the MLFlow model version will be available (confusion matrix, error distribution, performance metrics, …) and more information will be available when calling: dataikuapi.dss.savedmodel.DSSSavedModel.get_version_details()

Evaluation is available only for models having BINARY_CLASSIFICATION, MULTICLASS or REGRESSION as prediction type. See DSSProject.create_mlflow_pyfunc_model().

Important

set_core_metadata() must be called before you can evaluate a dataset

Parameters:

dataset_ref (str or dataikuapi.dss.dataset.DSSDataset or dataiku.Dataset) – Evaluation dataset to use
container_exec_config_name (str) –
Name of the containerized execution configuration to use for running the evaluation process. * If value is “LOCAL-CONFIG”, set the variable to the active containerized configuration name. If running locally, set it to “NONE”. If the config name can’t be determined, fallback to “INHERIT”. * If value is “INHERIT”, the container execution configuration of the project will be used. * If value is “NONE”, local execution will be used (no container)

(defaults to LOCAL-CONFIG)
selection (dict or DSSDatasetSelectionBuilder or None) –
Sampling parameter for the evaluation.
- Example 1: DSSDatasetSelectionBuilder().with_head_sampling(100)
- Example 2: {"samplingMethod": "HEAD_SEQUENTIAL", "maxRecords": 100}
(defaults to None)
use_optimal_threshold (bool) – Choose between optimized or actual threshold. Optimized threshold has been computed according to the metric set on the saved model setting (i.e. prediction_metrics_settings['thresholdOptimizationMetric']) (defaults to True)
skip_expensive_reports (boolean) – Skip expensive/slow reports (e.g. feature importance).

class dataikuapi.dss.savedmodel.MLFlowVersionSettings(version_handler, data)#

Handle for the settings of an imported MLFlow model version.

Important

Do not create this class directly, instead use dataikuapi.dss.savedmodel.ExternalModelVersionHandler.get_settings()

Parameters:

version_handler (dataikuapi.dss.savedmodel.ExternalModelVersionHandler) – handler to interact with an external model version
data (dict) – raw settings of the imported MLFlow model version

property raw#

Returns:: The raw settings of the imported MLFlow model version
Return type:: dict

save()#: Saves the settings of this MLFlow model version

dataiku.Model#

class dataiku.Model(lookup, project_key=None, ignore_flow=False)

Handle to interact with a saved model.

Note

This class is also available as dataiku.Model

Parameters:

lookup (string) – name or identifier of the saved model
project_key (string) – project key of the saved model, if it is not in the current project. (defaults to None, i.e. current project)
ignore_flow (boolean) – if True, create the handle regardless of whether the saved model is an input or output of the recipe (defaults to False)

static list_models(project_key=None)

Retrieves the list of saved models of the given project.

Parameters:

project_key (str) – key of the project from which to list models. (defaults to None, i.e. current project)

Returns:

a list of the saved models of the project, as dict. Each dict contains at least the following fields:

id: identifier of the saved model
name: name of the saved model
type: type of saved model (CLUSTERING / PREDICTION)
backendType: backend type of the saved model (PY_MEMORY / KERAS / MLLIB / H2O / DEEP_HUB)
versionsCount: number of versions in the saved model

Return type:

list[dict]

get_info()

Gets the model information.

Returns:

the model information. Fields are:

id : identifier of the saved model
projectKey : project key of the saved model
name : name of the saved model
type: type of saved model (CLUSTERING / PREDICTION)

Return type:

dict

get_id()

Gets the identifier of the model.

Return type:: str

get_name()

Gets the name of the model

Return type:: str

get_type()

Gets the type of the model.

Returns:: the model type (PREDICTION / CLUSTERING)
Return type:: str

get_definition()

Gets the model definition.

Return type:: dict

list_versions()

Lists the model versions.

Note

The versionId field can be used to call the activate_version() method.

Returns:

Information about versions of the saved model, as a list of dict. Fields are:

versionId: identifier of the model version
active: whether this version is active or not
snippet: detailed dict containing version information

Return type:

list[dict]

activate_version(version_id)

Activates the given version of the model.

Parameters:: version_id (str) – the identifier of the version to activate

get_version_metrics(version_id)

Gets the training metrics of a given version of the model.

Parameters:: version_id (str) – the identifier of the version from which to retrieve metrics
Return type:: dataiku.core.saved_model.SavedModelVersionMetrics

get_version_checks(version_id)

Gets the training checks of the given version of the model.

Parameters:: version_id (str) – the identifier of the version from which to retrieve checks
Return type:: dataiku.core.metrics.ComputedChecks

save_external_check_values(values_dict, version_id)

Saves checks on the model, the checks are saved with the type “external”.

Parameters:

values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names
version_id (str) – the identifier of the version for which checks should be saved

Return type:

dict

get_predictor(version_id=None, optimize='BATCH')

Returns a Predictor for the given version of the model.

Note

This predictor can then be used to preprocess and make predictions on a dataframe.

Parameters:

optimize – If set to LATENCY, attempts (if the model is compatible) to build a predictor optimized for latency (leveraging the dataikuscoring package). In this case, beware that the returned Predictor only supports the predictor.predict(df, with_input_cols=False, with_prediction=True, with_probas=False) method, with no other arguments passed.
version_id (str) – the identifier of the version from which to build the predictor (defaults to None, current active version)

Returns:

The predictor built from the given version of this model

Return type:

Union[dataiku.core.saved_model.BasePredictor, dataiku.core.saved_model.DkuScoringPredictor]

create_finetuned_llm_version(connection_name, quantization=None, set_active=True)

Creates a new fine-tuned LLM version, using a context manager (experimental)

Simple usage:

with saved_model.create_finetuned_llm_version("MyLocalHuggingfaceConnection") as finetuned_llm_version:
    # write model files to finetuned_llm_version.working_directory
# the new version is now available

Parameters:

connection_name (str) – name of the connection to link this version
quantization (str) – quantization mode, must be one of [None, “Q_4BIT”, “Q_8BIT”] (default: None)
set_active (bool) – if True, set the new version as active for this saved model (default: True)

Returns:

yields a FinetunedLLMVersionTrainingParameters object

class dataiku.core.saved_model.Predictor(params, preprocessing, features, clf)

Object allowing to preprocess and make predictions on a dataframe.

get_features(): Returns the feature names generated by this predictor’s preprocessing

predict(df, with_input_cols=False, with_prediction=True, with_probas=True, with_conditional_outputs=False, with_proba_percentile=False, with_explanations=False, explanation_method='ICE', n_explanations=3, n_explanations_mc_steps=100, **kwargs)

Predict a dataframe. The results are returned as a dataframe with columns corresponding to the various prediction information.

Parameters:

with_input_cols – whether the input columns should also be present in the output
with_prediction – whether the prediction column should be present
with_probas – whether the probability columns should be present
with_conditional_outputs – whether the conditional outputs for this model should be present (binary classif)
with_proba_percentile – whether the percentile of the probability should be present (binary classif)
with_explanations – whether explanations should be computed for each prediction
explanation_method – method to compute the explanations
n_explanations – number of explanations to output for each prediction
n_explanations_mc_steps – number of Monte Carlo steps for SHAPLEY method (higher means more precise but slower)

preformat(df): Formats data originating from json (api node, interactive scoring) so that it’s compatible with preprocess

preprocess(df): Preprocess a dataframe. The results are returned as a numpy 2-dimensional matrix (which may be sparse). The columns of this matrix correspond to the generated features, which can be listed by the get_features property of this Predictor.

get_preprocessing()

Algorithm details#

This section documents which algorithms are available, and some of the settings for them.

These algorithm names can be used for dataikuapi.dss.ml.DSSMLTaskSettings.get_algorithm_settings() and dataikuapi.dss.ml.DSSMLTaskSettings.set_algorithm_enabled()

Note

This documentation does not cover all settings of all algorithms. To know which settings are available for an algorithm, use mltask_settings.get_algorithm_settings('ALGORITHM_NAME') and print the returned dictionary.

Generally speaking, most algorithm settings which are arrays means that this parameter can be grid-searched. All values will be tested as part of the hyperparameter optimization.

For more documentation of settings, please refer to the UI of the visual machine learning, which contains detailed documentation for all algorithm parameters

LOGISTIC_REGRESSION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:

{
    "multi_class": SingleCategoryHyperparameterSettings, # accepted valued: ['multinomial', 'ovr']
    "penalty": CategoricalHyperparameterSettings, # possible values: ["l1", "l2"]
    "C": NumericalHyperparameterSettings, # scaling: "LOGARITHMIC"
    "n_jobs": 2
}

RANDOM_FOREST_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:

{
    "n_estimators": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "min_samples_leaf": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_tree_depth": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_feature_prop": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_features": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "selection_mode": SingleCategoryHyperparameterSettings, # accepted_values=['auto', 'sqrt', 'log2', 'number', 'prop']
    "n_jobs": 4
}

RANDOM_FOREST_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY
Main parameters: same as RANDOM_FOREST_CLASSIFICATION

EXTRA_TREES#

Type: Prediction (all kinds)
Available on backend: PY_MEMORY

RIDGE_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

LASSO_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

LEASTSQUARE_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

SVC_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY

SVM_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

SGD_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY

SGD_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

GBT_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY

GBT_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

DECISION_TREE_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY

DECISION_TREE_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

LIGHTGBM_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY

LIGHTGBM_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

XGBOOST_CLASSIFICATION#

Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY

XGBOOST_REGRESSION#

Type: Prediction (regression)
Available on backend: PY_MEMORY

NEURAL_NETWORK#

Type: Prediction (all kinds)
Available on backend: PY_MEMORY

KNN#

Type: Prediction (all kinds)
Available on backend: PY_MEMORY

LARS#

Type: Prediction (all kinds)
Available on backend: PY_MEMORY

MLLIB_LOGISTIC_REGRESSION#

Type: Prediction (binary or multiclass)
Available on backend: MLLIB

MLLIB_DECISION_TREE#

Type: Prediction (all kinds)
Available on backend: MLLIB

MLLIB_RANDOM_FOREST#

Type: Prediction (all kinds)
Available on backend: MLLIB

MLLIB_GBT#

Type: Prediction (all kinds)
Available on backend: MLLIB

MLLIB_LINEAR_REGRESSION#

Type: Prediction (regression)
Available on backend: MLLIB

MLLIB_NAIVE_BAYES#

Type: Prediction (all kinds)
Available on backend: MLLIB

Other#

SCIKIT_MODEL
MLLIB_CUSTOM