Machine learning#
For usage information and examples, see Visual Machine learning
API Reference#
Interaction with a ML Task#
- class dataikuapi.dss.ml.DSSMLTask(client, project_key, analysis_id, mltask_id)#
A handle to interact with a ML Task for prediction or clustering in a DSS visual analysis.
Important
To create a new ML Task, use one of the following methods:
dataikuapi.dss.project.DSSProject.create_prediction_ml_task(),
dataikuapi.dss.project.DSSProject.create_clustering_ml_task(), or
dataikuapi.dss.project.DSSProject.create_timeseries_forecasting_ml_task().
- static from_full_model_id(client, fmi, project_key=None)#
Static method returning a DSSMLTask object representing a pre-existing ML Task
- delete()#
Deletes the ML task
- wait_guess_complete()#
Waits for the ML Task guessing to be complete.
This should be called immediately after the creation of a new ML Task if the ML Task was created with wait_guess_complete=False, before calling get_settings() or train().
- get_status()#
Gets the status of this ML Task
- Returns:
A dictionary containing the ML Task status
- Return type:
dict
- get_settings()#
Gets the settings of this ML Task.
This should be used whenever you need to modify the settings of an existing ML Task.
- Returns:
A DSSMLTaskSettings object.
- Return type:
- train(session_name=None, session_description=None, run_queue=False)#
Trains models for this ML Task.
This method waits for training to complete. If you instead want to train asynchronously, use start_train() and wait_train_complete().
This method returns a list of trained model identifiers. These refer to models that have been trained during this specific training session, rather than all of the trained models available on this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().
These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow().
- Parameters:
session_name (str, optional) – Optional name for the session (defaults to None)
session_description (str, optional) – Optional description for the session (defaults to None)
run_queue (bool) – Whether to run any queued sessions after the training completes (defaults to False)
- Returns:
A list of model identifiers
- Return type:
list[str]
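Usage example (a hedged sketch; the host, API key and full model id below are placeholders, the full model id format being taken from the DSSMLTaskSettings example further down):
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")  # placeholder host and key
fmi = "A-DKU_CHURN-RADgquHe-5nJtl88L-s1-pp1-m1"                           # example full model id
mltask = dataikuapi.dss.ml.DSSMLTask.from_full_model_id(client, fmi, project_key="DKU_CHURN")

# Train synchronously; only the models trained in this session are returned
model_ids = mltask.train(session_name="baseline session")
for model_id in model_ids:
    snippet = mltask.get_trained_model_snippet(model_id)  # quick summary dict; keys depend on the task type
    print(model_id, snippet.get("algorithm"))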
- ensemble(model_ids, method)#
Creates an ensemble model from a set of models.
This method waits for the ensemble training to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete().
This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().
The returned identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow().
- Parameters:
model_ids (list[str]) – A list of model identifiers to ensemble (must not be empty)
method (str) – The ensembling method. Must be one of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL
- Returns:
The model identifier of the resulting ensemble model
- Return type:
str
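For example, a brief sketch reusing the mltask handle from the train() example above (the session id "s1" is an assumption, matching the full model id format shown there):
model_ids = mltask.get_trained_models_ids(session_id="s1")   # assumed session id
ensemble_id = mltask.ensemble(model_ids, method="PROBA_AVERAGE")
print("Ensemble model:", ensemble_id)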
- start_train(session_name=None, session_description=None, run_queue=False)#
Asynchronously starts a new training session for this ML Task.
This returns immediately, before training is complete. To wait for training to complete, use wait_train_complete().
- Parameters:
session_name (str, optional) – Optional name for the session (defaults to None)
session_description (str, optional) – Optional description for the session (defaults to None)
run_queue (bool) – Whether to run any queued sessions after the training completes (defaults to False)
- start_ensembling(model_ids, method)#
Asynchronously creates an ensemble model from a set of models.
This returns immediately, before training is complete. To wait for training to complete, use wait_train_complete().
- Parameters:
model_ids (list[str]) – A list of model identifiers to ensemble (must not be empty)
method (str) – The ensembling method. Must be one of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL
- Returns:
The model identifier of the ensemble
- Return type:
str
- wait_train_complete()#
Waits for training to be completed.
To be used following any asynchronous training started with start_train() or start_ensembling().
- get_trained_models_ids(session_id=None, algorithm=None)#
Gets the list of trained model identifiers for this ML task.
These identifiers can be used for get_trained_model_snippet() and deploy_to_flow().
The two optional filter params can be used together.
- Parameters:
session_id (str, optional) – Optional filter to return only IDs of models from a specific session.
algorithm (str, optional) – Optional filter to return only IDs of models with a specific algorithm.
- Returns:
A list of model identifiers
- Return type:
list[str]
- get_trained_model_snippet(id=None, ids=None)#
Gets a quick summary of a trained model, as a dict.
This method can either be given a single model id, via the id param, or a list of model ids, via the ids param.
For complete model information and a structured object, use get_trained_model_details().
- Parameters:
id (str, optional) – A model id (defaults to None)
ids (list[str]) – A list of model ids (defaults to None)
- Returns:
Either a quick summary of one trained model as a dict, or a list of model summary dicts
- Return type:
Union[dict, list[dict]]
- get_trained_model_details(id)#
Gets details for a trained model.
- Parameters:
id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
- Returns:
A DSSTrainedPredictionModelDetails or DSSTrainedClusteringModelDetails representing the details of this trained model.
- Return type:
Union[DSSTrainedPredictionModelDetails, DSSTrainedClusteringModelDetails]
- delete_trained_model(model_id)#
Deletes a trained model
- Parameters:
model_id (str) – Model identifier, as returned by get_trained_models_ids()
- train_queue()#
Trains each session in this ML Task’s queue, until the queue is empty or paused.
- Returns:
A dict including the next sessionID to be trained in the queue
- Return type:
dict
- deploy_to_flow(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)#
Deploys a trained model from this ML Task to the flow.
Creates a new saved model and its parent training recipe in the Flow.
- Parameters:
model_id (str) – Model identifier, as returned by get_trained_models_ids()
model_name (str) – Name of the saved model when deployed to the Flow
train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
test_dataset (str, optional) – Name of the dataset to use as test set. If None (default), the train/test split will be applied over the train set. Only for PREDICTION tasks. May either be a short name or a PROJECT.name long name (when using a shared dataset).
redo_optimization (bool) – Whether to redo the hyperparameter optimization phase (defaults to True). Only for PREDICTION tasks.
- Returns:
A dict containing: “savedModelId” and “trainRecipeName” - Both can be used to obtain further handles
- Return type:
dict
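Usage example (a hedged sketch reusing the client and mltask handles from the train() example above; the saved model name and dataset name are assumptions):
project = client.get_project("DKU_CHURN")
model_ids = mltask.get_trained_models_ids()
deployment = mltask.deploy_to_flow(
    model_ids[0],
    model_name="churn_prediction",    # assumed saved model name
    train_dataset="customers_train",  # assumed dataset name
)
saved_model = project.get_saved_model(deployment["savedModelId"])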
- redeploy_to_flow(model_id, recipe_name=None, saved_model_id=None, activate=True)#
Redeploys a trained model from this ML Task to an existing saved model and training recipe in the flow.
Either the training recipe recipe_name or the saved_model_id needs to be specified.
- Parameters:
model_id (str) – Model identifier, as returned by get_trained_models_ids()
recipe_name (str, optional) – Name of the training recipe to update (defaults to None)
saved_model_id (str, optional) – Name of the saved model to update (defaults to None)
activate (bool) – If True (default), make the newly deployed model version become the active version
- Returns:
A dict containing: “impactsDownstream” - whether the active saved model version changed and downstream recipes are impacted
- Return type:
dict
- remove_unused_splits()#
Deletes all stored split data that is no longer in use for this ML Task.
You should generally not need to call this method.
- remove_all_splits()#
Deletes all stored split data for this ML Task.
This operation saves disk space.
After performing this operation, it will not be possible anymore to:
Ensemble already trained models
View the “predicted data” or “charts” for already trained models
Resume training of models for which optimization had been previously interrupted
Training new models remains possible
- guess(prediction_type=None, reguess_level=None, target_variable=None, timeseries_identifiers=None, time_variable=None, full_reguess=None)#
Reguess the settings of the ML Task.
When no optional parameters are given, this will reguess all the settings of the ML Task.
For prediction ML tasks only, a new target variable or prediction type can be passed, and this will subsequently reguess the impacted settings.
- Parameters:
prediction_type (str, optional) – The desired prediction type. Only valid for prediction tasks of either BINARY_CLASSIFICATION, MULTICLASS or REGRESSION type, ignored otherwise. Cannot be set if either target_variable, time_variable, or timeseries_identifiers is also specified. (defaults to None)
target_variable (str, optional) – The desired target variable. Only valid for prediction tasks, ignored for clustering. Cannot be set if either prediction_type, time_variable, or timeseries_identifiers is also specified. (defaults to None)
timeseries_identifiers (list[str], optional) – Only valid for time series forecasting tasks. List of columns to be used as time series identifiers. Cannot be set if either prediction_type, target_variable, or time_variable is also specified. (defaults to None)
time_variable (str, optional) – The desired time variable column. Only valid for time series forecasting tasks. Cannot be set if either prediction_type, target_variable, or timeseries_identifiers is also specified. (defaults to None)
full_reguess (bool, optional) – Scope of the reguess process: whether it should reguess all the settings after changing a core parameter, or only reguess impacted settings (e.g. target remapping when changing the target, metrics when changing the prediction type…). Ignored if no core parameter is given. Only valid for prediction tasks and therefore also ignored for clustering. (defaults to True)
reguess_level (str, optional) –
Deprecated, use full_reguess instead. Only valid for prediction tasks. Can be one of the following values:
TARGET_CHANGE: Change the target if target_variable is specified, reguess the target remapping, and clear the model’s assertions if any. Equivalent to full_reguess=False (recommended usage)
FULL_REGUESS: All the settings of the ML task are reguessed. Equivalent to full_reguess=True (recommended usage)
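For example, a minimal sketch (the target column name is an assumption) that changes the target of a prediction task and reguesses all impacted settings:
mltask.guess(target_variable="churned", full_reguess=True)  # "churned" is an assumed column name
settings = mltask.get_settings()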
Manipulation of settings#
- class dataikuapi.dss.ml.HyperparameterSearchSettings(raw_settings)#
Object to read and modify hyperparameter search settings.
This is available for all non-clustering ML Tasks.
Important
Do not create this class directly, use AbstractTabularPredictionMLTaskSettings.get_hyperparameter_search_settings()
- property strategy#
- Returns:
The hyperparameter search strategy. Will be one of “GRID” | “RANDOM” | “BAYESIAN”.
- Return type:
str
- set_grid_search(shuffle=True, seed=1337)#
Sets the search strategy to “GRID”, to perform a grid-search over the hyperparameters.
- Parameters:
shuffle (bool) – if True (default), iterate over a shuffled grid as opposed to lexicographical iteration over the cartesian product of the hyperparameters
seed (int) – Seed value used to ensure reproducible results (defaults to 1337)
- set_random_search(seed=1337)#
Sets the search strategy to “RANDOM”, to perform a random search over the hyperparameters.
- Parameters:
seed (int) – Seed value used to ensure reproducible results (defaults to 1337)
- set_bayesian_search(seed=1337)#
Sets the search strategy to “BAYESIAN”, to perform a Bayesian search over the hyperparameters.
- Parameters:
seed (int) – Seed value used to ensure reproducible results (defaults to 1337)
- property validation_mode#
- Returns:
The cross-validation strategy. Will be one of “KFOLD” | “SHUFFLE” | “TIME_SERIES_KFOLD” | “TIME_SERIES_SINGLE_SPLIT” | “CUSTOM”.
- Return type:
str
- property fold_offset#
- Returns:
Whether there is an offset between validation sets, to avoid overlap between cross-test sets (model evaluation) and cross-validation sets (hyperparameter search), if both are using k-fold. Only relevant for time series forecasting
- Return type:
bool
- property equal_duration_folds#
- Returns:
Whether every fold in cross-test and cross-validation should be of equal duration when using k-fold. Only relevant for time series forecasting.
- Return type:
bool
- property cv_seed#
- Returns:
cross-validation seed for splitting the data during hyperparameter search
- Return type:
int
- set_kfold_validation(n_folds=5, stratified=True, cv_seed=1337)#
Sets the validation mode to k-fold cross-validation.
The mode will be set to either “KFOLD” or “TIME_SERIES_KFOLD”, depending on whether time-based ordering is enabled.
- Parameters:
n_folds (int) – The number of folds used for the hyperparameter search (defaults to 5)
stratified (bool) – If True, keep the same proportion of each target class in all folds (defaults to True)
cv_seed (int) – Seed for cross-validation (defaults to 1337)
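Usage example (a hedged sketch reusing the mltask handle from the examples above) switching a prediction task to random search with 3-fold cross-validation:
settings = mltask.get_settings()
search = settings.get_hyperparameter_search_settings()
search.set_random_search(seed=42)
search.set_kfold_validation(n_folds=3, stratified=True)
settings.save()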
- set_single_split_validation(split_ratio=0.8, stratified=True, cv_seed=1337)#
Sets the validation mode to single split.
The mode will be set to either “SHUFFLE” or “TIME_SERIES_SINGLE_SPLIT”, depending on whether time-based ordering is enabled.
- Parameters:
split_ratio (float) – The ratio of the data used for training during hyperparameter search (defaults to 0.8)
stratified (bool) – If True, keep the same proportion of each target class in both splits (defaults to True)
cv_seed (int) – Seed for cross-validation (defaults to 1337)
- set_custom_validation(code=None)#
Sets the validation mode to “CUSTOM”, and sets the custom validation code.
Your code must create a 'cv' variable. This 'cv' must be compatible with the scikit-learn 'CV splitter' class family.
Example splitter classes can be found here: https://scikit-learn.org/stable/modules/classes.html#splitter-classes
See also: https://scikit-learn.org/stable/glossary.html#term-CV-splitter
This example code uses the ‘repeated K-fold’ splitter of scikit-learn:
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=3, n_repeats=5)
- Parameters:
code (str) – definition of the validation
- set_search_distribution(distributed=False, n_containers=4)#
Sets the distribution parameters for the hyperparameter search execution.
- Parameters:
distributed (bool) – if True, distribute search in the Kubernetes cluster selected in the runtime environment’s containerized execution configuration (defaults to False)
n_containers (int) – number of containers to use for the distributed search (defaults to 4)
- property distributed#
- Returns:
Whether the search is set to distributed
- Return type:
bool
- property timeout#
- Returns:
The search timeout
- Return type:
int
- property n_iter#
- Returns:
The number of search iterations
- Return type:
int
- property parallelism#
- Returns:
The number of threads used for the search
- Return type:
int
- class dataikuapi.dss.ml.DSSMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#
Object to read and modify the settings of an existing ML task.
Important
Do not create this class directly, use DSSMLTask.get_settings() instead.
Usage example:
project_key = 'DKU_CHURN'
fmi = 'A-DKU_CHURN-RADgquHe-5nJtl88L-s1-pp1-m1'
client = dataiku.api_client()
task = dataikuapi.dss.ml.DSSMLTask.from_full_model_id(client, fmi, project_key)
task_settings = task.get_settings()
task_settings.set_diagnostics_enabled(False)
task_settings.save()
- get_raw()#
Gets the raw settings of this ML Task.
This returns a reference to the raw settings, rather than a copy, so any changes made to the returned object will be reflected when saving.
- Returns:
The raw settings of this ML Task
- Return type:
dict
- get_feature_preprocessing(feature_name)#
Gets the feature preprocessing parameters for a particular feature.
This returns a reference to the selected feature’s settings, rather than a copy, so any changes made to the returned object will be reflected when saving.
- Parameters:
feature_name (str) – Name of the feature whose parameters will be returned
- Returns:
A dict of the preprocessing settings for a feature
- Return type:
dict
- foreach_feature(fn, only_of_type=None)#
Applies a function to all features, including REJECTED features, except for the target feature
- Parameters:
fn (function) – Function handle of the form fn(feature_name, feature_params) -> dict, where feature_name is the feature name as a str, and feature_params is a dict containing the specific feature params. The function should return a dict of edited parameters for the feature.
only_of_type (Union[str, None], optional) – If set, only applies the function to features matching the given type. Must be one of CATEGORY, NUMERIC, TEXT or VECTOR.
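For example, a hedged sketch forcing MINMAX rescaling on every numeric feature (the "rescaling" key is an assumption about the per-feature preprocessing params; print feature_params to check the actual schema):
def force_minmax(feature_name, feature_params):
    feature_params["rescaling"] = "MINMAX"  # assumed key of the numeric preprocessing params
    return feature_params

settings = mltask.get_settings()
settings.foreach_feature(force_minmax, only_of_type="NUMERIC")
settings.save()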
- reject_feature(feature_name)#
Marks a feature as ‘rejected’, disabling it from being used as an input when training. This reverses the effect of the use_feature() method.
- Parameters:
feature_name (str) – Name of the feature to reject
- use_feature(feature_name)#
Marks a feature to be used (enabled) as an input when training. This reverses the effect of the reject_feature() method.
- Parameters:
feature_name (str) – Name of the feature to use/enable
- get_algorithm_settings(algorithm_name)#
Caution
Not Implemented, throws NotImplementedError
- get_diagnostics_settings()#
Gets the ML Task’s diagnostics settings.
This returns a reference to the diagnostics’ settings, rather than a copy, so changes made to the returned object will be reflected when saving.
This method returns a dictionary of the settings with:
enabled (boolean): Indicates if the diagnostics are enabled globally; if False, all diagnostics will be disabled
- settings (List[dict]): A list of dicts.
Each dict will contain the following:
type (str): The diagnostic type name, in uppercase
enabled (boolean): Indicates if the diagnostic type is enabled. If False, all diagnostics of that type will be disabled
Please refer to the documentation for details on available diagnostics.
- Returns:
A dict of diagnostics settings
- Return type:
dict
- set_diagnostics_enabled(enabled)#
Globally enables or disables the calculation of all diagnostics
- Parameters:
enabled (bool) – True if the diagnostics should be enabled, False otherwise
- set_diagnostic_type_enabled(diagnostic_type, enabled)#
Enables or disables the calculation of a set of diagnostics given their type.
Attention
This is overridden by whether diagnostics are enabled globally; if diagnostics are disabled globally, nothing will be calculated.
Diagnostics can be enabled/disabled globally via the set_diagnostics_enabled() method.
Usage example:
mltask_settings = task.get_settings()
mltask_settings.set_diagnostics_enabled(True)
mltask_settings.set_diagnostic_type_enabled("ML_DIAGNOSTICS_DATASET_SANITY_CHECKS", False)
mltask_settings.set_diagnostic_type_enabled("ML_DIAGNOSTICS_LEAKAGE_DETECTION", False)
mltask_settings.save()
Please refer to the documentation for details on available diagnostics.
- Parameters:
diagnostic_type (str) – Name of the diagnostic type, in uppercase.
enabled (bool) – True if the diagnostic should be enabled, False otherwise
- set_algorithm_enabled(algorithm_name, enabled)#
Enables or disables an algorithm given its name.
Exact algorithm names can be found using the get_all_possible_algorithm_names() method.
Please refer to the documentation for further information on available algorithms.
- Parameters:
algorithm_name (str) – Name of the algorithm, in uppercase.
enabled (bool) – True if the algorithm should be enabled, False otherwise
- disable_all_algorithms()#
Disables all algorithms
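Usage example (a hedged sketch; the algorithm names below are typical in-memory engine identifiers and should be checked against get_all_possible_algorithm_names()):
settings = mltask.get_settings()
settings.disable_all_algorithms()
for name in ("RANDOM_FOREST_CLASSIFICATION", "LOGISTIC_REGRESSION"):  # assumed identifiers
    settings.set_algorithm_enabled(name, True)
settings.save()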
- get_all_possible_algorithm_names()#
Gets the list of possible algorithm names
This can be used to find the list of valid identifiers for the set_algorithm_enabled() and get_algorithm_settings() methods.
This includes all possible algorithms, regardless of the prediction kind (regression/classification etc.) or engine, so some algorithms may be irrelevant to the current task.
- Returns:
The list of algorithm names as a list of strings
- Return type:
list[str]
- get_enabled_algorithm_names()#
Gets the list of enabled algorithm names
- Returns:
The list of enabled algorithm names
- Return type:
list[str]
- get_enabled_algorithm_settings()#
Gets the settings for each enabled algorithm
Returns a dictionary where:
Each key is the name of an enabled algorithm
Each value is the result of calling get_algorithm_settings() with the key as the parameter
- Returns:
The dict of enabled algorithm names with their settings
- Return type:
dict
- set_metric(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False, custom_metric_name=None)#
Sets the score metric to optimize for a prediction ML Task
When using a custom optimisation metric, the metric parameter must be kept as None, and a string containing the metric code should be passed to the custom_metric parameter.
- Parameters:
metric (str, optional) – Name of the metric to use. Must be left empty to use a custom metric (defaults to None).
custom_metric (str, optional) – Code for the custom optimisation metric (defaults to None)
custom_metric_greater_is_better (bool, optional) – Whether the custom metric function returns a score (True, default) or a loss (False). Score functions return higher values as the model improves, whereas loss functions return lower values.
custom_metric_use_probas (bool, optional) – For classification, whether the custom metric function receives the class probabilities (True) or the predicted values (False) (defaults to False)
custom_metric_name (str, optional) – Name of your custom metric. If not set, a name will be generated.
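For example, a hedged sketch of setting a custom metric (the score(y_valid, y_pred) signature expected by the custom metric code is an assumption here; check the in-product help of the metric editor):
custom_code = """
def score(y_valid, y_pred):
    # assumed signature; must return a float
    import numpy as np
    return float(np.mean(y_valid == y_pred))
"""
settings = mltask.get_settings()
settings.set_metric(custom_metric=custom_code,
                    custom_metric_greater_is_better=True,
                    custom_metric_name="exact_match_rate")
settings.save()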
- add_custom_python_model(name='Custom Python Model', code='')#
Adds a new custom python model and enables it.
Your code must create a ‘clf’ variable. This clf must be a scikit-learn compatible estimator, i.e., it should:
have at least fit(X,y) and predict(X) methods
inherit sklearn.base.BaseEstimator
handle the attributes in the __init__ function
have a classes_ attribute (for classification tasks)
have a predict_proba method (optional)
Example:
mltask_settings = task.get_settings()
code = """
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=20)
"""
mltask_settings.add_custom_python_model(name="sklearn adaboost custom", code=code)
mltask_settings.save()
See: https://doc.dataiku.com/dss/latest/machine-learning/custom-models.html
- Parameters:
name (str) – The name of the custom model (defaults to “Custom Python Model”)
code (str) – The code for the custom model (defaults to “”)
- add_custom_mllib_model(name='Custom MLlib Model', code='')#
Adds a new custom MLlib model and enables it
This example has sample code that uses a standard MLlib algorithm, the RandomForestClassifier:
mltask_settings = task.get_settings()
code = """
// import the Estimator from spark.ml
import org.apache.spark.ml.classification.RandomForestClassifier

// instantiate the Estimator
new RandomForestClassifier()
  .setLabelCol("Survived")           // Must be the target column
  .setFeaturesCol("__dku_features")  // Must always be __dku_features
  .setPredictionCol("prediction")    // Must always be prediction
  .setNumTrees(50)
  .setMaxDepth(8)
"""
mltask_settings.add_custom_mllib_model(name="spark random forest custom", code=code)
mltask_settings.save()
- Parameters:
name (str) – The name of the custom model (defaults to “Custom MLlib Model”)
code (str) – The code for the custom model (defaults to “”)
- save()#
Saves the settings back to the ML Task
- class dataikuapi.dss.ml.DSSPredictionMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#
- class PredictionTypes#
Possible prediction types
- BINARY = 'BINARY_CLASSIFICATION'#
- REGRESSION = 'REGRESSION'#
- MULTICLASS = 'MULTICLASS'#
- OTHER = 'OTHER'#
- get_all_possible_algorithm_names()#
Returns the list of possible algorithm names.
This includes the names of algorithms from installed plugins.
This can be used as the list of valid identifiers for set_algorithm_enabled() and get_algorithm_settings().
This includes all possible algorithms, regardless of the prediction kind (regression/classification) or the engine, so some algorithms may be irrelevant to the current task.
- Returns:
The list of algorithm names
- Return type:
list[str]
- get_enabled_algorithm_names()#
Gets the list of enabled algorithm names
- Returns:
The list of enabled algorithm names
- Return type:
list[str]
- get_algorithm_settings(algorithm_name)#
Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns the settings for this algorithm as a PredictionAlgorithmSettings (extended dict). All algorithm dicts have at least an “enabled” property/key in the settings. The “enabled” property/key indicates whether this algorithm will be trained.
Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise properties/keys for each algorithm are not all documented. You can print the returned AlgorithmSettings to learn more about the settings of each particular algorithm.
Please refer to the documentation for details on available algorithms.
- Parameters:
algorithm_name (str) – Name of the algorithm, in uppercase.
- Returns:
A PredictionAlgorithmSettings (extended dict) for one of the built-in prediction algorithms
- Return type:
PredictionAlgorithmSettings
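For example, a brief sketch reusing the settings handle from the examples above (the algorithm name is an assumption; the hyperparameter keys are intentionally not guessed, print the returned object to discover them):
rf = settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")  # assumed algorithm name
print(rf)             # inspect the available hyperparameter keys for this algorithm
rf["enabled"] = True  # "enabled" is present for all algorithms, as documented above
settings.save()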
- split_ordered_by(feature_name, ascending=True)#
Deprecated. Use split_params.set_time_ordering()
- remove_ordered_split()#
Deprecated. Use split_params.unset_time_ordering()
- use_sample_weighting(feature_name)#
Deprecated. Use set_weighting()
- set_weighting(method, feature_name=None)#
Sets the method for weighting samples.
If there was a WEIGHT feature declared previously, it will be set back as an INPUT feature first.
- Parameters:
method (str) – Weighting method to use. One of NO_WEIGHTING, SAMPLE_WEIGHT (requires a feature name), CLASS_WEIGHT or CLASS_AND_SAMPLE_WEIGHT (requires a feature name)
feature_name (str, optional) – Name of the feature to use as sample weight
- remove_sample_weighting()#
Deprecated. Use set_weighting(method=”NO_WEIGHTING”) instead
- get_assertions_params()#
Retrieves the ML Task assertion parameters
- Returns:
The assertions parameters for this ML task
- Return type:
DSSMLAssertionsParams
- get_hyperparameter_search_settings()#
Gets the hyperparameter search parameters of the current DSSPredictionMLTaskSettings instance as a HyperparameterSearchSettings object. This object can be used to both get and set properties relevant to hyperparameter search, such as search strategy, cross-validation method, execution limits and parallelism.
- Returns:
A HyperparameterSearchSettings
- Return type:
- get_prediction_type()#
- get_split_params()#
Gets a handle to modify train/test splitting params.
- Return type:
- property split_params#
Deprecated. Use get_split_params()
- class dataikuapi.dss.ml.DSSClusteringMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#
- get_algorithm_settings(algorithm_name)#
Gets the training settings of a particular algorithm.
This returns a reference to the algorithm’s settings, rather than a copy, so any changes made to the returned object will be reflected when saving.
This method returns a dictionary of the settings for this algorithm. All algorithm dicts contain an “enabled” key, which indicates whether this algorithm will be trained
Other settings are algorithm-dependent and include the various hyperparameters of the algorithm. The precise keys for each algorithm are not all documented. You can print the returned dictionary to learn more about the settings of each particular algorithm.
Please refer to the documentation for details on available algorithms.
- Parameters:
algorithm_name (str) – Name of the algorithm, in uppercase.
- Returns:
A dict containing the settings for the algorithm
- Return type:
dict
- class dataikuapi.dss.ml.DSSTimeseriesForecastingMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)#
- get_time_step_params()#
Gets the time step parameters for the time series forecasting task.
This returns a reference to the time step parameters, rather than a copy, so any changes made to the returned object will be reflected when saving.
- Returns:
A dict of the time step parameters
- Return type:
dict
- set_time_step(time_unit=None, n_time_units=None, end_of_week_day=None, reguess=True, update_algorithm_settings=True, unit_alignment=None)#
Sets the time step parameters for the time series forecasting task.
- Parameters:
time_unit (str, optional) – time unit for forecasting step. Valid values are: MILLISECOND, SECOND, MINUTE, HOUR, DAY, BUSINESS_DAY, WEEK, MONTH, QUARTER, HALF_YEAR, YEAR (defaults to None, i.e. don’t change)
n_time_units (int, optional) – number of time units within a time step (defaults to None, i.e. don’t change)
end_of_week_day (int, optional) – only useful for the WEEK time unit. Valid values are: 1 (Sunday), 2 (Monday), …, 7 (Saturday) (defaults to None, i.e. don’t change)
reguess (bool) – Whether to reguess the ML task settings after changing the time step params (defaults to True)
update_algorithm_settings (bool) – Whether the algorithm settings should also be reguessed if reguessing the ML Task (defaults to True)
unit_alignment (int, optional) – month for each step when time_unit is QUARTER or YEAR, between 1 and 3 for QUARTER and 1 and 12 for YEAR (defaults to None, i.e. don’t change)
- get_resampling_params()#
Gets the time series resampling parameters for the time series forecasting task.
This returns a reference to the time series resampling parameters, rather than a copy, so any changes made to the returned object will be reflected when saving.
- Returns:
A dict of the resampling parameters
- Return type:
dict
- set_numerical_interpolation(method=None, constant=None)#
Sets the time series resampling numerical interpolation parameters
- Parameters:
method (str, optional) – Interpolation method. Valid values are: NEAREST, PREVIOUS, NEXT, LINEAR, QUADRATIC, CUBIC, CONSTANT (defaults to None, i.e. don’t change)
constant (float, optional) – Value for the CONSTANT interpolation method (defaults to None, i.e. don’t change)
- set_numerical_extrapolation(method=None, constant=None)#
Sets the time series resampling numerical extrapolation parameters
- Parameters:
method (str, optional) – Extrapolation method. Valid values are: PREVIOUS_NEXT, NO_EXTRAPOLATION, CONSTANT, LINEAR, QUADRATIC, CUBIC (defaults to None, i.e. don’t change)
constant (float, optional) – Value for the CONSTANT extrapolation method (defaults to None)
- set_categorical_imputation(method=None, constant=None)#
Sets the time series resampling categorical imputation parameters
- Parameters:
method (str, optional) – Imputation method. Valid values are: MOST_COMMON, NULL, CONSTANT, PREVIOUS_NEXT, PREVIOUS, NEXT (defaults to None, i.e. don’t change)
constant (str, optional) – Value for the CONSTANT imputation method (defaults to None, i.e. don’t change)
- set_duplicate_timestamp_handling(method)#
Sets the time series duplicate timestamp handling method
- Parameters:
method (str) – Duplicate timestamp handling method. Valid values are: FAIL_IF_CONFLICTING, DROP_IF_CONFLICTING, MEAN_MODE.
- property forecast_horizon#
- Returns:
Number of time steps to be forecast
- Return type:
int
- set_forecast_horizon(forecast_horizon, reguess=True, update_algorithm_settings=True, validation_horizons=None)#
Sets the time series forecast horizon
- Parameters:
forecast_horizon (int) – Number of time steps to be forecast
reguess (bool) – Whether to reguess the ML task settings after changing the forecast horizon (defaults to True)
update_algorithm_settings (bool) – Whether the algorithm settings should be reguessed after changing the forecast horizon (defaults to True)
validation_horizons (int|None) – The number of validation horizons to be set. If omitted, retains the previous ratio.
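Usage example (a hedged sketch assuming a time series forecasting ML Task handle named mltask):
ts_settings = mltask.get_settings()
ts_settings.set_time_step(time_unit="WEEK", n_time_units=1, end_of_week_day=1)  # weekly steps ending on Sunday
ts_settings.set_forecast_horizon(12)                                            # forecast 12 steps ahead
ts_settings.save()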
- property evaluation_gap#
- Returns:
Number of skipped time steps for evaluation
- Return type:
int
- property time_variable#
- Returns:
Feature used as time variable (read-only)
- Return type:
str
- property timeseries_identifiers#
- Returns:
Features used as time series identifiers (read-only copy)
- Return type:
list
- property quantiles_to_forecast#
- Returns:
List of quantiles to forecast
- Return type:
list
- property skip_too_short_timeseries_for_training#
- Returns:
Whether to skip time series that are too short during training, or to fail the whole training if even one time series is too short.
- Return type:
bool
- get_algorithm_settings(algorithm_name)#
Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns the settings for this algorithm as a PredictionAlgorithmSettings (extended dict). All algorithm dicts have at least an “enabled” property/key in the settings. The “enabled” property/key indicates whether this algorithm will be trained.
Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise properties/keys for each algorithm are not all documented. You can print the returned AlgorithmSettings to learn more about the settings of each particular algorithm.
Please refer to the documentation for details on available algorithms.
- Parameters:
algorithm_name (str) – Name of the algorithm, in uppercase.
- Returns:
A PredictionAlgorithmSettings (extended dict) for one of the built-in prediction algorithms
- Return type:
PredictionAlgorithmSettings
- get_assertions_params()#
Retrieves the assertions parameters for this ml task
- Return type:
DSSMLAssertionsParams
- get_hyperparameter_search_settings()#
Gets the hyperparameter search parameters of the current DSSPredictionMLTaskSettings instance as a HyperparameterSearchSettings object. This object can be used to both get and set properties relevant to hyperparameter search, such as search strategy, cross-validation method, execution limits and parallelism.
- Returns:
A HyperparameterSearchSettings
- Return type:
- get_prediction_type()#
- get_split_params()#
Gets a handle to modify train/test splitting params.
- Return type:
- property split_params#
Deprecated. Use get_split_params()
- class dataikuapi.dss.ml.PredictionSplitParamsHandler(mltask_settings)#
Object to modify the train/test dataset splitting params.
Important
Do not create this class directly, use DSSMLTaskSettings.get_split_params()
- SPLIT_PARAMS_KEY = 'splitParams'#
- get_raw()#
Gets the raw settings of the prediction split configuration.
This returns a reference to the raw settings, rather than a copy, so any changes made to the returned object will be reflected when saving.
- Returns:
The raw prediction split parameter settings
- Return type:
dict
- set_split_random(train_ratio=0.8, selection=None, dataset_name=None)#
Sets the train/test split mode to random splitting over an extract from a single dataset
- Parameters:
train_ratio (float) – Ratio of rows to use for the train set. Must be between 0 and 1 (defaults to 0.8)
selection (Union[dataikuapi.dss.utils.DSSDatasetSelectionBuilder, dict], optional) – Optional builder or dict defining the settings of the extract from the dataset (defaults to None). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
dataset_name (str, optional) – Name of the dataset to split on. If None (default), uses the main dataset used to create the visual analysis
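For example, a minimal sketch reusing the settings handle from the examples above to switch the task to a random 80/20 split:
split_params = settings.get_split_params()
split_params.set_split_random(train_ratio=0.8)
settings.save()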
- set_split_kfold(n_folds=5, selection=None, dataset_name=None)#
Sets the train/test split mode to k-fold splitting over an extract from a single dataset
- Parameters:
n_folds (int) – number of folds. Must be greater than 0 (defaults to 5)
selection (Union[DSSDatasetSelectionBuilder, dict], optional) – Optional builder or dict defining the settings of the extract from the dataset (defaults to None). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
dataset_name (str, optional) – Name of the dataset to split on. If None (default), uses the main dataset used to create the visual analysis
- set_split_explicit(train_selection, test_selection, dataset_name=None, test_dataset_name=None, train_filter=None, test_filter=None)#
Sets the train/test split to an explicit extract from one or two dataset(s)
- Parameters:
train_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the train dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
test_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the test dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
dataset_name (str, optional) – Name of the dataset to split on. If None (default), uses the main dataset used to create the visual analysis
test_dataset_name (str, optional) – Optional name of a second dataset to use for the test data extract. If None (default), both extracts are done from dataset_name
train_filter (Union[DSSFilterBuilder, dict], optional) – Builder or dict defining the settings of the filter for the train dataset. Defaults to None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSFilterBuilder.build()
test_filter (Union[DSSFilterBuilder, dict], optional) – Builder or dict defining the settings of the filter for the test dataset. Defaults to None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSFilterBuilder.build()
- set_time_ordering(feature_name, ascending=True)#
Enables time based ordering and sets the feature upon which to sort the train/test split and hyperparameter optimization data by time.
- Parameters:
feature_name (str) – The name of the feature column to use. This feature must be present in the output of the preparation steps of the analysis. When there are no preparation steps, it means this feature must be present in the analyzed dataset.
ascending (bool) – True (default) means the test set is expected to have larger time values than the train set
- unset_time_ordering()#
Disables time-based ordering for train/test split and hyperparameter optimization
- has_time_ordering()#
- Returns:
True if the split uses time-based ordering
- Return type:
bool
- get_time_ordering_variable()#
- Returns:
If enabled, the name of the ordering variable for time based ordering (the feature name). Returns None if time based ordering is not enabled.
- Return type:
Union[str, None]
- is_time_ordering_ascending()#
- Returns:
True if the time-ordering is set to sort in ascending order. Returns None if time based ordering is not enabled.
- Return type:
Union[bool, None]
Exploration of results#
- class dataikuapi.dss.ml.DSSTrainedPredictionModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)#
Object to read details of a trained prediction model
Important
Do not create this object directly, use DSSMLTask.get_trained_model_details() instead
- get_roc_curve_data()#
Gets the data used to plot the ROC curve for the model, if it exists.
- Returns:
A dictionary containing ROC curve data
- get_performance_metrics()#
Returns all performance metrics for this model.
For binary classification models, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.
To get access to the per-threshold values, use the following:
# Returns a list of tested threshold values
details.get_performance()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance()["perCutData"]["f1"]
# Both lists have the same length
If K-Fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation across the K cross-tested folds. For example, “auc” will be accompanied by “aucstd”.
- Returns:
a dict of performance metrics values
- Return type:
dict
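Usage example (a hedged sketch reusing the mltask handle and model_ids list from the train() example above; the listed metric keys are typical for binary classification and may differ for other task types):
details = mltask.get_trained_model_details(model_ids[0])
metrics = details.get_performance_metrics()
for name in ("auc", "f1", "precision", "recall"):  # assumed metric keys
    if name in metrics:
        print(name, metrics[name])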
- get_assertions_metrics()#
Retrieves assertions metrics computed for this trained model
- Returns:
an object representing assertion metrics
- Return type:
DSSMLAssertionsMetrics
- get_hyperparameter_search_points()#
Gets the list of points in the hyperparameter search space that have been tested.
Returns a list of dict. Each entry in the list represents a point.
- For each point, the dict contains at least:
“score”: the average value of the optimization metric over all the folds at this point
“params”: a dict of the parameters at this point. This dict has the same structure as the params of the best parameters
- get_preprocessing_settings()#
Gets the preprocessing settings that were used to train this model
- Return type:
dict
- get_modeling_settings()#
Gets the modeling (algorithms) settings that were used to train this model.
Note
The structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms).
- Return type:
dict
- get_actual_modeling_params()#
Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.
- Returns:
A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
- Return type:
dict
- get_trees()#
Gets the trees in the model (for tree-based models)
- Returns:
a DSSTreeSet object to interact with the trees
- Return type:
dataikuapi.dss.ml.DSSTreeSet
- get_coefficient_paths()#
Gets the coefficient paths for Lasso models
- Returns:
a DSSCoefficientPaths object to interact with the coefficient paths
- Return type:
dataikuapi.dss.ml.DSSCoefficientPaths
- get_scoring_jar_stream(model_class='model.Model', include_libs=False)#
Returns a stream of a scoring jar for this trained model.
This works provided that you have the license to do so and that the model is compatible with optimized scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Parameters:
model_class (str) – fully-qualified class name, e.g. “com.company.project.Model”
include_libs (bool) – if True, also packs the required dependencies; if False, runtime will require the scoring libs given by DSSClient.scoring_libs()
- Returns:
a jar file, as a stream
- Return type:
file-like
- get_scoring_pmml_stream()#
Returns a stream of a scoring PMML for this trained model.
This works provided that you have the license to do so and that the model is compatible with PMML scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns:
a PMML file, as a stream
- Return type:
file-like
- get_scoring_python_stream()#
Returns a stream of a zip file containing the required data to use this trained model in external python code.
See: https://doc.dataiku.com/dss/latest/python-api/ml.html
This works provided that you have the license to do so and that the model is compatible with Python scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns:
an archive file, as a stream
- Return type:
file-like
- get_scoring_python(filename)#
Downloads a zip file containing the required data to use this trained model in external python code.
See: https://doc.dataiku.com/dss/latest/python-api/ml.html
This works provided that you have the license to do so and that the model is compatible with Python scoring.
- Parameters:
filename (str) – filename of the resulting downloaded file
- get_scoring_mlflow_stream()#
Returns a stream of a zip containing this trained model using the MLflow Model format.
This works provided that you have the license to do so and that the model is compatible with MLflow scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns:
an archive file, as a stream
- Return type:
file-like
- get_scoring_mlflow(filename)#
Downloads a zip containing data for this trained model, using the MLflow Model format.
This works provided that you have the license to do so and that the model is compatible with MLflow scoring.
- Parameters:
filename (str) – filename to the resulting MLflow Model zip
- export_to_snowflake_function(connection_name, function_name, wait=True)#
Exports the model to a Snowflake function. Only works for Saved Model Versions.
- Parameters:
connection_name – Snowflake connection to use
function_name – Name of the function to create
wait – a flag to wait for the operation to complete (defaults to True)
- Returns:
None if wait is True, else a future
- export_to_databricks_registry(connection_name, use_unity_catalog, model_name, experiment_name, wait=True)#
Exports the model as a version of a Registered Model of a Databricks Registry. To do so, the model is exported to the MLflow format, then logged in a run of an experiment, and finally registered in the selected registry.
- Parameters:
connection_name – Databricks Model Deployment Infrastructure connection to use
use_unity_catalog – whether to export the model to the Databricks Unity Catalog or to the Databricks Workspace registry
model_name – name of the model to add a version to. Restrictions apply to possible names; please refer to the Databricks documentation. The model will be created if needed.
experiment_name – name of the experiment to use. The experiment will be created if needed.
wait – a flag to wait for the operation to complete (defaults to True)
- Returns:
dict if wait is True, else a future
- compute_shapley_feature_importance()#
Launches computation of Shapley feature importance for this trained model
- Returns:
A future for the computation task
- Return type:
- compute_subpopulation_analyses(split_by, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)#
Launch computation of Subpopulation analyses for this trained model.
- Parameters:
split_by (list|str) – column(s) on which subpopulation analyses are to be computed (one analysis per column)
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training. (-1 means ‘all cores’)
debug_mode (bool) – if True, output all logs (slower)
- Returns:
if wait is True, an object containing the Subpopulation analyses, else a future to wait on the result
- Return type:
Union[
dataikuapi.dss.ml.DSSSubpopulationAnalyses
,dataikuapi.dss.future.DSSFuture
]
- get_subpopulation_analyses()#
Retrieve all subpopulation analyses computed for this trained model
- Returns:
The subpopulation analyses
- Return type:
dataikuapi.dss.ml.DSSSubpopulationAnalyses
- compute_partial_dependencies(features, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)#
Launch computation of Partial dependencies for this trained model.
- Parameters:
features (list|str) – feature(s) on which partial dependencies are to be computed
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training. (-1 means ‘all cores’)
debug_mode (bool) – if True, output all logs (slower)
- Returns:
if wait is True, an object containing the Partial dependencies, else a future to wait on the result
- Return type:
Union[
dataikuapi.dss.ml.DSSPartialDependencies
,dataikuapi.dss.future.DSSFuture
]
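For example, a hedged sketch (column and feature names are assumptions) reusing the details handle from the get_performance_metrics() example above to compute both analyses synchronously:
subpop = details.compute_subpopulation_analyses(split_by=["gender"], wait=True, sample_size=500)
pdp = details.compute_partial_dependencies(features=["age"], wait=True)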
- get_partial_dependencies()#
Retrieve all partial dependencies computed for this trained model
- Returns:
The partial dependencies
- Return type:
dataikuapi.dss.ml.DSSPartialDependencies
- download_documentation_stream(export_id)#
Download a model documentation, as a binary stream.
Warning: this stream will monopolize the DSSClient until closed.
- Parameters:
export_id – the id of the generated model documentation returned as the result of the future
- Returns:
A DSSFuture representing the model document generation process
- download_documentation_to_file(export_id, path)#
Download a model documentation into the given output file.
- Parameters:
export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation
- Returns:
None
- property full_id#
- generate_documentation(folder_id=None, path=None)#
Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.
- Parameters:
folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder
- Returns:
A DSSFuture representing the model document generation process
- generate_documentation_from_custom_template(fp)#
Start the model document generation from a docx template (as a file object).
- Parameters:
fp (object) – A file-like object pointing to a template docx file
- Returns:
A DSSFuture representing the model document generation process
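Usage example (a hedged sketch; the key under which the future’s result exposes the export id is an assumption, so inspect the result dict):
future = details.generate_documentation()   # default template
result = future.wait_for_result()
export_id = result["exportId"]              # assumed key name in the future's result
details.download_documentation_to_file(export_id, "/tmp/model_doc.docx")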
- get_diagnostics()#
Retrieves diagnostics computed for this trained model
- Returns:
list of diagnostics
- Return type:
list of type dataikuapi.dss.ml.DSSMLDiagnostic
- get_origin_analysis_trained_model()#
Fetches details about the analysis model from which this model was exported. Returns None if the deployed trained model does not have an origin analysis trained model.
- Return type:
DSSTrainedModelDetails | None
- get_raw()#
Gets the raw dictionary of trained model details
- get_raw_snippet()#
Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.
- get_train_info()#
Returns various information about the training process (size of the train set, quick description, timing information)
- Return type:
dict
- get_user_meta()#
Gets the user-accessible metadata (name, description, cluster labels, classification threshold).
Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta().
- save_user_meta()#
- class dataikuapi.dss.ml.DSSTrainedClusteringModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)#
Object to read details of a trained clustering model
Important
Do not create this class directly, use DSSMLTask.get_trained_model_details()
- get_raw()#
Gets the raw dictionary of trained model details
- Returns:
A dictionary containing the trained model details
- Return type:
dict
- get_train_info()#
Gets various information about the training process.
This includes information such as the size of the train set, a quick description, and timing information.
- Returns:
A dictionary containing the models training information
- Return type:
dict
- get_facts()#
Gets the ‘cluster facts’ data.
The cluster facts data is the structure behind the screen showing statements such as “for cluster X, average of Y is Z times higher than average”.
- Returns:
The clustering facts data
- Return type:
DSSClustersFacts
- get_performance_metrics()#
Returns all performance metrics for this clustering model.
- Returns:
A dict of performance metrics values
- Return type:
dict
- download_documentation_stream(export_id)#
Download a model documentation, as a binary stream.
Warning: this stream will monopolize the DSSClient until closed.
- Parameters:
export_id – the id of the generated model documentation returned as the result of the future
- Returns:
A DSSFuture representing the model document generation process
- download_documentation_to_file(export_id, path)#
Download a model documentation into the given output file.
- Parameters:
export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation
- Returns:
None
- property full_id#
- generate_documentation(folder_id=None, path=None)#
Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.
- Parameters:
folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder
- Returns:
A DSSFuture representing the model document generation process
- generate_documentation_from_custom_template(fp)#
Start the model document generation from a docx template (as a file object).
- Parameters:
fp (object) – A file-like object pointing to a template docx file
- Returns:
A DSSFuture representing the model document generation process
- get_diagnostics()#
Retrieves diagnostics computed for this trained model
- Returns:
list of diagnostics
- Return type:
list of type dataikuapi.dss.ml.DSSMLDiagnostic
- get_origin_analysis_trained_model()#
Fetches details about the analysis model from which this model was exported. Returns None if the deployed trained model does not have an origin analysis trained model.
- Return type:
DSSTrainedModelDetails | None
- get_preprocessing_settings()#
Gets the preprocessing settings that were used to train this model
- Returns:
The model preprocessing settings
- Return type:
dict
- get_raw_snippet()#
Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.
- get_user_meta()#
Gets the user-accessible metadata (name, description, cluster labels, classification threshold).
Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta().
- save_user_meta()#
- get_modeling_settings()#
Gets the modeling (algorithms) settings that were used to train this model.
Note
The structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms).
- Returns:
The model modeling settings
- Return type:
dict
- get_actual_modeling_params()#
Gets the actual / resolved parameters that were used to train this model.
- Returns:
A dictionary, which contains at least a “resolved” key
- Return type:
dict
- get_scatter_plots()#
Gets the cluster scatter plot data
- Returns:
a DSSScatterPlots object to interact with the scatter plots
- Return type:
dataikuapi.dss.ml.DSSScatterPlots
Saved models#
- class dataikuapi.dss.savedmodel.DSSSavedModel(client, project_key, sm_id)#
Handle to interact with a saved model on the DSS instance.
Important
Do not create this class directly, instead use dataikuapi.dss.DSSProject.get_saved_model()
- Parameters:
client (dataikuapi.dssclient.DSSClient) – an api client to connect to the DSS backend
project_key (str) – identifier of the project containing the model
sm_id (str) – identifier of the saved model
- property id#
Returns the identifier of the saved model
- Return type:
str
- get_settings()#
Returns the settings of this saved model.
- Returns:
settings of this saved model
- Return type:
- list_versions()#
Gets the versions of this saved model.
This returns a list of dicts, one per version. Each dict contains at least an “id” key, which can be passed to get_metric_values(), get_version_details() and set_active_version().
- Returns:
The list of the versions
- Return type:
list[dict]
- get_active_version()#
Gets the active version of this saved model.
The returned dict contains at least an “id” parameter, which can be passed to get_metric_values(), get_version_details() and set_active_version().
- Returns:
A dict representing the active version or None if no version is active.
- Return type:
Union[dict, None]
- get_version_details(version_id)#
Gets details for a version of a saved model
- Parameters:
version_id (str) – identifier of the version, as returned by list_versions()
- Returns:
details of this trained model
- Return type:
- set_active_version(version_id)#
Sets a particular version of the saved model as the active one.
- Parameters:
version_id (str) – Identifier of the version, as returned by
list_versions()
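Example (illustrative sketch): list the versions of a saved model and activate one. "MY_SAVED_MODEL_ID" is a placeholder identifier.
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
sm = project.get_saved_model("MY_SAVED_MODEL_ID")

versions = sm.list_versions()
for version in versions:
    print(version["id"])

# promote the first listed version (purely illustrative)
sm.set_active_version(versions[0]["id"])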
- delete_versions(versions, remove_intermediate=True)#
Deletes version(s) of the saved model.
- Parameters:
versions (list[str]) – list of versions to delete
remove_intermediate (bool) – If True, also removes intermediate versions. In the case of a partitioned model, an intermediate version is created every time a partition has finished training. (defaults to True)
- get_origin_ml_task()#
Fetches the last ML task that has been exported to this saved model.
- Returns:
origin ML task or None if the saved model does not have an origin ml task
- Return type:
Union[
dataikuapi.dss.ml.DSSMLTask
, None]
- import_mlflow_version_from_path(version_id, path, code_env_name='INHERIT', container_exec_config_name='NONE', set_active=True, binary_classification_threshold=0.5)#
Creates a new version for this saved model from a path containing an MLFlow model.
Important
Requires the saved model to have been created using
dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model()
.- Parameters:
version_id (str) – identifier of the version, as returned by
list_versions()
path (str) – absolute path on the local filesystem - must be a folder, and must contain a MLFlow model
code_env_name (str) –
Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks.
If value is “INHERIT”, the default active code env of the project will be used.
(defaults to INHERIT)
container_exec_config_name (str) –
Name of the containerized execution configuration to use for reading the metadata of the model
If value is “INHERIT”, the container execution configuration of the project will be used.
If value is “NONE”, local execution will be used (no container)
(defaults to NONE)
set_active (bool) – sets this new version as the active version of the saved model (defaults to True)
binary_classification_threshold (float) – for binary classification, defines the actual threshold for the imported version (defaults to 0.5)
- Returns:
external model version handler in order to interact with the new MLFlow model version
- Return type:
dataikuapi.dss.savedmodel.ExternalModelVersionHandler
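Example (illustrative sketch): import a local MLflow model as a new version. Here, sm is assumed to be a saved model created with create_mlflow_pyfunc_model(), and the path is a placeholder.
mlflow_version = sm.import_mlflow_version_from_path(
    "v1",
    "/path/to/exported/mlflow_model",
    code_env_name="INHERIT",
)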
- import_mlflow_version_from_managed_folder(version_id, managed_folder, path, code_env_name='INHERIT', container_exec_config_name='INHERIT', set_active=True, binary_classification_threshold=0.5)#
Creates a new version for this saved model from a managed folder.
Important
Requires the saved model to have been created using
dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model()
.- Parameters:
version_id (str) – identifier of the version, as returned by
list_versions()
managed_folder (
dataikuapi.dss.managedfolder.DSSManagedFolder
or str) – managed folder, or identifier of the managed folder
path (str) – path of the MLflow folder in the managed folder
code_env_name (str) –
Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks.
If value is “INHERIT”, the default active code env of the project will be used.
(defaults to INHERIT)
container_exec_config_name (str) –
Name of the containerized execution configuration to use for reading the metadata of the model
If value is “INHERIT”, the container execution configuration of the project will be used.
If value is “NONE”, local execution will be used (no container)
(defaults to INHERIT)
set_active (bool) – sets this new version as the active version of the saved model (defaults to True)
binary_classification_threshold (float) – for binary classification, defines the actual threshold for the imported version (defaults to 0.5)
- Returns:
external model version handler in order to interact with the new MLFlow model version
- Return type:
dataikuapi.dss.savedmodel.ExternalModelVersionHandler
- import_mlflow_version_from_databricks(version_id, connection_name, use_unity_catalog, model_name, model_version, code_env_name='INHERIT', container_exec_config_name='INHERIT', set_active=True, binary_classification_threshold=0.5)#
Creates a new version for this saved model from a model registered in Databricks (model registry or Unity Catalog).
- create_external_model_version(version_id, configuration, target_column_name=None, class_labels=None, set_active=True, binary_classification_threshold=0.5, input_dataset=None, selection=None, use_optimal_threshold=True, skip_expensive_reports=True, features_list=None, container_exec_config_name='NONE', input_format='GUESS', output_format='GUESS', evaluate=True)#
Creates a new version of an external model.
Important
Requires the saved model to have been created using
dataikuapi.dss.project.DSSProject.create_external_model()
.- Parameters:
version_id (str) – Identifier of the version, as returned by
list_versions()
configuration (dict) –
A dictionary containing the desired saved model version configuration.
For SageMaker, syntax is:
configuration = { "protocol": "sagemaker", "endpoint_name": "<endpoint-name>" }
For AzureML, syntax is:
configuration = { "protocol": "azure-ml", "endpoint_name": "<endpoint-name>" }
For Vertex AI, syntax is:
configuration = { "protocol": "vertex-ai", "endpoint_id": "<endpoint-id>" }
For Databricks, syntax is:
configuration = { "protocol": "databricks", "endpointName": "<endpoint-id>" }
target_column_name (str) – Name of the target column. Mandatory if model performance will be evaluated
class_labels (list or None) – List of strings, ordered class labels. Mandatory for evaluation of classification models
set_active (bool) – (optional) Sets this new version as the active version of the saved model (defaults to True)
binary_classification_threshold (float) – (optional) For binary classification, defines the actual threshold for the imported version (defaults to 0.5). Overwritten during evaluation if an evaluation dataset is specified and use_optimal_threshold is True
input_dataset (str or
dataikuapi.dss.dataset.DSSDataset
ordataiku.Dataset
) – (mandatory if evaluate=True, input_format=GUESS, output_format=GUESS, or features_list is None) Dataset used to infer the feature names and types (if features_list is not set), evaluate the model, populate interpretation tabs, and guess input/output formats (if input_format=GUESS or output_format=GUESS).
selection (dict or
DSSDatasetSelectionBuilder
or None) – (optional) Sampling parameter for input_dataset during evaluation.
Example 1: head 100 lines
DSSDatasetSelectionBuilder().with_head_sampling(100)
Example 2: random 500 lines
DSSDatasetSelectionBuilder().with_random_fixed_nb_sampling(500)
Example 3: head 100 lines
{"samplingMethod": "HEAD_SEQUENTIAL", "maxRecords": 100}
Defaults to head 100 lines
use_optimal_threshold (bool) – (optional) If True, sets as the threshold for this model version the threshold computed during evaluation according to the metric set in the saved model settings (i.e.
prediction_metrics_settings['thresholdOptimizationMetric']
)
skip_expensive_reports (bool) – (optional) Skip computation of expensive/slow reports (e.g. feature importance).
features_list (list of
{"name": "feature_name", "type": "feature_type"}
or None) – (optional) List of features, in JSON. Used if input_dataset is not defined
container_exec_config_name (str) –
(optional) name of the containerized execution configuration to use for running the evaluation process.
If value is “INHERIT”, the container execution configuration of the project will be used.
If value is “NONE”, local execution will be used (no container)
input_format (str) –
(optional) Input format to use when querying the underlying endpoint. For the ‘azure-ml’ and ‘sagemaker’ protocols, this option must be set if input_dataset is not set. Supported values are:
- For all protocols:
GUESS (default): Guess the input format by cycling through supported input formats and making requests using data from input_dataset.
- For Amazon SageMaker:
INPUT_SAGEMAKER_CSV
INPUT_SAGEMAKER_JSON
INPUT_SAGEMAKER_JSON_EXTENDED
INPUT_SAGEMAKER_JSONLINES
INPUT_DEPLOY_ANYWHERE_ROW_ORIENTED_JSON
- For Vertex AI:
INPUT_VERTEX_DEFAULT
- For Azure Machine Learning:
INPUT_AZUREML_JSON_INPUTDATA
INPUT_AZUREML_JSON_WRITER
INPUT_AZUREML_JSON_INPUTDATA_DATA
INPUT_DEPLOY_ANYWHERE_ROW_ORIENTED_JSON
- For Databricks:
INPUT_RECORD_ORIENTED_JSON
INPUT_SPLIT_ORIENTED_JSON
INPUT_TF_INPUTS_JSON
INPUT_TF_INSTANCES_JSON
INPUT_DATABRICKS_CSV
output_format (str) –
(optional) Output format to use to parse the underlying endpoint’s response. For the ‘azure-ml’ and ‘sagemaker’ protocols, this option must be set if input_dataset is not set. Supported values are:
- For all protocols:
GUESS (default): Guess the output format by cycling through supported output formats and making requests using data from input_dataset.
- For Amazon SageMaker:
OUTPUT_SAGEMAKER_CSV
OUTPUT_SAGEMAKER_ARRAY_AS_STRING
OUTPUT_SAGEMAKER_JSON
OUTPUT_DEPLOY_ANYWHERE_JSON
- For Vertex AI:
OUTPUT_VERTEX_DEFAULT
- For Azure Machine Learning:
OUTPUT_AZUREML_JSON_OBJECT
OUTPUT_AZUREML_JSON_ARRAY
OUTPUT_DEPLOY_ANYWHERE_JSON
- For Databricks:
OUTPUT_DATABRICKS_JSON
evaluate (bool) – (optional) True (default) if this model should be evaluated using input_dataset, False to disable evaluation.
Example: create a SageMaker Saved Model and add an endpoint as a version, evaluated on a dataset:
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# create a SageMaker saved model, whose endpoints are hosted in region eu-west-1
sm = project.create_external_model("SageMaker External Model",
                                   "BINARY_CLASSIFICATION",
                                   {"protocol": "sagemaker", "region": "eu-west-1"})

# configuration to add endpoint
configuration = {
    "protocol": "sagemaker",
    "endpoint_name": "titanic-survived-endpoint"
}
smv = sm.create_external_model_version("v0",
                                       configuration,
                                       target_column_name="Survived",
                                       class_labels=["0", "1"],
                                       input_dataset="evaluation_dataset")
A dataset named “evaluation_dataset” must exist in the current project. Its schema and content should match the endpoint’s expectations. Depending on how the model deployed on the endpoint was created, it may require a specific schema, reject extra columns, fail on missing features, etc.
Example: create a Vertex AI Saved Model and add an endpoint as a version, without evaluating it:
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# create a Vertex AI saved model, whose endpoints are hosted in region europe-west1
sm = project.create_external_model("Vertex AI Proxy Model",
                                   "BINARY_CLASSIFICATION",
                                   {"protocol": "vertex-ai", "region": "europe-west1"})

configuration = {
    "protocol": "vertex-ai",
    "project_id": "my-project",
    "endpoint_id": "123456789012345678"
}
smv = sm.create_external_model_version("v1",
                                       configuration,
                                       target_column_name="Survived",
                                       class_labels=["0", "1"],
                                       input_dataset="titanic",
                                       evaluate=False)  # do not evaluate this version
A dataset named “titanic” must exist in the current project. It will be used to infer the schema of the data to submit to the endpoint. As no evaluation is performed, the interpretation tabs of this model version will be mostly empty, but the model can still be used to score datasets. It can also be evaluated later on a dataset by an Evaluation Recipe.
Example: create an AzureML Saved Model
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# create an Azure ML saved model. No region is specified, as this notion does not exist for Azure ML
sm = project.create_external_model("Azure ML Proxy Model",
                                   "BINARY_CLASSIFICATION",
                                   {"protocol": "azure-ml"})

configuration = {
    "protocol": "azure-ml",
    "subscription_id": "<subscription-id>",
    "resource_group": "<your.resource.group-rg>",
    "workspace": "<your-workspace>",
    "endpoint_name": "<endpoint-name>"
}
features_list = [{'name': 'Pclass', 'type': 'bigint'},
                 {'name': 'Age', 'type': 'double'},
                 {'name': 'SibSp', 'type': 'bigint'},
                 {'name': 'Parch', 'type': 'bigint'},
                 {'name': 'Fare', 'type': 'double'}]
smv = sm.create_external_model_version("20230324-in-prod",
                                       configuration,
                                       target_column_name="Survived",
                                       class_labels=["0", "1"],
                                       features_list=features_list)
Example: minimalistic creation of a Vertex AI binary classification model
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

sm = project.create_external_model("Raw Vertex AI Proxy Model",
                                   "BINARY_CLASSIFICATION",
                                   {"protocol": "vertex-ai", "region": "europe-west1"})
configuration = {
    "protocol": "vertex-ai",
    "project_id": "my-project",
    "endpoint_id": "123456789012345678"
}
smv = sm.create_external_model_version("legacy-model", configuration, class_labels=["0", "1"])
This model will have empty interpretation tabs and cannot be evaluated later by an Evaluation Recipe, as its target is not defined, but it can be scored.
Example: create a Databricks Saved Model
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

sm = project.create_external_model("Databricks External Model",
                                   "BINARY_CLASSIFICATION",
                                   {"protocol": "databricks", "connection": "db"})
smv = sm.create_external_model_version("vX",
                                       {"protocol": "databricks", "endpointName": "<endpoint-name>"},
                                       target_column_name="Survived",
                                       class_labels=["0", "1"],
                                       input_dataset="train_titanic_prepared")
- get_external_model_version_handler(version_id)#
Returns a handler to interact with an external model version (MLflow or Proxy model)
- Parameters:
version_id (str) – identifier of the version, as returned by
list_versions()
- Returns:
external model version handler
- Return type:
dataikuapi.dss.savedmodel.ExternalModelVersionHandler
- get_metric_values(version_id)#
Gets the values of the metrics on the specified version of this saved model
- Parameters:
version_id (str) – identifier of the version, as returned by
list_versions()
- Returns:
a list of metric objects and their value
- Return type:
list
- get_zone()#
Gets the flow zone of this saved model
- Returns:
the saved model’s flow zone
- Return type:
dataikuapi.dss.flow.DSSFlowZone
- move_to_zone(zone)#
Moves this object to a flow zone
- Parameters:
zone (
dataikuapi.dss.flow.DSSFlowZone
) – flow zone where the object should be moved
- share_to_zone(zone)#
Shares this object to a flow zone
- Parameters:
zone (
dataikuapi.dss.flow.DSSFlowZone
) – flow zone where the object should be shared
- unshare_from_zone(zone)#
Unshares this object from a flow zone
- Parameters:
zone (
dataikuapi.dss.flow.DSSFlowZone
) – flow zone from which the object should be unshared
- get_usages()#
Gets the recipes referencing this model
- Returns:
a list of usages
- Return type:
list
- get_object_discussions()#
Gets a handle to manage discussions on the saved model
- Returns:
the handle to manage discussions
- Return type:
dataikuapi.dss.discussion.DSSObjectDiscussions
- delete()#
Deletes the saved model
- class dataikuapi.dss.savedmodel.DSSSavedModelSettings(saved_model, settings)#
Handle on the settings of a saved model.
Important
Do not create this class directly, instead use
dataikuapi.dss.savedmodel.DSSSavedModel.get_settings()
- Parameters:
saved_model (
dataikuapi.dss.savedmodel.DSSSavedModel
) – the saved model object
settings (dict) – the settings of the saved model
- get_raw()#
Returns the raw settings of the saved model
- Returns:
the raw settings of the saved model
- Return type:
dict
- property prediction_metrics_settings#
Returns the metrics-related settings
- Return type:
dict
- save()#
Saves the settings of this saved model
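Example (illustrative sketch): read and persist the settings of a saved model, assuming sm is an existing DSSSavedModel handle.
settings = sm.get_settings()
print(settings.get_raw().keys())              # inspect the raw settings structure
print(settings.prediction_metrics_settings)   # metrics-related settings (dict)
settings.save()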
- class dataiku.core.saved_model.SavedModelVersionMetrics(metrics, version_id)#
Handle to the metrics of a version of a saved model
- get_performance_values()#
Retrieve the metrics as a dict.
- Return type:
dict
- get_computed()#
Get the underlying metrics object.
- Return type:
dataiku.core.metrics.ComputedMetrics
MLflow models#
- class dataikuapi.dss.savedmodel.ExternalModelVersionHandler(saved_model, version_id)#
Handler to interact with an External model version (MLflow import or Proxy model).
Important
Do not create this class directly, instead use
dataikuapi.dss.savedmodel.DSSSavedModel.get_external_model_version_handler()
- Parameters:
saved_model (
dataikuapi.dss.savedmodel.DSSSavedModel
) – the saved model object
version_id (str) – identifier of the version, as returned by
dataikuapi.dss.savedmodel.DSSSavedModel.list_versions()
- get_settings()#
Returns the settings of the MLFlow model version
- Returns:
settings of the MLFlow model version
- Return type:
dataikuapi.dss.savedmodel.MLFlowVersionSettings
- set_core_metadata(target_column_name, class_labels=None, get_features_from_dataset=None, features_list=None, container_exec_config_name='NONE')#
Sets metadata for this MLFlow model version
In addition to
target_column_name
, one ofget_features_from_dataset
orfeatures_list
must be passed in order to be able to evaluate performance- Parameters:
target_column_name (str) – name of the target column. Mandatory in order to be able to evaluate performance
class_labels (list or None) – List of strings, ordered class labels. Mandatory in order to be able to evaluate performance on classification models
get_features_from_dataset (str or None) – name of a dataset to get feature names from
features_list (list or None) – list of
{"name": "feature_name", "type": "feature_type"}
container_exec_config_name (str) –
name of the containerized execution configuration to use for running the evaluation process.
If value is “INHERIT”, the container execution configuration of the project will be used.
If value is “NONE”, local execution will be used (no container)
(defaults to NONE)
- evaluate(dataset_ref, container_exec_config_name='INHERIT', selection=None, use_optimal_threshold=True, skip_expensive_reports=True)#
Evaluates the performance of this model version on a particular dataset. After calling this, the “result screens” of the MLFlow model version will be available (confusion matrix, error distribution, performance metrics, …) and more information will be available when calling:
dataikuapi.dss.savedmodel.DSSSavedModel.get_version_details()
Evaluation is available only for models having BINARY_CLASSIFICATION, MULTICLASS or REGRESSION as prediction type. See
DSSProject.create_mlflow_pyfunc_model()
.Important
set_core_metadata()
must be called before you can evaluate a dataset- Parameters:
dataset_ref (str or
dataikuapi.dss.dataset.DSSDataset
ordataiku.Dataset
) – Evaluation dataset to use
container_exec_config_name (str) –
Name of the containerized execution configuration to use for running the evaluation process.
If value is “INHERIT”, the container execution configuration of the project will be used.
If value is “NONE”, local execution will be used (no container)
(defaults to INHERIT)
selection (dict or
DSSDatasetSelectionBuilder
or None) – Sampling parameter for the evaluation.
Example 1:
DSSDatasetSelectionBuilder().with_head_sampling(100)
Example 2:
{"samplingMethod": "HEAD_SEQUENTIAL", "maxRecords": 100}
(defaults to None)
use_optimal_threshold (bool) – Choose between optimized or actual threshold. Optimized threshold has been computed according to the metric set on the saved model setting (i.e.
prediction_metrics_settings['thresholdOptimizationMetric']
) (defaults to True)
skip_expensive_reports (boolean) – Skip expensive/slow reports (e.g. feature importance).
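Example (illustrative sketch): typical metadata-then-evaluate flow for an imported MLflow version. Here, sm is assumed to be a DSSSavedModel with an imported version "v1", and "eval_dataset" is a placeholder dataset name.
handler = sm.get_external_model_version_handler("v1")
handler.set_core_metadata(
    target_column_name="target",
    class_labels=["0", "1"],
    get_features_from_dataset="eval_dataset",
)
handler.evaluate("eval_dataset")  # populates the result screens of the version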
- class dataikuapi.dss.savedmodel.MLFlowVersionSettings(version_handler, data)#
Handle for the settings of an imported MLFlow model version.
Important
Do not create this class directly, instead use
dataikuapi.dss.savedmodel.ExternalModelVersionHandler.get_settings()
- Parameters:
version_handler (
dataikuapi.dss.savedmodel.ExternalModelVersionHandler
) – handler to interact with an external model version
data (dict) – raw settings of the imported MLFlow model version
- property raw#
- Returns:
The raw settings of the imported MLFlow model version
- Return type:
dict
- save()#
Saves the settings of this MLFlow model version
dataiku.Model#
- class dataiku.Model(lookup, project_key=None, ignore_flow=False)
Handle to interact with a saved model.
Note
This class is also available as
dataiku.Model
- Parameters:
lookup (string) – name or identifier of the saved model
project_key (string) – project key of the saved model, if it is not in the current project. (defaults to None, i.e. current project)
ignore_flow (boolean) – if True, create the handle regardless of whether the saved model is an input or output of the recipe (defaults to False)
- static list_models(project_key=None)
Retrieves the list of saved models of the given project.
- Parameters:
project_key (str) – key of the project from which to list models. (defaults to None, i.e. current project)
- Returns:
a list of the saved models of the project, as dict. Each dict contains at least the following fields:
id: identifier of the saved model
name: name of the saved model
type: type of saved model (CLUSTERING / PREDICTION)
backendType: backend type of the saved model (PY_MEMORY / KERAS / MLLIB / H2O / DEEP_HUB)
versionsCount: number of versions in the saved model
- Return type:
list[dict]
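Example (illustrative sketch): enumerate the saved models of the current project.
import dataiku

for model_info in dataiku.Model.list_models():
    print(model_info["id"], model_info["name"], model_info["type"])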
- get_info()
Gets the model information.
- Returns:
the model information. Fields are:
id : identifier of the saved model
projectKey : project key of the saved model
name : name of the saved model
type: type of saved model (CLUSTERING / PREDICTION)
- Return type:
dict
- get_id()
Gets the identifier of the model.
- Return type:
str
- get_name()
Gets the name of the model
- Return type:
str
- get_type()
Gets the type of the model.
- Returns:
the model type (PREDICTION / CLUSTERING)
- Return type:
str
- get_definition()
Gets the model definition.
- Return type:
dict
- list_versions()
Lists the model versions.
Note
The
versionId
field can be used to call theactivate_version()
method.- Returns:
Information about versions of the saved model, as a list of dict. Fields are:
versionId: identifier of the model version
active: whether this version is active or not
snippet: detailed dict containing version information
- Return type:
list[dict]
- activate_version(version_id)
Activates the given version of the model.
- Parameters:
version_id (str) – the identifier of the version to activate
- get_version_metrics(version_id)
Gets the training metrics of a given version of the model.
- Parameters:
version_id (str) – the identifier of the version from which to retrieve metrics
- Return type:
dataiku.core.saved_model.SavedModelVersionMetrics
- get_version_checks(version_id)
Gets the training checks of the given version of the model.
- Parameters:
version_id (str) – the identifier of the version from which to retrieve checks
- Return type:
- save_external_check_values(values_dict, version_id)
Saves checks on the model; the checks are saved with the type “external”.
- Parameters:
values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names
version_id (str) – the identifier of the version for which checks should be saved
- Return type:
dict
- get_predictor(version_id=None)
Returns a
Predictor
for the given version of the model.Note
This predictor can then be used to preprocess and make predictions on a dataframe.
- Parameters:
version_id (str) – the identifier of the version from which to build the predictor (defaults to None, current active version)
- Returns:
The predictor built from the given version of this model
- Return type:
dataiku.core.saved_model.Predictor
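Example (illustrative sketch): score a dataframe with the active version of a saved model. "my_model" and "scoring_input" are placeholder names.
import dataiku

model = dataiku.Model("my_model")
predictor = model.get_predictor()   # current active version

df = dataiku.Dataset("scoring_input").get_dataframe()
scored = predictor.predict(df, with_input_cols=True, with_probas=True)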
- create_finetuned_llm_version(connection_name, quantization=None, set_active=True)
Creates a new fine-tuned LLM version, using a context manager (experimental)
Upon exit of the context manager, the new model version is made available with the content of the working directory. The model weights must use the safetensors format. This model will be loaded at inference time with trust_remote_code=False.
Simple usage:
with saved_model.create_finetuned_llm_version("MyLocalHuggingfaceConnection") as finetuned_llm_version:
    # write model files to finetuned_llm_version.working_directory
    ...
# the new version is now available
- Parameters:
connection_name (str) – name of the connection to link this version
quantization (str) – quantization mode, must be one of [None, “Q_4BIT”, “Q_8BIT”] (default: None)
set_active (bool) – if True, set the new version as active for this saved model (default: True)
- Returns:
yields a
FinetunedLLMVersionTrainingParameters
object
- class dataiku.core.saved_model.Predictor(params, preprocessing, features, clf)
Object allowing you to preprocess a dataframe and make predictions on it.
- get_features()
Returns the feature names generated by this predictor’s preprocessing
- predict(df, with_input_cols=False, with_prediction=True, with_probas=True, with_conditional_outputs=False, with_proba_percentile=False, with_explanations=False, explanation_method='ICE', n_explanations=3, n_explanations_mc_steps=100, **kwargs)
Predict a dataframe. The results are returned as a dataframe with columns corresponding to the various prediction information.
- Parameters:
with_input_cols – whether the input columns should also be present in the output
with_prediction – whether the prediction column should be present
with_probas – whether the probability columns should be present
with_conditional_outputs – whether the conditional outputs for this model should be present (binary classif)
with_proba_percentile – whether the percentile of the probability should be present (binary classif)
with_explanations – whether explanations should be computed for each prediction
explanation_method – method to compute the explanations
n_explanations – number of explanations to output for each prediction
n_explanations_mc_steps – number of Monte Carlo steps for SHAPLEY method (higher means more precise but slower)
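Example (illustrative sketch): request per-row explanations, assuming predictor was obtained via get_predictor() and df is a pandas DataFrame matching the model's input schema.
result = predictor.predict(
    df,
    with_explanations=True,
    explanation_method="ICE",   # or "SHAPLEY"
    n_explanations=3,
)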
- preformat(df)
Formats data originating from JSON (API node, interactive scoring) so that it is compatible with preprocess()
- preprocess(df)
Preprocesses a dataframe. The results are returned as a 2-dimensional numpy matrix (which may be sparse). The columns of this matrix correspond to the generated features, which can be listed with the get_features() method of this Predictor.
- get_preprocessing()
Algorithm details#
This section documents which algorithms are available, and some of the settings for them.
These algorithm names can be used for dataikuapi.dss.ml.DSSMLTaskSettings.get_algorithm_settings()
and dataikuapi.dss.ml.DSSMLTaskSettings.set_algorithm_enabled()
Note
This documentation does not cover all settings of all algorithms. To know which settings are
available for an algorithm, use mltask_settings.get_algorithm_settings('ALGORITHM_NAME')
and print the returned dictionary.
Generally speaking, when an algorithm setting is an array, the parameter can be grid-searched: all listed values will be tested as part of the hyperparameter optimization.
For more documentation of settings, please refer to the visual machine learning UI, which contains detailed documentation for all algorithm parameters.
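Example (illustrative sketch): enable an algorithm and inspect its settings, assuming mltask is an existing DSSMLTask handle.
settings = mltask.get_settings()
settings.set_algorithm_enabled("RANDOM_FOREST_CLASSIFICATION", True)

rf_settings = settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")
print(rf_settings)   # inspect which settings are available for this algorithm

settings.save()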
LOGISTIC_REGRESSION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:
{
"multi_class": SingleCategoryHyperparameterSettings, # accepted values: ['multinomial', 'ovr']
"penalty": CategoricalHyperparameterSettings, # possible values: ["l1", "l2"]
"C": NumericalHyperparameterSettings, # scaling: "LOGARITHMIC"
"n_jobs": 2
}
RANDOM_FOREST_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:
{
"n_estimators": NumericalHyperparameterSettings, # scaling: "LINEAR"
"min_samples_leaf": NumericalHyperparameterSettings, # scaling: "LINEAR"
"max_tree_depth": NumericalHyperparameterSettings, # scaling: "LINEAR"
"max_feature_prop": NumericalHyperparameterSettings, # scaling: "LINEAR"
"max_features": NumericalHyperparameterSettings, # scaling: "LINEAR"
"selection_mode": SingleCategoryHyperparameterSettings, # accepted_values=['auto', 'sqrt', 'log2', 'number', 'prop']
"n_jobs": 4
}
RANDOM_FOREST_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
Main parameters: same as RANDOM_FOREST_CLASSIFICATION
EXTRA_TREES#
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
RIDGE_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
LASSO_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
LEASTSQUARE_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
SVC_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
SVM_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
SGD_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
SGD_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
GBT_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
GBT_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
DECISION_TREE_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
DECISION_TREE_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
LIGHTGBM_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
LIGHTGBM_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
XGBOOST_CLASSIFICATION#
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
XGBOOST_REGRESSION#
Type: Prediction (regression)
Available on backend: PY_MEMORY
NEURAL_NETWORK#
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
KNN#
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
LARS#
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
MLLIB_LOGISTIC_REGRESSION#
Type: Prediction (binary or multiclass)
Available on backend: MLLIB
MLLIB_DECISION_TREE#
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_RANDOM_FOREST#
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_GBT#
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_LINEAR_REGRESSION#
Type: Prediction (regression)
Available on backend: MLLIB
MLLIB_NAIVE_BAYES#
Type: Prediction (all kinds)
Available on backend: MLLIB
Other#
SCIKIT_MODEL
MLLIB_CUSTOM