Recipes#
For usage information and examples, see Recipes
- class dataikuapi.dss.recipe.DSSRecipe(client, project_key, recipe_name)#
A handle to an existing recipe on the DSS instance.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.get_recipe()
- property id#
Get the identifier of the recipe.
For recipes, the name is the identifier.
- Return type:
string
- property name#
Get the name of the recipe.
- Return type:
string
- compute_schema_updates()#
Computes which updates are required to the outputs of this recipe.
This method only computes which changes would be needed to make the schema of the recipe's outputs match the actual schema that the recipe will produce. To effectively apply these changes to the outputs, call
apply()
on the returned object.
Note
Not all recipe types can automatically compute the schema of their outputs. Notably, code recipes such as Python recipes can't. This method raises an exception in these cases.
Usage example:
required_updates = recipe.compute_schema_updates()
if required_updates.any_action_required():
    print("Some schemas will be updated")

# Note that you can call apply even if no changes are required. This will be a no-op
required_updates.apply()
- Returns:
an object containing the required updates
- Return type:
RequiredSchemaUpdates
- run(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)#
Start a new job to run this recipe and wait for it to complete.
Raises an exception if the job fails.
job = recipe.run()
print("Job %s done" % job.id)
- Parameters:
job_type (string) – job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
partitions (string) – if the outputs are partitioned, a partition spec. A spec is a comma-separated list of partition identifiers, and a partition identifier is a pipe-separated list of values for the partitioning dimensions
no_fail (boolean) – if True, does not raise if the job failed
wait (boolean) – if True, the method waits for the job completion. If False, the method returns immediately
- Returns:
a job handle corresponding to the recipe run
- Return type:
- delete()#
Delete the recipe.
- rename(new_name)#
Rename the recipe to the specified new name.
- Parameters:
new_name (str) – the new name of the recipe
- get_settings()#
Get the settings of the recipe, as a
DSSRecipeSettings
or one of its subclasses.
Some recipes have a dedicated class for the settings, with additional helpers to read and modify the settings.
Once you are done modifying the returned settings object, you can call
save()
on it in order to save the modifications to the DSS recipe.
- Return type:
DSSRecipeSettings
or a subclass
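For illustration, a minimal sketch of the usual read-modify-save cycle (assuming project is a DSSProject handle and “my_recipe” is an existing recipe name):
recipe = project.get_recipe("my_recipe")   # "my_recipe" is an assumed name
settings = recipe.get_settings()

# ... inspect or modify the settings object ...
print(settings.type)

settings.save()                            # persist the modifications back to DSS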
- get_definition_and_payload()#
Get the definition of the recipe.
Attention
Deprecated. Use
get_settings()
- Returns:
an object holding both the raw definition of the recipe (the type, which inputs and outputs, engine settings…) and the payload (SQL script, Python code, join definition,… depending on type)
- Return type:
DSSRecipeDefinitionAndPayload
- set_definition_and_payload(definition)#
Set the definition of the recipe.
Attention
Deprecated. Use
get_settings()
then DSSRecipeSettings.save()
Important
The definition parameter should come from a call to
get_definition()
- Parameters:
definition (object) – a recipe definition, as returned by
get_definition()
- get_status()#
Gets the status of this recipe.
The status of a recipe consists of messages from checks performed by DSS on the recipe, messages about engine availability for the recipe, messages about testing the recipe on the engine, etc.
- Returns:
an object to interact with the status
- Return type:
DSSRecipeStatus
- get_metadata()#
Get the metadata attached to this recipe.
The metadata contains the label, description, checklists, tags and custom metadata of the recipe.
- Returns:
the metadata as a dict, with fields:
label : label of the object (not defined for recipes)
description : description of the object (not defined for recipes)
checklists : checklists of the object, as a dict with a checklists field, which is a list of checklists, each a dict.
tags : list of tags, each a string
custom : custom metadata, as a dict with a kv field, which is a dict with any contents the user wishes
customFields : dict of custom field info (not defined for recipes)
- Return type:
dict
- set_metadata(metadata)#
Set the metadata on this recipe.
Important
You should only set a metadata object that has been retrieved using
get_metadata()
.
- Parameters:
metadata (dict) – the new state of the metadata for the recipe
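For illustration, a minimal sketch of adding a tag through the metadata (assuming recipe is a DSSRecipe handle; the tag name is illustrative):
metadata = recipe.get_metadata()
metadata["tags"].append("needs-review")    # illustrative tag name
recipe.set_metadata(metadata)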
- get_input_refs()#
- get_object_discussions()#
Get a handle to manage discussions on the recipe.
- Returns:
the handle to manage discussions
- Return type:
- get_continuous_activity()#
Get a handle on the associated continuous activity.
Note
Should only be used on continuous recipes.
- move_to_zone(zone)#
Move this object to a flow zone.
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object
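For illustration, a minimal sketch of moving a recipe to a flow zone (assuming project is a DSSProject handle; the zone identifier is illustrative):
zone = project.get_flow().get_zone("my_zone_id")   # illustrative zone id
recipe.move_to_zone(zone)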
- class dataikuapi.dss.recipe.DSSRecipeListItem(client, data)#
An item in a list of recipes.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.list_recipes()
- property name#
Get the name of the recipe.
- Return type:
string
- property id#
Get the identifier of the recipe.
For recipes, the name is the identifier.
- Return type:
string
- property type#
Get the type of the recipe.
- Returns:
a recipe type, for example ‘sync’ or ‘join’
- Return type:
string
- property tags#
- class dataikuapi.dss.recipe.DSSRecipeStatus(client, data)#
Status of a recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_status()
- get_selected_engine_details()#
Get the selected engine for this recipe.
This method will raise if there is no selected engine, whether it’s because the present recipe type has no notion of engine, or because DSS couldn’t find any viable engine for running the recipe.
- Returns:
a dict of the details of the selected engine. The type of engine is in a type field. Depending on the type, additional fields will give more details, like whether some aggregations are possible.
- Return type:
dict
- get_engines_details()#
Get details about all possible engines for this recipe.
This method will raise if there is no engine, whether it’s because the present recipe type has no notion of engine, or because DSS couldn’t find any viable engine for running the recipe.
- Returns:
a list of dict of the details of each possible engine. See
get_selected_engine_details()
for the fields of each dict.
- Return type:
list[dict]
- get_status_severity()#
Get the overall status of the recipe.
This is the final result of checking the different parts of the recipe, and depends on the recipe type. Examples of checks done include:
checking the validity of the formulas in computed columns or filters
checking if some of the input columns retrieved by joins overlap
checking against the SQL database if the generated SQL is valid
- Returns:
SUCCESS, WARNING, ERROR or INFO. None if the status has no message at all.
- Return type:
string
- get_status_messages(as_objects=False)#
Returns status messages for this recipe.
- Parameters:
as_objects (boolean) – if True, return a list of
dataikuapi.dss.utils.DSSInfoMessage
. If False, as a list of raw dicts.
- Returns:
if as_objects is True, a list of
dataikuapi.dss.utils.DSSInfoMessage
, otherwise a list of message information, each one a dict of:
severity : severity of the error in the message. Possible values are SUCCESS, INFO, WARNING, ERROR
isFatal : for ERROR severity, whether the error is considered fatal to the operation
code : a string with a well-known code documented in DSS doc
title : short message
message : the error message
details : a more detailed error description
- Return type:
list
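For illustration, a minimal sketch of inspecting a recipe's status (assuming recipe is a DSSRecipe handle):
status = recipe.get_status()
print("Overall severity: %s" % status.get_status_severity())

for message in status.get_status_messages():
    print("%s: %s" % (message["severity"], message["message"]))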
- class dataikuapi.dss.recipe.RequiredSchemaUpdates(recipe, data)#
Handle on a set of required updates to the schema of the outputs of a recipe.
Important
Do not instantiate directly, use
DSSRecipe.compute_schema_updates()
For example, changes can be new columns in the output of a Group recipe when new aggregates are activated in the recipe’s settings.
- any_action_required()#
Whether there are changes at all.
- Return type:
boolean
- apply()#
Apply the changes.
All the updates found to be required are applied, for each of the recipe’s outputs.
Settings#
- class dataikuapi.dss.recipe.DSSRecipeSettings(recipe, data)#
Settings of a recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- save()#
Save back the recipe in DSS.
- property type#
Get the type of the recipe.
- Returns:
a type, like ‘sync’, ‘python’ or ‘join’
- Return type:
string
- property str_payload#
The raw “payload” of the recipe.
This is exactly the data persisted on disk.
- Returns:
for code recipes, the payload will be the script of the recipe. For visual recipes, the payload is a JSON of settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
string
- property obj_payload#
The “payload” of the recipe, parsed from JSON.
Note
Do not use on code recipes, their payload isn’t JSON-encoded.
- Returns:
settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
dict
- property raw_params#
The non-payload settings of the recipe.
- Returns:
recipe type-specific settings that aren’t stored in the payload. Typically this comprises engine settings.
- Return type:
dict
- get_recipe_raw_definition()#
Get the recipe definition.
- Returns:
the part of the recipe’s settings that aren’t stored in the payload, as a dict. Notable fields are:
name and projectKey : identifiers of the recipe
type : type of the recipe
params : type-specific parameters of the recipe (on top of what is in the payload)
inputs : input roles to the recipe, as a dict of role name to role, where a role is a dict with an items field consisting of a list of one dict per input object. Each individual input has fields:
ref : a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
deps : for partitioned inputs, a list of partition dependencies mapping output dimensions to dimensions in this input. Each partition dependency is a dict.
outputs : output roles to the recipe, as a dict of role name to role, where a role is a dict with an items field consisting of a list of one dict per output object. Each individual output has fields:
ref : a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
appendMode : if True, the recipe should append into the output; if False, the recipe should overwrite the output when running
- Return type:
dict
- get_recipe_inputs()#
Get the inputs to this recipe.
- Return type:
dict
- get_recipe_outputs()#
Get the outputs of this recipe.
- Return type:
dict
- get_recipe_params()#
The non-payload settings of the recipe.
- Returns:
recipe type-specific settings that aren’t stored in the payload. Typically this comprises engine settings.
- Return type:
dict
- get_payload()#
The raw “payload” of the recipe.
This is exactly the data persisted on disk.
- Returns:
for code recipes, the payload will be the script of the recipe. For visual recipes, the payload is a JSON of settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
string
- get_json_payload()#
The “payload” of the recipe, parsed from JSON.
Note
Do not use on code recipes, their payload isn’t JSON-encoded.
- Returns:
settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
dict
- set_payload(payload)#
Set the payload of this recipe.
- Parameters:
payload (string) – the payload, as a string
- set_json_payload(payload)#
Set the payload of this recipe.
- Parameters:
payload (dict) – the payload, as a dict. Will be converted to JSON internally.
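For illustration, a minimal sketch of editing the payload of a visual recipe (assuming recipe is a handle to a visual recipe; the setting name is illustrative):
settings = recipe.get_settings()
payload = settings.obj_payload          # parsed payload of a visual recipe
payload["someSetting"] = True           # illustrative setting name
settings.save()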
- has_input(input_ref)#
Whether a ref is part of the recipe’s inputs.
- Parameters:
input_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
- Return type:
boolean
- has_output(output_ref)#
Whether a ref is part of the recipe’s outputs.
- Parameters:
output_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
- Return type:
boolean
- replace_input(current_input_ref, new_input_ref)#
Replaces an input of this recipe by another.
If the current_input_ref isn’t part of the recipe’s inputs, this method has no effect.
- Parameters:
current_input_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that is currently input to the recipe
new_input_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that current_input_ref should be replaced with.
- replace_output(current_output_ref, new_output_ref)#
Replaces an output of this recipe by another.
If the current_output_ref isn’t part of the recipe’s outputs, this method has no effect.
- Parameters:
current_output_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that is currently output to the recipe
new_output_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that current_output_ref should be replaced with.
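For illustration, a minimal sketch of swapping an input dataset (assuming recipe is a DSSRecipe handle; dataset names are illustrative):
settings = recipe.get_settings()
if settings.has_input("customers_raw"):                        # illustrative dataset names
    settings.replace_input("customers_raw", "customers_clean")
settings.save()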
- add_input(role, ref, partition_deps=None)#
Add an input to the recipe.
For most recipes, there is only one role, named “main”. A few recipes have additional roles, like scoring recipes which have a “model” role. Check the roles known to the recipe with
get_recipe_inputs()
.
- Parameters:
role (string) – name of the role of the recipe in which to add ref as input
ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id
partition_deps (list) – if ref points to a partitioned object, a list of partition dependencies, one per dimension in the partitioning scheme
- add_output(role, ref, append_mode=False)#
Add an output to the recipe.
For most recipes, there is only one role, named “main”. A few recipes have additional roles, like evaluation recipes which have a “metrics” role. Check the roles known to the recipe with
get_recipe_outputs()
.
- Parameters:
role (string) – name of the role of the recipe in which to add ref as output
ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id
append_mode (boolean) – if True, the recipe should append into the output; if False, the recipe should overwrite the output when running
- get_flat_input_refs()#
List all input refs of this recipe, regardless of the input role.
- Returns:
a list of refs, i.e. of dataset names or managed folder ids or saved model ids
- Return type:
list[string]
- get_flat_output_refs()#
List all output refs of this recipe, regardless of the output role.
- Returns:
a list of refs, i.e. of dataset names or managed folder ids or saved model ids
- Return type:
list[string]
- property custom_fields#
The custom fields of the object as a dict. Returns None if there are no custom fields
- property description#
The description of the object as a string
- property short_description#
The short description of the object as a string
- property tags#
The tags of the object, as a list of strings
- class dataikuapi.dss.recipe.DSSRecipeDefinitionAndPayload(recipe, data)#
Settings of a recipe.
Note
Deprecated. Alias to
DSSRecipeSettings
, use DSSRecipe.get_settings()
instead.
- class dataikuapi.dss.recipe.CodeRecipeSettings(recipe, data)#
Settings of a code recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- get_code()#
Get the code of the recipe.
- Return type:
string
- set_code(code)#
Update the code of the recipe.
- Parameters:
code (string) – the new code
- get_code_env_settings()#
Get the code env settings for this recipe.
- Returns:
settings to select the code env used by the recipe, as a dict of:
envMode : one of USE_BUILTIN_MODE, INHERIT (inherit from project settings and/or instance settings), EXPLICIT_ENV
envName : if envMode is EXPLICIT_ENV, the name of the code env to use
- Return type:
dict
- set_code_env(code_env=None, inherit=False, use_builtin=False)#
Set which code env this recipe uses.
Exactly one of code_env, inherit or use_builtin must be passed.
- Parameters:
code_env (string) – name of a code env
inherit (boolean) – if True, use the project’s default code env
use_builtin (boolean) – if True, use the builtin code env
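For illustration, a minimal sketch of updating the code of a code recipe (assuming recipe is a handle to a code recipe; the appended code and code env name are illustrative):
settings = recipe.get_settings()        # a CodeRecipeSettings for code recipes
code = settings.get_code()
settings.set_code(code + "\n# illustrative appended comment")
settings.set_code_env(code_env="my_code_env")   # illustrative code env name
settings.save()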
- class dataikuapi.dss.recipe.SyncRecipeSettings(recipe, data)#
Settings of a Sync recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.PrepareRecipeSettings(recipe, data)#
Settings of a Prepare recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- property raw_steps#
Get the list of the steps of this prepare recipe.
This method returns a reference to the list of steps, not a copy. Modifying the list then calling
DSSRecipeSettings.save()
commits the changes.
- Returns:
list of steps, each step as a dict. The precise settings for each step are not documented, but each dict has at least fields:
metaType : one of PROCESSOR or GROUP. If GROUP, there is a steps field with a sub-list of steps.
type : type of the step, for example FillEmptyWithValue or ColumnRenamer (there are many types of steps)
params : dict of the step’s own parameters. Each step type has its own parameters.
disabled : whether the step is disabled
name : label of the step
- Return type:
list[dict]
- add_processor_step(type, params)#
Add a step in the script.
- Parameters:
type (string) – type of the step, for example FillEmptyWithValue or ColumnRenamer (there are many types of steps)
params (dict) – dict of the step’s own parameters. Each step type has its own parameters.
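For illustration, a minimal sketch of appending a rename step (assuming recipe is a handle to a Prepare recipe; the parameter layout shown for the ColumnRenamer step is an assumption, check an existing recipe's raw_steps for the exact format):
settings = recipe.get_settings()        # a PrepareRecipeSettings
# assumed parameter layout for the ColumnRenamer step type
settings.add_processor_step("ColumnRenamer",
                            {"renamings": [{"from": "old_name", "to": "new_name"}]})
settings.save()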
- class dataikuapi.dss.recipe.SamplingRecipeSettings(recipe, data)#
Settings of a sampling recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.GroupingRecipeSettings(recipe, data)#
Settings of a grouping recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- clear_grouping_keys()#
Clear all grouping keys.
- add_grouping_key(column)#
Adds grouping on a column.
- Parameters:
column (string) – column to group on
- set_global_count_enabled(enabled)#
Activate computing the count of records per group.
- Parameters:
enabled (boolean) – True if the global count should be activated
- get_or_create_column_settings(column)#
Get a dict representing the aggregations to perform on a column.
If the column has no aggregation on it yet, the dict is created and added to the settings.
- Parameters:
column (string) – name of the column to aggregate on
- Returns:
the settings of the aggregations on a particular column, as a dict. The name of the column to perform aggregates on is in a column field, and the aggregates are toggled on or off with boolean fields.
- Return type:
dict
- set_column_aggregations(column, type=None, min=False, max=False, count=False, count_distinct=False, sum=False, concat=False, stddev=False, avg=False)#
Set the basic aggregations on a column.
Note
Not all aggregations may be possible. For example string-typed columns don’t have a mean or standard deviation, and some SQL databases can’t compute the exact standard deviation.
The method returns a reference to the settings of the column, not a copy. Modifying the dict returned by the method, then calling
DSSRecipeSettings.save()
will commit the changes.
Usage example:
# activate the concat aggregate on a column, and set optional parameters
# pertaining to concatenation
settings = recipe.get_settings()
column_settings = settings.set_column_aggregations("my_column_name", concat=True)
column_settings["concatDistinct"] = True
column_settings["concatSeparator"] = ', '
settings.save()
- Parameters:
column (string) – The column name
type (string) – The type of the column (as a DSS schema type name)
min (boolean) – whether the min aggregate is computed
max (boolean) – whether the max aggregate is computed
count (boolean) – whether the count aggregate is computed
count_distinct (boolean) – whether the count distinct aggregate is computed
sum (boolean) – whether the sum aggregate is computed
concat (boolean) – whether the concat aggregate is computed
avg (boolean) – whether the mean aggregate is computed
stddev (boolean) – whether the standard deviation aggregate is computed
- Returns:
the settings of the aggregations on the column, as a dict. The name of the column is in a column field.
- Return type:
dict
- class dataikuapi.dss.recipe.SortRecipeSettings(recipe, data)#
Settings of a Sort recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.TopNRecipeSettings(recipe, data)#
Settings of a TopN recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.DistinctRecipeSettings(recipe, data)#
Settings of a Distinct recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.PivotRecipeSettings(recipe, data)#
Settings of a Pivot recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.WindowRecipeSettings(recipe, data)#
Settings of a Window recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.JoinRecipeSettings(recipe, data)#
Settings of a join recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
In order to enable self-joins, join recipes are based on a concept of “virtual inputs”. Every join, computed pre-join column, pre-join filter, etc. is based on one virtual input, and each virtual input references an input of the recipe, by index.
For example, if a recipe has inputs A and B and declares two joins:
A->B
A->A (based on a computed column)
There are 3 virtual inputs:
0: points to recipe input 0 (i.e. dataset A)
1: points to recipe input 1 (i.e. dataset B)
2: points to recipe input 0 (i.e. dataset A) and includes the computed column
The first join is between virtual inputs 0 and 1
The second join is between virtual inputs 0 and 2
- property raw_virtual_inputs#
Get the list of virtual inputs.
This method returns a reference to the list of inputs, not a copy. Modifying the list then calling
DSSRecipeSettings.save()
commits the changes.
- Returns:
a list of virtual inputs, each one a dict. The field index holds the index of the dataset of this virtual input in the recipe’s list of inputs. Pre-filter, computed columns and column selection properties (if applicable) are defined in each virtual input.
- Return type:
list[dict]
- property raw_joins#
Get raw list of joins.
This method returns a reference to the list of joins, not a copy. Modifying the list then calling
DSSRecipeSettings.save()
commits the changes.
- Returns:
list of the join definitions, each as a dict. The table1 and table2 fields give the indices of the virtual inputs on the left side and right side respectively.
- Return type:
list[dict]
- add_virtual_input(input_dataset_index)#
Add a virtual input pointing to the specified input dataset of the recipe.
- Parameters:
input_dataset_index (int) – index of the dataset in the recipe's list of inputs
- add_pre_join_computed_column(virtual_input_index, computed_column)#
Add a computed column to a virtual input.
You can use
dataikuapi.dss.utils.DSSComputedColumn.formula()
to build the computed_column object.
- Parameters:
virtual_input_index (int) – index of the virtual input to add the computed column to
computed_column (dict) –
a computed column definition, as a dict of:
mode : type of expression used to define the computations. One of GREL or SQL.
name : name of the column generated
type : name of a DSS type for the computed column
expr : if mode is GREL, a formula in the DSS formula language. If mode is SQL, a SQL expression.
- add_join(join_type='LEFT', input1=0, input2=1)#
Add a join between two virtual inputs.
The join is initialized with no condition.
Use
add_condition_to_join()
on the return value to add a join condition (for example column equality) to the join.
- Returns:
the newly added join as a dict (see
raw_joins()
)
- Return type:
dict
- static add_condition_to_join(join, type='EQ', column1=None, column2=None)#
Add a condition to a join.
- Parameters:
join (dict) – definition of a join
type (string) – type of join condition. Possible values are EQ, LTE, LT, GTE, GT, NE, WITHIN_RANGE, K_NEAREST, K_NEAREST_INFERIOR, CONTAINS, STARTS_WITH
column1 (string) – name of left-side column
column2 (string) – name of right-side column
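For illustration, a minimal sketch of adding a join on column equality (assuming recipe is a handle to a Join recipe with two inputs, hence virtual inputs 0 and 1; column names are illustrative):
settings = recipe.get_settings()        # a JoinRecipeSettings
join = settings.add_join(join_type="LEFT", input1=0, input2=1)
settings.add_condition_to_join(join, type="EQ",
                               column1="customer_id", column2="customer_id")
settings.save()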
- add_post_join_computed_column(computed_column)#
Add a post-join computed column.
Use
dataikuapi.dss.utils.DSSComputedColumn
to build the computed_column object.
Note
The columns accessible to the expression of the computed column are those selected in the different joins, in their “output” form. For example, if virtual inputs 0 and 1 are joined, and column “bar” of the first input is selected with a prefix of “foo”, then the computed column can use “foobar” but not “bar”.
- Parameters:
computed_column (dict) –
a computed column definition, as a dict of:
mode : type of expression used to define the computations. One of GREL or SQL.
name : name of the column generated
type : name of a DSS type for the computed column
expr : if mode is GREL, a formula in the DSS formula language. If mode is SQL, a SQL expression.
- set_post_filter(postfilter)#
Add a post filter on the join.
Use the methods on
dataikuapi.dss.utils.DSSFilter
to build the filter definition.
- Parameters:
postfilter (dict) –
definition of a filter, as a dict of:
distinct : whether the records in the output should be deduplicated
enabled : whether filtering is enabled
uiData : settings of the filter, if enabled is True, as a dict of:
mode : type of filter. Possible values: CUSTOM, SQL, ‘&&’ (boolean AND of conditions) and ‘||’ (boolean OR of conditions)
conditions : if mode is ‘&&’ or ‘||’, then a list of the actual filter conditions, each one a dict
expression : if uiData.mode is CUSTOM, a formula in the DSS formula language. If uiData.mode is SQL, a SQL expression.
- set_unmatched_output(ref, side='right', append_mode=False)#
Add an unmatched join output.
- Parameters:
ref (str) – name of the dataset
side (str) – side of the unmatched output, ‘right’ or ‘left’.
append_mode (bool) – whether the recipe should append or overwrite the output when running
- class dataikuapi.dss.recipe.DownloadRecipeSettings(recipe, data)#
Settings of a download recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.SplitRecipeSettings(recipe, data)#
Settings of a split recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.StackRecipeSettings(recipe, data)#
Settings of a stack recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
Creation#
- class dataikuapi.dss.recipe.DSSRecipeCreator(type, name, project)#
Helper to create new recipes.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- set_name(name)#
Set the name of the recipe-to-be-created.
- Parameters:
name (string) – a recipe name. Should only use alphanumeric characters and underscores. Cannot contain dots.
- with_input(input_id, project_key=None, role='main')#
Add an existing object as input to the recipe-to-be-created.
- Parameters:
input_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
project_key (string) – project containing the object, if different from the one where the recipe is created
role (string) – the role of the recipe in which the input should be added. Most recipes only have one role named “main”.
- with_output(output_id, append=False, role='main')#
Add an existing object as output to the recipe-to-be-created.
The output dataset must already exist.
- Parameters:
output_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
role (string) – the role of the recipe in which the output should be added. Most recipes only have one role named “main”.
- create()#
Create the new recipe in the project, and return a handle to interact with it.
- Return type:
DSSRecipe
- set_raw_mode()#
Activate raw creation mode.
Caution
For advanced uses only.
In this mode, the field “recipe_proto” of this recipe creator is used as-is to create the recipe, and if it exists, the value of creation_settings[“rawPayload”] is used as the payload of the created recipe. No checks of existence or validity of the inputs or outputs are done, and no output is auto-created.
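For illustration, a minimal sketch of the generic creation flow (assuming project is a DSSProject handle; the recipe type and object names are illustrative, and the output dataset must already exist):
builder = project.new_recipe("sync", "compute_mydataset_copy")   # illustrative type and name
builder.with_input("mydataset")
builder.with_output("mydataset_copy")     # existing output dataset
recipe = builder.create()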
- class dataikuapi.dss.recipe.SingleOutputRecipeCreator(type, name, project)#
Create a recipe that has a single output.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- with_existing_output(output_id, append=False)#
Add an existing object as output to the recipe-to-be-created.
The output dataset must already exist.
- Parameters:
output_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
- with_new_output(name, connection, type=None, format=None, override_sql_schema=None, partitioning_option_id=None, append=False, object_type='DATASET', overwrite=False, **kwargs)#
Create a new dataset or managed folder as output to the recipe-to-be-created.
The dataset or managed folder is not created immediately, but when the recipe is created (i.e. in the create() method). Whether a dataset or a managed folder is created depends on the recipe type.
- Parameters:
name (string) – name of the dataset or identifier of the managed folder
connection (string) – name of the connection to create the dataset or managed folder on
type (string) – sub-type of dataset or managed folder, for connections where the type could be ambiguous. Typically applies to SSH connections, where sub-types can be SCP or SFTP
format (string) – name of a format preset relevant for the dataset type. Possible values are: CSV_ESCAPING_NOGZIP_FORHIVE, CSV_UNIX_GZIP, CSV_EXCEL_GZIP, CSV_EXCEL_GZIP_BIGQUERY, CSV_NOQUOTING_NOGZIP_FORPIG, PARQUET_HIVE, AVRO, ORC
override_sql_schema (boolean) – schema to force for the dataset, for SQL datasets. If left empty, it will be autodetected
partitioning_option_id (string) – to copy the partitioning schema of an existing dataset ‘foo’, pass a value of ‘copy:dataset:foo’. If unset, then the output will be non-partitioned
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
object_type (string) – DATASET or MANAGED_FOLDER
overwrite (boolean) – If the dataset being created already exists, overwrite it (and delete data)
- with_output(output_id, append=False)#
Add an existing object as output to the recipe-to-be-created.
Note
Alias of
with_existing_output()
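For illustration, a minimal sketch of creating a recipe together with a new managed output dataset (assuming project is a DSSProject handle; the recipe type, connection and names are illustrative):
builder = project.new_recipe("prepare", "compute_orders_prepared")   # illustrative type and name
builder.with_input("orders")
builder.with_new_output("orders_prepared", "filesystem_managed", format="CSV_EXCEL_GZIP")
recipe = builder.create()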
- class dataikuapi.dss.recipe.VirtualInputsSingleOutputRecipeCreator(type, name, project)#
Create a recipe that has a single output and several inputs.
- with_input(input_id, project_key=None)#
Add an existing object as input to the recipe-to-be-created.
- Parameters:
input_id (string) – name of the dataset
project_key (string) – project containing the object, if different from the one where the recipe is created
- class dataikuapi.dss.recipe.CodeRecipeCreator(name, type, project)#
Create a recipe running a script.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
- with_script(script)#
Set the code of the recipe-to-be-created.
- Parameters:
script (string) – code of the recipe
- with_new_output_dataset(name, connection, type=None, format=None, copy_partitioning_from='FIRST_INPUT', append=False, overwrite=False, **kwargs)#
Create a new managed dataset as output to the recipe-to-be-created.
The dataset is created immediately.
- Parameters:
name (string) – name of the dataset
connection (string) – name of the connection to create the dataset on
type (string) – sub-type of dataset or managed folder, for connections where the type could be ambiguous. Typically applies to SSH connections, where sub-types can be SCP or SFTP
format (string) – name of a format preset relevant for the dataset type. Possible values are: CSV_ESCAPING_NOGZIP_FORHIVE, CSV_UNIX_GZIP, CSV_EXCEL_GZIP, CSV_EXCEL_GZIP_BIGQUERY, CSV_NOQUOTING_NOGZIP_FORPIG, PARQUET_HIVE, AVRO, ORC
partitioning_option_id (string) – to copy the partitioning schema of an existing dataset ‘foo’, pass a value of ‘copy:dataset:foo’. If unset, then the output will be non-partitioned
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
overwrite (boolean) – If the dataset being created already exists, overwrite it (and delete data)
- with_new_output_streaming_endpoint(name, connection, format=None, overwrite=False, **kwargs)#
Create a new managed streaming endpoint as output to the recipe-to-be-created.
The streaming endpoint is created immediately.
- Parameters:
name (str) – name of the streaming endpoint to create
connection (str) – name of the connection to create the streaming endpoint on
format (str) – name of a format preset relevant for the streaming endpoint type. Possible values are: json, avro, single (kafka endpoints) or json, string (SQS endpoints). If None, uses the default
overwrite – If the streaming endpoint being created already exists, overwrite it
- class dataikuapi.dss.recipe.PythonRecipeCreator(name, project)#
Create a Python recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
A Python recipe can be defined either by its complete code, like a normal Python recipe, or by a function signature.
- with_function_name(module_name, function_name, custom_template=None, **function_args)#
Define this recipe as being a functional recipe calling a function.
With the default template, the function must take as arguments:
A list of dataframes corresponding to the dataframes of the input datasets. If there is only one input, then a single dataframe
Optional named arguments corresponding to arguments passed to the creator as kwargs
The function should then return a list of dataframes, one per recipe output. If there is a single output, it is possible to return a single dataframe rather than a list.
- Parameters:
module_name (string) – name of the module where the function is defined
function_name (string) – name of the function
function_args (kwargs) – additional parameters to the function.
custom_template (string) – template to use to create the code of the recipe. The template is formatted with ‘{fname}’ (function name), ‘{module_name}’ (module name) and ‘{params_json}’ (JSON representation of function_args)
- with_function(fn, custom_template=None, **function_args)#
Define this recipe as being a functional recipe calling a function.
With the default template, the function must take as arguments:
A list of dataframes corresponding to the dataframes of the input datasets. If there is only one input, then a single dataframe
Optional named arguments corresponding to arguments passed to the creator as kwargs
The function should then return a list of dataframes, one per recipe output. If there is a single output, it is possible to return a single dataframe rather than a list.
- Parameters:
fn (function) – function to call
function_args (kwargs) – additional parameters to the function.
custom_template (string) – template to use to create the code of the recipe. The template is formatted with ‘{fname}’ (function name), ‘{module_name}’ (module name) and ‘{params_json}’ (JSON representation of function_args)
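For illustration, a minimal sketch of creating a Python recipe from a function (assuming project is a DSSProject handle; dataset names and the function body are illustrative, and the output dataset must already exist):
def process(input_df):
    # illustrative transformation: return the input dataframe unchanged
    return input_df

builder = project.new_recipe("python", "compute_processed_data")   # illustrative name
builder.with_input("raw_data")
builder.with_output("processed_data")
builder.with_function(process)
recipe = builder.create()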
- class dataikuapi.dss.recipe.SQLQueryRecipeCreator(name, project)#
Create a SQL query recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.PrepareRecipeCreator(name, project)#
Create a Prepare recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SyncRecipeCreator(name, project)#
Create a Sync recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SamplingRecipeCreator(name, project)#
Create a Sample/Filter recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.DistinctRecipeCreator(name, project)#
Create a Distinct recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.GroupingRecipeCreator(name, project)#
Create a Group recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- with_group_key(group_key)#
Set a column as the first grouping key.
Only a single grouping key may be set at recipe creation time. To add more grouping keys, get the recipe settings and use
GroupingRecipeSettings.add_grouping_key()
. To have no grouping keys at all, get the recipe settings and use
GroupingRecipeSettings.clear_grouping_keys()
.
- Parameters:
group_key (string) – name of a column in the input dataset
- Returns:
self
- Return type:
GroupingRecipeCreator
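For illustration, a minimal sketch of creating a Group recipe with a grouping key (assuming project is a DSSProject handle; connection and names are illustrative):
builder = project.new_recipe("grouping", "compute_orders_by_customer")   # illustrative name
builder.with_input("orders")
builder.with_new_output("orders_by_customer", "filesystem_managed")
builder.with_group_key("customer_id")
recipe = builder.create()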
- class dataikuapi.dss.recipe.PivotRecipeCreator(name, project)#
Create a Pivot recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SortRecipeCreator(name, project)#
Create a Sort recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.TopNRecipeCreator(name, project)#
Create a TopN recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.WindowRecipeCreator(name, project)#
Create a Window recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.JoinRecipeCreator(name, project)#
Create a Join recipe.
The recipe is created with default joins guessed by matching column names in the inputs.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.FuzzyJoinRecipeCreator(name, project)#
Create a FuzzyJoin recipe
The recipe is created with default joins guessed by matching column names in the inputs.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.GeoJoinRecipeCreator(name, project)#
Create a GeoJoin recipe
The recipe is created with default joins guessed by matching column names in the inputs.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SplitRecipeCreator(name, project)#
Create a Split recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.StackRecipeCreator(name, project)#
Create a Stack recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.DownloadRecipeCreator(name, project)#
Create a Download recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.PredictionScoringRecipeCreator(name, project)#
Create a new Prediction scoring recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new prediction scoring recipe outputting to a new dataset
project = client.get_project("MYPROJECT")
builder = project.new_recipe("prediction_scoring", "my_scoring_recipe")
builder.with_input_model("saved_model_id")
builder.with_input("dataset_to_score")
builder.with_new_output("my_output_dataset", "myconnection")

# Or for a filesystem output connection
# builder.with_new_output("my_output_dataset", "filesystem_managed", format="CSV_EXCEL_GZIP")

new_recipe = builder.build()
- with_input_model(model_id)#
Set the input model.
- Parameters:
model_id (string) – identifier of a saved model
- class dataikuapi.dss.recipe.ClusteringScoringRecipeCreator(name, project)#
Create a new Clustering scoring recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new clustering scoring recipe outputting to a new dataset
project = client.get_project("MYPROJECT")
builder = project.new_recipe("clustering_scoring", "my_scoring_recipe")
builder.with_input_model("saved_model_id")
builder.with_input("dataset_to_score")
builder.with_new_output("my_output_dataset", "myconnection")

# Or for a filesystem output connection
# builder.with_new_output("my_output_dataset", "filesystem_managed", format="CSV_EXCEL_GZIP")

new_recipe = builder.build()
- with_input_model(model_id)#
Set the input model.
- Parameters:
model_id (string) – identifier of a saved model
- class dataikuapi.dss.recipe.EvaluationRecipeCreator(name, project)#
Create a new Evaluate recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new evaluation recipe outputting to a new dataset, to a metrics dataset and/or to a model evaluation store
project = client.get_project("MYPROJECT")
builder = project.new_recipe("evaluation")
builder.with_input_model(saved_model_id)
builder.with_input("dataset_to_evaluate")

builder.with_output("output_scored")
builder.with_output_metrics("output_metrics")
builder.with_output_evaluation_store(evaluation_store_id)

new_recipe = builder.build()

# Access the settings
er_settings = new_recipe.get_settings()
payload = er_settings.obj_payload

# Change the settings
payload['dontComputePerformance'] = True
payload['outputProbabilities'] = False
payload['metrics'] = ["precision", "recall", "auc", "f1", "costMatrixGain"]

# Manage evaluation labels
payload['labels'] = [dict(key="label_1", value="value_1"), dict(key="label_2", value="value_2")]

# Save the settings and run the recipe
er_settings.save()
new_recipe.run()
Outputs must exist. They can be created using the following:
builder = project.new_managed_dataset("output_scored")
builder.with_store_into(connection)
dataset = builder.create()

builder = project.new_managed_dataset("output_metrics")
builder.with_store_into(connection)
dataset = builder.create()

evaluation_store_id = project.create_model_evaluation_store("output_model_evaluation").mes_id
- with_input_model(model_id)#
Set the input model.
- Parameters:
model_id (string) – identifier of a saved model
- with_output(output_id)#
Set the output dataset containing the scored input.
- Parameters:
output_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
- with_output_metrics(name)#
Set the output dataset containing the metrics.
- Parameters:
name (string) – name of an existing dataset
- with_output_evaluation_store(mes_id)#
Set the output model evaluation store.
- Parameters:
mes_id (string) – identifier of a model evaluation store
- class dataikuapi.dss.recipe.StandaloneEvaluationRecipeCreator(name, project)#
Create a new Standalone Evaluate recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new standalone evaluation of a scored dataset
project = client.get_project("MYPROJECT")
builder = project.new_recipe("standalone_evaluation")
builder.with_input("scored_dataset_to_evaluate")
builder.with_output_evaluation_store(evaluation_store_id)

# Add a reference dataset (optional) to compute data drift
builder.with_reference_dataset("reference_dataset")

# Finish creation of the recipe
new_recipe = builder.create()

# Modify the model parameters in the SER settings
ser_settings = new_recipe.get_settings()
payload = ser_settings.obj_payload

payload['predictionType'] = "BINARY_CLASSIFICATION"
payload['targetVariable'] = "Survived"
payload['predictionVariable'] = "prediction"
payload['isProbaAware'] = True
payload['dontComputePerformance'] = False

# For a classification model with probabilities, the 'probas' section can be filled with the mapping of the class and the probability column
# e.g. for a binary classification model with 2 columns: proba_0 and proba_1
class_0 = dict(key=0, value="proba_0")
class_1 = dict(key=1, value="proba_1")
payload['probas'] = [class_0, class_1]

# Change the 'features' settings for this standalone evaluation
# e.g. reject the features that you do not want to use in the evaluation
feature_passengerid = dict(name="Passenger_Id", role="REJECT", type="TEXT")
feature_ticket = dict(name="Ticket", role="REJECT", type="TEXT")
feature_cabin = dict(name="Cabin", role="REJECT", type="TEXT")
payload['features'] = [feature_passengerid, feature_ticket, feature_cabin]

# To set the cost matrix properly, access the 'metricParams' section of the payload and set the cost matrix weights:
payload['metricParams'] = dict(costMatrixWeights=dict(tpGain=0.4, fpGain=-1.0, tnGain=0.2, fnGain=-0.5))

# Save the recipe and run the recipe
# Note that with this method, all the settings that were not explicitly set are instead set to their default value.
ser_settings.save()
new_recipe.run()
Output model evaluation store must exist. It can be created using the following:
evaluation_store_id = project.create_model_evaluation_store("output_model_evaluation").mes_id
- with_output_evaluation_store(mes_id)#
Set the output model evaluation store.
- Parameters:
mes_id (string) – identifier of a model evaluation store
- with_reference_dataset(dataset_name)#
Set the dataset to use as a reference in data drift computation.
- Parameters:
dataset_name (string) – name of a dataset
- class dataikuapi.dss.recipe.ContinuousSyncRecipeCreator(name, project)#
Create a continuous Sync recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Utilities#
- class dataikuapi.dss.utils.DSSComputedColumn#
- static formula(name, formula, type='double')#
Create a computed column with a formula.
- Parameters:
name (string) – a name for the computed column
formula (string) – formula to compute values, using the GREL language
type (string) – name of a DSS type for the values of the column
- Returns:
a computed column as a dict
- Return type:
dict
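For illustration, a minimal sketch of building a computed column and adding it after the joins of a Join recipe (assuming join_recipe is a handle to an existing Join recipe; the column name and GREL formula are illustrative):
from dataikuapi.dss.utils import DSSComputedColumn

# illustrative column name and GREL formula
cc = DSSComputedColumn.formula("total_price", "unit_price * quantity", type="double")

settings = join_recipe.get_settings()      # a JoinRecipeSettings
settings.add_post_join_computed_column(cc)
settings.save()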
- class dataikuapi.dss.utils.DSSFilter#
Helper class to build filter objects for use in visual recipes.
- static of_single_condition(column, operator, string=None, num=None, date=None, time=None, date2=None, time2=None, unit=None)#
Create a simple filter on a column.
Which of the ‘string’, ‘num’, ‘date’, ‘time’, ‘date2’ and ‘time2’ parameters holds the literal to filter against depends on the filter operator.
- Parameters:
column (string) – name of a column to filter (left operand)
operator (string) – type of filter applied to the column, one of the values in the
DSSFilterOperator
enum
string (string) – string literal for the right operand
num (string) – numeric literal for the right operand
date (string) – date part literal for the right operand
time (string) – time part literal for the right operand
date2 (string) – date part literal for the right operand of BETWEEN_DATE
time2 (string) – time part literal for the right operand of BETWEEN_DATE
unit (string) – date/time rounding for date operations. Possible values are YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND
- static of_and_conditions(conditions)#
Create a filter as an intersection of conditions.
The resulting filter keeps rows that match all the conditions in the list. Conditions are for example the output of
condition()
.
- Parameters:
conditions (list) – a list of conditions
- Returns:
a filter, as a dict
- Return type:
dict
- static of_or_conditions(conditions)#
Create a filter as a union of conditions.
The resulting filter keeps rows that match any of the conditions in the list. Conditions are for example the output of
condition()
.
- Parameters:
conditions (list) – a list of conditions
- Returns:
a filter, as a dict
- Return type:
dict
- static of_formula(formula)#
Create a filter that applies a GREL formula.
The resulting filter evaluates the formula and keeps rows for which the formula returns a True value.
- Parameters:
formula (string) – a GREL formula
- Returns:
a filter, as a dict
- Return type:
dict
- static of_sql_expression(sql_expression)#
Create a filter that applies a SQL expression.
The resulting filter evaluates the SQL expression and keeps rows for which it returns a True value.
- Parameters:
sql_expression (string) – a SQL expression
- Returns:
a filter, as a dict
- Return type:
dict
- static condition(column, operator, string=None, num=None, date=None, time=None, date2=None, time2=None, unit=None)#
Create a condition on a column for a filter.
Which of the ‘string’, ‘num’, ‘date’, ‘time’, ‘date2’ and ‘time2’ parameters holds the literal to filter against depends on the filter operator.
- Parameters:
column (string) – name of a column to filter (left operand)
operator (string) – type of filter applied to the column, one of the values in the
DSSFilterOperator
enum
string (string) – string literal for the right operand
num (string) – numeric literal for the right operand
date (string) – date part literal for the right operand
time (string) – time part literal for the right operand
date2 (string) – date part literal for the right operand of BETWEEN_DATE
time2 (string) – time part literal for the right operand of BETWEEN_DATE
unit (string) – date/time rounding for date operations. Possible values are YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND
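For illustration, a minimal sketch of building a filter from two conditions and setting it as the post-filter of a Join recipe (assuming join_recipe is a handle to an existing Join recipe; column names and literals are illustrative):
from dataikuapi.dss.utils import DSSFilter, DSSFilterOperator

conditions = [
    DSSFilter.condition("age", DSSFilterOperator.GREATER_NUMBER.value, num=18),         # illustrative
    DSSFilter.condition("country", DSSFilterOperator.EQUALS_STRING.value, string="FR"),  # illustrative
]
post_filter = DSSFilter.of_and_conditions(conditions)

settings = join_recipe.get_settings()      # a JoinRecipeSettings
settings.set_post_filter(post_filter)
settings.save()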
- class dataikuapi.dss.utils.DSSFilterOperator(value)#
An enumeration.
- EMPTY_ARRAY = 'empty array'#
Test if an array is empty.
- NOT_EMPTY_ARRAY = 'not empty array'#
Test if an array is not empty.
- CONTAINS_ARRAY = 'array contains'#
Test if an array contains a value.
- NOT_EMPTY = 'not empty'#
Test if a value is not empty and not null.
- EMPTY = 'is empty'#
Test if a value is empty or null.
- NOT_EMPTY_STRING = 'not empty string'#
Test if a string is not empty.
- EMPTY_STRING = 'empty string'#
Test if a string is empty.
- IS_TRUE = 'true'#
Test if a boolean is true.
- IS_FALSE = 'false'#
Test if a boolean is false.
- EQUALS_STRING = '== [string]'#
Test if a string is equal to a given value.
- EQUALS_CASE_INSENSITIVE_STRING = '== [string]i'#
Test if a string is equal to a given value, ignoring case.
- NOT_EQUALS_STRING = '!= [string]'#
Test if a string is not equal to a given value.
- SAME = '== [NaNcolumn]'#
Test if two columns have the same value when formatted to string.
- DIFFERENT = '!= [NaNcolumn]'#
Test if two columns have different values when formatted to string.
- EQUALS_NUMBER = '== [number]'#
Test if a number is equal to a given value.
- NOT_EQUALS_NUMBER = '!= [number]'#
Test if a number is not equal to a given value.
- GREATER_NUMBER = '> [number]'#
Test if a number is greater than a given value.
- LESS_NUMBER = '< [number]'#
Test if a number is less than a given value.
- GREATER_OR_EQUAL_NUMBER = '>= [number]'#
Test if a number is greater or equal to a given value.
- LESS_OR_EQUAL_NUMBER = '<= [number]'#
Test if a number is less or equal to a given value.
- EQUALS_DATE = '== [date]'#
Test if a date/time is equal to a given date/time (rounded).
- GREATER_DATE = '> [date]'#
Test if a date/time is greater than a given date/time.
- GREATER_OR_EQUAL_DATE = '>= [date]'#
Test if a date/time is greater than or equal to a given date/time.
- LESS_DATE = '< [date]'#
Test if a date/time is less than a given date/time.
- LESS_OR_EQUAL_DATE = '<= [date]'#
Test if a date/time is less than or equal to a given date/time.
- BETWEEN_DATE = '>< [date]'#
Test if a date/time is between two given date/times.
- EQUALS_COL = '== [column]'#
Test if two columns have the same (typed) value.
- NOT_EQUALS_COL = '!= [column]'#
Test if two columns have different (typed) values.
- GREATER_COL = '> [column]'#
Test if one column is greater than another.
- LESS_COL = '< [column]'#
Test if one column is less than another.
- GREATER_OR_EQUAL_COL = '>= [column]'#
Test if one column is greater than or equal to another.
- LESS_OR_EQUAL_COL = '<= [column]'#
Test if one column is less than or equal to another.
- CONTAINS_STRING = 'contains'#
Test if a column contains a given string.
- REGEX = 'regex'#
Test if a column matches a regular expression.