Recipes#
For usage information and examples, see Recipes
- class dataikuapi.dss.recipe.DSSRecipe(client, project_key, recipe_name)#
A handle to an existing recipe on the DSS instance.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.get_recipe()
- property id#
Get the identifier of the recipe.
For recipes, the name is the identifier.
- Return type:
string
- property name#
Get the name of the recipe.
- Return type:
string
- compute_schema_updates()#
Computes which updates are required to the outputs of this recipe.
This method only computes which changes would be needed to make the schema of the outputs of the recipe match the actual schema that the recipe will produce. To actually apply these changes to the outputs, call
apply()
on the returned object.
Note
Not all recipe types can automatically compute the schema of their outputs. Code recipes, such as Python recipes, notably cannot. This method raises an exception in these cases.
Usage example:
required_updates = recipe.compute_schema_updates()
if required_updates.any_action_required():
    print("Some schemas will be updated")
# Note that you can call apply even if no changes are required. This will be a no-op
required_updates.apply()
- Returns:
an object containing the required updates
- Return type:
RequiredSchemaUpdates
- run(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)#
Starts a new job to run this recipe and, by default, waits for it to complete.
Raises if the job failed.
job = recipe.run()
print("Job %s done" % job.id)
- Parameters:
job_type (string) – job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
partitions (string) – if the outputs are partitioned, a partition spec. A spec is a comma-separated list of partition identifiers, and a partition identifier is a pipe-separated list of values for the partitioning dimensions
no_fail (boolean) – if True, does not raise if the job failed
wait (boolean) – if True, the method waits for the job completion. If False, the method returns immediately
- Returns:
a job handle corresponding to the recipe run
- Return type:
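The partition spec format described for the partitions parameter can be assembled with plain string joins. The following is a minimal sketch; the helper name build_partition_spec and the sample dimension values are illustrative and not part of dataikuapi.

```python
def build_partition_spec(partition_ids):
    """Build a partition spec from a list of partition identifiers.

    Each identifier is given as a sequence of dimension values. Values are
    joined with "|" and identifiers are joined with ",".
    """
    return ",".join("|".join(str(v) for v in values) for values in partition_ids)


# Two partitions of an output partitioned by (day, country); hypothetical values
spec = build_partition_spec([("2024-01-01", "FR"), ("2024-01-02", "FR")])
print(spec)
```

The resulting string could then be passed as recipe.run(partitions=spec).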
- delete()#
Delete the recipe.
- rename(new_name)#
Rename the recipe to the specified new name.
- Parameters:
new_name (str) – the new name of the recipe
- get_settings()#
Get the settings of the recipe, as a
DSSRecipeSettings
or one of its subclasses.
Some recipes have a dedicated class for the settings, with additional helpers to read and modify the settings.
Once you are done modifying the returned settings object, you can call
save()
on it to save the modifications to the DSS recipe.
- Return type:
DSSRecipeSettings
or a subclass
- get_definition_and_payload()#
Get the definition of the recipe.
Attention
Deprecated. Use
get_settings()
- Returns:
an object holding both the raw definition of the recipe (the type, which inputs and outputs, engine settings…) and the payload (SQL script, Python code, join definition,… depending on type)
- Return type:
DSSRecipeDefinitionAndPayload
- set_definition_and_payload(definition)#
Set the definition of the recipe.
Attention
Deprecated. Use
get_settings()
then
DSSRecipeSettings.save()
Important
The definition parameter should come from a call to
get_definition_and_payload()
- Parameters:
definition (object) – a recipe definition, as returned by
get_definition_and_payload()
- get_status()#
Gets the status of this recipe.
The status of a recipe is made up of messages from checks performed by DSS on the recipe, messages about the availability of engines for the recipe, messages about testing the recipe on the engine, …
- Returns:
an object to interact with the status
- Return type:
DSSRecipeStatus
- get_metadata()#
Get the metadata attached to this recipe.
The metadata contains the label, description, checklists, tags and custom metadata of the recipe.
- Returns:
the metadata as a dict, with fields:
label : label of the object (not defined for recipes)
description : description of the object (not defined for recipes)
checklists : checklists of the object, as a dict with a checklists field, which is a list of checklists, each a dict.
tags : list of tags, each a string
custom : custom metadata, as a dict with a kv field, which is a dict with any contents the user wishes
customFields : dict of custom field info (not defined for recipes)
- Return type:
dict
- set_metadata(metadata)#
Set the metadata on this recipe.
Important
You should only set a metadata object that has been retrieved using
get_metadata()
.
- Parameters:
metadata (dict) – the new state of the metadata for the recipe.
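As an illustration of the retrieve-modify-write pattern recommended above, here is a minimal sketch that adds a tag to a recipe. The helper name add_tag is hypothetical, and recipe is assumed to be a DSSRecipe handle obtained from DSSProject.get_recipe().

```python
def add_tag(recipe, tag):
    """Add a tag to a recipe through get_metadata() / set_metadata()."""
    metadata = recipe.get_metadata()
    # "tags" is documented as a list of strings; create it defensively if absent
    tags = metadata.setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)
    # Write back the same object that was retrieved, as recommended above
    recipe.set_metadata(metadata)
    return tags
```

Starting from the dict returned by get_metadata() keeps the fields that the helper does not touch (checklists, custom metadata) unchanged.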
- get_object_discussions()#
Get a handle to manage discussions on the recipe.
- Returns:
the handle to manage discussions
- Return type:
- get_continuous_activity()#
Get a handle on the associated continuous activity.
Note
Should only be used on continuous recipes.
- move_to_zone(zone)#
Move this object to a flow zone.
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object
- class dataikuapi.dss.recipe.DSSRecipeListItem(client, data)#
An item in a list of recipes.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.list_recipes()
- property name#
Get the name of the recipe.
- Return type:
string
- property id#
Get the identifier of the recipe.
For recipes, the name is the identifier.
- Return type:
string
- property type#
Get the type of the recipe.
- Returns:
a recipe type, for example ‘sync’ or ‘join’
- Return type:
string
- property tags#
- class dataikuapi.dss.recipe.DSSRecipeStatus(client, data)#
Status of a recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_status()
- get_selected_engine_details()#
Get the selected engine for this recipe.
This method will raise if there is no selected engine, whether it’s because the present recipe type has no notion of engine, or because DSS couldn’t find any viable engine for running the recipe.
- Returns:
a dict of the details of the selected engine. The type of engine is in a type field. Depending on the type, additional fields give more details, like whether some aggregations are possible.
- Return type:
dict
- get_engines_details()#
Get details about all possible engines for this recipe.
This method will raise if there is no engine, whether it’s because the present recipe type has no notion of engine, or because DSS couldn’t find any viable engine for running the recipe.
- Returns:
a list of dict of the details of each possible engine. See
get_selected_engine_details()
for the fields of each dict.
- Return type:
list[dict]
- get_status_severity()#
Get the overall status of the recipe.
This is the final result of checking the different parts of the recipe, and depends on the recipe type. Examples of checks done include:
checking the validity of the formulas in computed columns or filters
checking if some of the input columns retrieved by joins overlap
checking against the SQL database if the generated SQL is valid
- Returns:
SUCCESS, WARNING, ERROR or INFO. None if the status has no message at all.
- Return type:
string
- get_status_messages(as_objects=False)#
Returns status messages for this recipe.
- Parameters:
as_objects (boolean) – if True, return a list of
dataikuapi.dss.utils.DSSInfoMessage
. If False, return a list of raw dicts.
- Returns:
if as_objects is True, a list of
dataikuapi.dss.utils.DSSInfoMessage
, otherwise a list of message information, each one a dict of:
severity : severity of the error in the message. Possible values are SUCCESS, INFO, WARNING, ERROR
isFatal : for ERROR severity, whether the error is considered fatal to the operation
code : a string with a well-known code documented in DSS doc
title : short message
message : the error message
details : a more detailed error description
- Return type:
list
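The status methods above can be combined to report problems on a recipe. This is a minimal sketch that works on the raw-dict form of the messages; the helper name summarize_status is hypothetical, and status is assumed to be the object returned by DSSRecipe.get_status().

```python
def summarize_status(status):
    """Return (severity, error_titles) for a DSSRecipeStatus-like object."""
    severity = status.get_status_severity()  # may be None if there is no message
    messages = status.get_status_messages()  # raw dicts by default
    # Keep only the short title of messages whose severity is ERROR
    errors = [m.get("title") for m in messages if m.get("severity") == "ERROR"]
    return severity, errors
```

A caller could, for instance, decide not to run the recipe when the returned severity is ERROR.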
- class dataikuapi.dss.recipe.RequiredSchemaUpdates(recipe, data)#
Handle on a set of required updates to the schema of the outputs of a recipe.
Important
Do not instantiate directly, use
DSSRecipe.compute_schema_updates()
For example, changes can be new columns in the output of a Group recipe when new aggregates are activated in the recipe’s settings.
- any_action_required()#
Whether there are changes at all.
- Return type:
boolean
- apply()#
Apply the changes.
All the updates found to be required are applied, for each of the recipe’s outputs.
Settings#
- class dataikuapi.dss.recipe.DSSRecipeSettings(recipe, data)#
Settings of a recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- save()#
Save back the recipe in DSS.
- property type#
Get the type of the recipe.
- Returns:
a type, like ‘sync’, ‘python’ or ‘join’
- Return type:
string
- property str_payload#
The raw “payload” of the recipe.
This is exactly the data persisted on disk.
- Returns:
for code recipes, the payload will be the script of the recipe. For visual recipes, the payload is a JSON of settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
string
- property obj_payload#
The “payload” of the recipe, parsed from JSON.
Note
Do not use on code recipes, their payload isn’t JSON-encoded.
- Returns:
settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
dict
- property raw_params#
The non-payload settings of the recipe.
- Returns:
recipe type-specific settings that aren’t stored in the payload. Typically this comprises engine settings.
- Return type:
dict
- get_recipe_raw_definition()#
Get the recipe definition.
- Returns:
the part of the recipe’s settings that isn’t stored in the payload, as a dict. Notable fields are:
name and projectKey : identifiers of the recipe
type : type of the recipe
params : type-specific parameters of the recipe (on top of what is in the payload)
inputs : input roles to the recipe, as a dict of role name to role, where a role is a dict with an items field consisting of a list of one dict per input object. Each individual input has fields:
ref : a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
deps : for partitioned inputs, a list of partition dependencies mapping output dimensions to dimensions in this input. Each partition dependency is a dict.
outputs : output roles to the recipe, as a dict of role name to role, where a role is a dict with an items field consisting of a list of one dict per output object. Each individual output has fields:
ref : a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
appendMode : if True, the recipe should append into the output; if False, the recipe should overwrite the output when running
- Return type:
dict
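To make the nesting of roles, items and refs concrete, the sketch below walks a hand-written dict that follows the layout described above. The recipe, project and dataset names are made up for illustration, and flat_refs is a hypothetical helper; on a real settings object, get_flat_input_refs() and get_flat_output_refs() serve this purpose.

```python
def flat_refs(roles):
    """List every ref found in a role-name -> role dict."""
    refs = []
    for role in roles.values():
        for item in role.get("items", []):
            refs.append(item["ref"])
    return refs


# Hypothetical excerpt of a raw definition, following the layout described above
raw_definition = {
    "name": "compute_orders_joined",
    "projectKey": "MYPROJECT",
    "type": "join",
    "inputs": {
        "main": {"items": [{"ref": "orders", "deps": []},
                           {"ref": "OTHERPROJECT.customers", "deps": []}]}
    },
    "outputs": {
        "main": {"items": [{"ref": "orders_joined", "appendMode": False}]}
    },
}

print(flat_refs(raw_definition["inputs"]))
print(flat_refs(raw_definition["outputs"]))
```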
- get_recipe_inputs()#
Get the inputs to this recipe.
- Return type:
dict
- get_recipe_outputs()#
Get the outputs of this recipe.
- Return type:
dict
- get_recipe_params()#
The non-payload settings of the recipe.
- Returns:
recipe type-specific settings that aren’t stored in the payload. Typically this comprises engine settings.
- Return type:
dict
- get_payload()#
The raw “payload” of the recipe.
This is exactly the data persisted on disk.
- Returns:
for code recipes, the payload will be the script of the recipe. For visual recipes, the payload is a JSON of settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
string
- get_json_payload()#
The “payload” of the recipe, parsed from JSON.
Note
Do not use on code recipes, their payload isn’t JSON-encoded.
- Returns:
settings that are specific to the recipe type, like the definitions of the aggregations for a grouping recipe.
- Return type:
dict
- set_payload(payload)#
Set the payload of this recipe.
- Parameters:
payload (string) – the payload, as a string
- set_json_payload(payload)#
Set the payload of this recipe.
- Parameters:
payload (dict) – the payload, as a dict. Will be converted to JSON internally.
- has_input(input_ref)#
Whether a ref is part of the recipe’s inputs.
- Parameters:
input_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
- Return type:
boolean
- has_output(output_ref)#
Whether a ref is part of the recipe’s outputs.
- Parameters:
output_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id. Should be prefixed by the project key for exposed items, like in “PROJECT_KEY.dataset_name”
- Return type:
boolean
- replace_input(current_input_ref, new_input_ref)#
Replaces an input of this recipe by another.
If the current_input_ref isn’t part of the recipe’s inputs, this method has no effect.
- Parameters:
current_input_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that is currently input to the recipe
new_input_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that current_input_ref should be replaced with.
- replace_output(current_output_ref, new_output_ref)#
Replaces an output of this recipe by another.
If the current_output_ref isn’t part of the recipe’s outputs, this method has no effect.
- Parameters:
current_output_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that is currently output to the recipe
new_output_ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id, that current_output_ref should be replaced with.
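A typical use of has_input() and replace_input() is to rewire a recipe onto another dataset. The following is a sketch only; swap_input is a hypothetical helper, and recipe is assumed to be a DSSRecipe handle.

```python
def swap_input(recipe, current_ref, new_ref):
    """Replace one input ref by another and save; return True if a change was made."""
    settings = recipe.get_settings()
    if not settings.has_input(current_ref):
        # replace_input() would have no effect, so skip the save
        return False
    settings.replace_input(current_ref, new_ref)
    settings.save()
    return True
```

The same shape applies to outputs with has_output() and replace_output().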
- add_input(role, ref, partition_deps=None)#
Add an input to the recipe.
For most recipes, there is only one role, named “main”. A few recipes have additional roles, like scoring recipes which have a “model” role. Check the roles known to the recipe with
get_recipe_inputs()
.
- Parameters:
role (string) – name of the role of the recipe in which to add ref as input
ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id
partition_deps (list) – if ref points to a partitioned object, a list of partition dependencies, one per dimension in the partitioning scheme
- add_output(role, ref, append_mode=False)#
Add an output to the recipe.
For most recipes, there is only one role, named “main”. A few recipes have additional roles, like evaluation recipes which have a “metrics” role. Check the roles known to the recipe with
get_recipe_outputs()
.
- Parameters:
role (string) – name of the role of the recipe in which to add ref as output
ref (string) – a ref to an object in DSS, i.e. a dataset name or a managed folder id or a saved model id
append_mode (boolean) – if True, the recipe should append into the output; if False, the recipe should overwrite the output when running
- get_flat_input_refs()#
List all input refs of this recipe, regardless of the input role.
- Returns:
a list of refs, i.e. of dataset names or managed folder ids or saved model ids
- Return type:
list[string]
- get_flat_output_refs()#
List all output refs of this recipe, regardless of the output role.
- Returns:
a list of refs, i.e. of dataset names or managed folder ids or saved model ids
- Return type:
list[string]
- property custom_fields#
The custom fields of the object as a dict. Returns None if there are no custom fields
- property description#
The description of the object as a string
- property short_description#
The short description of the object as a string
- property tags#
The tags of the object, as a list of strings
- class dataikuapi.dss.recipe.DSSRecipeDefinitionAndPayload(recipe, data)
Settings of a recipe.
Note
Deprecated. Alias to
DSSRecipeSettings
, use
DSSRecipe.get_settings()
instead.
- class dataikuapi.dss.recipe.CodeRecipeSettings(recipe, data)#
Settings of a code recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- get_code()#
Get the code of the recipe.
- Return type:
string
- set_code(code)#
Update the code of the recipe.
- Parameters:
code (string) – the new code
- get_code_env_settings()#
Get the code env settings for this recipe.
- Returns:
settings to select the code env used by the recipe, as a dict of:
envMode : one of USE_BUILTIN_MODE, INHERIT (inherit from project settings and/or instance settings), EXPLICIT_ENV
envName : if envMode is EXPLICIT_ENV, the name of the code env to use
- Return type:
dict
- set_code_env(code_env=None, inherit=False, use_builtin=False)#
Set which code env this recipe uses.
Exactly one of code_env, inherit or use_builtin must be passed.
- Parameters:
code_env (string) – name of a code env
inherit (boolean) – if True, use the project’s default code env
use_builtin (boolean) – if True, use the builtin code env
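The code and code env helpers can be combined in a single edit-and-save pass. This is a minimal sketch, assuming recipe is a DSSRecipe handle on a code recipe so that get_settings() returns a CodeRecipeSettings; the helper name replace_in_code is hypothetical.

```python
def replace_in_code(recipe, old, new, code_env=None):
    """Replace a substring in a code recipe's script; optionally pin a code env."""
    settings = recipe.get_settings()  # expected to be a CodeRecipeSettings
    code = settings.get_code()
    settings.set_code(code.replace(old, new))
    if code_env is not None:
        # Only code_env is passed, so exactly one of the three options is set
        settings.set_code_env(code_env=code_env)
    settings.save()
    return settings.get_code()
```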
- class dataikuapi.dss.recipe.SyncRecipeSettings(recipe, data)#
Settings of a Sync recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.PrepareRecipeSettings(recipe, data)#
Settings of a Prepare recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- property raw_steps#
Get the list of the steps of this prepare recipe.
This method returns a reference to the list of steps, not a copy. Modifying the list then calling
DSSRecipeSettings.save()
commits the changes.
- Returns:
list of steps, each step as a dict. The precise settings for each step are not documented, but each dict has at least fields:
metaType : one of PROCESSOR or GROUP. If GROUP, there is a field steps with a sub-list of steps.
type : type of the step, for example FillEmptyWithValue or ColumnRenamer (there are many types of steps)
params : dict of the step’s own parameters. Each step type has its own parameters.
disabled : whether the step is disabled
name : label of the step
- Return type:
list[dict]
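Because GROUP steps nest further steps, code that inspects raw_steps usually needs to recurse. The sketch below runs on a hand-written list shaped like the description above; the step contents are made up, params is left empty since it is step-specific, and iter_processor_steps is a hypothetical helper.

```python
def iter_processor_steps(steps):
    """Yield every PROCESSOR step, descending into GROUP steps."""
    for step in steps:
        if step.get("metaType") == "GROUP":
            yield from iter_processor_steps(step.get("steps", []))
        else:
            yield step


# Hypothetical steps list, shaped like the raw_steps description above
steps = [
    {"metaType": "PROCESSOR", "type": "ColumnRenamer", "params": {}, "disabled": False},
    {"metaType": "GROUP", "name": "cleanup", "steps": [
        {"metaType": "PROCESSOR", "type": "FillEmptyWithValue", "params": {}, "disabled": True},
    ]},
]

# Types of the steps that are not disabled
enabled_types = [s["type"] for s in iter_processor_steps(steps) if not s.get("disabled")]
print(enabled_types)
```

With a real recipe, the list passed in would be settings.raw_steps.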
- add_processor_step(type, params)#
Add a step in the script.
- Parameters:
type (string) – type of the step, for example FillEmptyWithValue or ColumnRenamer (there are many types of steps)
params (dict) – dict of the step’s own parameters. Each step type has its own parameters.
- class dataikuapi.dss.recipe.SamplingRecipeSettings(recipe, data)#
Settings of a sampling recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.GroupingRecipeSettings(recipe, data)#
Settings of a grouping recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- clear_grouping_keys()#
Clear all grouping keys.
- add_grouping_key(column)#
Adds grouping on a column.
- Parameters:
column (string) – column to group on
- set_global_count_enabled(enabled)#
Activate computing the count of records per group.
- Parameters:
enabled (boolean) – True if the global count should be activated
- get_or_create_column_settings(column)#
Get a dict representing the aggregations to perform on a column.
If the column has no aggregation on it yet, the dict is created and added to the settings.
- Parameters:
column (string) – name of the column to aggregate on
- Returns:
the settings of the aggregations on a particular column, as a dict. The name of the column to perform aggregates on is in a column field, and the aggregates are toggled on or off with boolean fields.
- Return type:
dict
- set_column_aggregations(column, type=None, min=False, max=False, count=False, count_distinct=False, sum=False, concat=False, stddev=False, avg=False)#
Set the basic aggregations on a column.
Note
Not all aggregations may be possible. For example string-typed columns don’t have a mean or standard deviation, and some SQL databases can’t compute the exact standard deviation.
The method returns a reference to the settings of the column, not a copy. Modifying the dict returned by the method, then calling
DSSRecipeSettings.save()
will commit the changes.
Usage example:
# activate the concat aggregate on a column, and set optional parameters
# pertaining to concatenation
settings = recipe.get_settings()
column_settings = settings.set_column_aggregations("my_column_name", concat=True)
column_settings["concatDistinct"] = True
column_settings["concatSeparator"] = ', '
settings.save()
- Parameters:
column (string) – The column name
type (string) – The type of the column (as a DSS schema type name)
min (boolean) – whether the min aggregate is computed
max (boolean) – whether the max aggregate is computed
count (boolean) – whether the count aggregate is computed
count_distinct (boolean) – whether the count distinct aggregate is computed
sum (boolean) – whether the sum aggregate is computed
concat (boolean) – whether the concat aggregate is computed
avg (boolean) – whether the mean aggregate is computed
stddev (boolean) – whether the standard deviation aggregate is computed
- Returns:
the settings of the aggregations on the column, as a dict. The name of the column is in a column field.
- Return type:
dict
- class dataikuapi.dss.recipe.SortRecipeSettings(recipe, data)#
Settings of a Sort recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.TopNRecipeSettings(recipe, data)#
Settings of a TopN recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.DistinctRecipeSettings(recipe, data)#
Settings of a Distinct recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.PivotRecipeSettings(recipe, data)#
Settings of a Pivot recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.WindowRecipeSettings(recipe, data)#
Settings of a Window recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.JoinRecipeSettings(recipe, data)#
Settings of a join recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
To enable self-joins, join recipes are based on a concept of “virtual inputs”. Every join, computed pre-join column, pre-join filter, … is based on one virtual input, and each virtual input references an input of the recipe, by index.
For example, if a recipe has inputs A and B and declares two joins:
A->B
A->A (based on a computed column)
There are 3 virtual inputs:
0: points to recipe input 0 (i.e. dataset A)
1: points to recipe input 1 (i.e. dataset B)
2: points to recipe input 0 (i.e. dataset A) and includes the computed column
The first join is between virtual inputs 0 and 1
The second join is between virtual inputs 0 and 2
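The A/B example above can be written out as plain data to show how the indirection resolves. This is an illustrative sketch only: it uses the index, table1 and table2 field names documented for raw_virtual_inputs and raw_joins, omits every other field of the real dicts, and describe_joins is a hypothetical helper.

```python
# Recipe inputs, in order (input 0 is A, input 1 is B)
recipe_inputs = ["A", "B"]

# One dict per virtual input; "index" points into recipe_inputs
virtual_inputs = [
    {"index": 0},
    {"index": 1},
    {"index": 0},  # second use of A, the one carrying the computed column
]

# One dict per join; table1/table2 are indices of virtual inputs
joins = [
    {"table1": 0, "table2": 1},
    {"table1": 0, "table2": 2},
]


def describe_joins(joins, virtual_inputs, recipe_inputs):
    """Resolve each join to the names of the datasets it ultimately reads."""
    def dataset_of(virtual_index):
        return recipe_inputs[virtual_inputs[virtual_index]["index"]]
    return [(dataset_of(j["table1"]), dataset_of(j["table2"])) for j in joins]


print(describe_joins(joins, virtual_inputs, recipe_inputs))
```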
- property raw_virtual_inputs#
Get the list of virtual inputs.
This method returns a reference to the list of inputs, not a copy. Modifying the list then calling
DSSRecipeSettings.save()
commits the changes.
- Returns:
a list of virtual inputs, each one a dict. The field index holds the index of the dataset of this virtual input in the recipe’s list of inputs. Pre-filter, computed columns and column selection properties (if applicable) are defined in each virtual input.
- Return type:
list[dict]
- property raw_joins#
Get raw list of joins.
This method returns a reference to the list of joins, not a copy. Modifying the list then calling
DSSRecipeSettings.save()
commits the changes.
- Returns:
list of the join definitions, each as a dict. The table1 and table2 fields give the indices of the virtual inputs on the left side and right side respectively.
- Return type:
list[dict]
- add_virtual_input(input_dataset_index)#
Add a virtual input pointing to the specified input dataset of the recipe.
- Parameters:
input_dataset_index (int) – index of the dataset in the recipe’s list of inputs
- add_pre_join_computed_column(virtual_input_index, computed_column)#
Add a computed column to a virtual input.
You can use
dataikuapi.dss.utils.DSSComputedColumn.formula()
to build the computed_column object.
- Parameters:
virtual_input_index (int) – index of the virtual input to which the computed column is added
computed_column (dict) –
a computed column definition, as a dict of:
mode : type of expression used to define the computations. One of GREL or SQL.
name : name of the column generated
type : name of a DSS type for the computed column
expr : if mode is GREL, a formula in DSS formula language. If mode is SQL, a SQL expression.
- add_join(join_type='LEFT', input1=0, input2=1)#
Add a join between two virtual inputs.
The join is initialized with no condition.
Use
add_condition_to_join()
on the return value to add a join condition (for example column equality) to the join.
- Returns:
the newly added join as a dict (see
raw_joins()
)
- Return type:
dict
- static add_condition_to_join(join, type='EQ', column1=None, column2=None)#
Add a condition to a join.
- Parameters:
join (dict) – definition of a join
type (string) – type of join condition. Possible values are EQ, LTE, LT, GTE, GT, NE, WITHIN_RANGE, K_NEAREST, K_NEAREST_INFERIOR, CONTAINS, STARTS_WITH
column1 (string) – name of left-side column
column2 (string) – name of right-side column
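Since add_join() creates a join without any condition, it is usually followed by a call to add_condition_to_join(). The sketch below chains the two; add_equality_join is a hypothetical helper, settings is assumed to be a JoinRecipeSettings whose virtual inputs already exist, and saving is left to the caller.

```python
def add_equality_join(settings, left, right, column1, column2, join_type="LEFT"):
    """Add a join between two existing virtual inputs, on column equality."""
    join = settings.add_join(join_type=join_type, input1=left, input2=right)
    # add_condition_to_join is a static method that takes the join dict
    settings.add_condition_to_join(join, type="EQ", column1=column1, column2=column2)
    return join
```

After the call, settings.save() would commit the new join to the recipe.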
- add_post_join_computed_column(computed_column)#
Add a post-join computed column.
Use
dataikuapi.dss.utils.DSSComputedColumn
to build the computed_column object.
Note
The columns accessible to the expression of the computed column are those selected in the different joins, in their “output” form. For example, if virtual inputs 0 and 1 are joined, and column “bar” of the first input is selected with a prefix of “foo”, then the computed column can use “foobar” but not “bar”.
- Parameters:
computed_column (dict) –
a computed column definition, as a dict of:
mode : type of expression used to define the computations. One of GREL or SQL.
name : name of the column generated
type : name of a DSS type for the computed column
expr : if mode is GREL, a formula in DSS formula language. If mode is SQL, a SQL expression.
- set_post_filter(postfilter)#
Add a post filter on the join.
Use the methods on
dataikuapi.dss.utils.DSSFilter
to build the filter definition.
- Parameters:
postfilter (dict) –
definition of a filter, as a dict of:
distinct : whether the records in the output should be deduplicated
enabled : whether filtering is enabled
uiData : settings of the filter, if enabled is True, as a dict of:
mode : type of filter. Possible values: CUSTOM, SQL, ‘&&’ (boolean AND of conditions) and ‘||’ (boolean OR of conditions)
conditions : if mode is ‘&&’ or ‘||’, then a list of the actual filter conditions, each one a dict
expression : if uiData.mode is CUSTOM, a formula in DSS formula language . If uiData.mode is SQL, a SQL expression.
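For reference, a formula-based filter following the field layout above could look like the sketch below. The helper name formula_post_filter and the column name amount are illustrative; in practice the methods on dataikuapi.dss.utils.DSSFilter are the intended way to build this dict.

```python
def formula_post_filter(expression, distinct=False):
    """Build a post-filter definition using a DSS formula expression."""
    return {
        "distinct": distinct,
        "enabled": True,
        "uiData": {"mode": "CUSTOM"},
        "expression": expression,
    }


# Hypothetical filter keeping rows where the "amount" column exceeds 100
postfilter = formula_post_filter("amount > 100")
print(postfilter)
```

The resulting dict could then be passed to set_post_filter(), followed by a save of the settings.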
- set_unmatched_output(ref, side='right', append_mode=False)#
Adds an unmatched join output
- Parameters:
ref (str) – name of the dataset
side (str) – side of the unmatched output, ‘right’ or ‘left’.
append_mode (bool) – whether the recipe should append or overwrite the output when running
- class dataikuapi.dss.recipe.DownloadRecipeSettings(recipe, data)#
Settings of a download recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.SplitRecipeSettings(recipe, data)#
Settings of a split recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
- class dataikuapi.dss.recipe.StackRecipeSettings(recipe, data)#
Settings of a stack recipe.
Important
Do not instantiate directly, use
DSSRecipe.get_settings()
Creation#
- class dataikuapi.dss.recipe.DSSRecipeCreator(type, name, project)#
Helper to create new recipes.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- set_name(name)#
Set the name of the recipe-to-be-created.
- Parameters:
name (string) – a recipe name. Should only use alphanumeric characters and underscores. Cannot contain dots.
- with_input(input_id, project_key=None, role='main')#
Add an existing object as input to the recipe-to-be-created.
- Parameters:
input_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
project_key (string) – project containing the object, if different from the one where the recipe is created
role (string) – the role of the recipe in which the input should be added. Most recipes only have one role named “main”.
- with_output(output_id, append=False, role='main')#
Add an existing object as output to the recipe-to-be-created.
The output dataset must already exist.
- Parameters:
output_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
role (string) – the role of the recipe in which the output should be added. Most recipes only have one role named “main”.
- create()#
Creates the new recipe in the project, and return a handle to interact with it.
- Return type:
DSSRecipe
- set_raw_mode()#
Activate raw creation mode.
Caution
For advanced uses only.
In this mode, the field “recipe_proto” of this recipe creator is used as-is to create the recipe, and if it exists, the value of creation_settings[“rawPayload”] is used as the payload of the created recipe. No checks of existence or validity of the inputs or outputs are done, and no output is auto-created.
- class dataikuapi.dss.recipe.SingleOutputRecipeCreator(type, name, project)#
Create a recipe that has a single output.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- with_existing_output(output_id, append=False)#
Add an existing object as output to the recipe-to-be-created.
The output dataset must already exist.
- Parameters:
output_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
- with_new_output(name, connection, type=None, format=None, override_sql_schema=None, partitioning_option_id=None, append=False, object_type='DATASET', overwrite=False, **kwargs)#
Create a new dataset or managed folder as output to the recipe-to-be-created.
The dataset or managed folder is not created immediately, but when the recipe is created (i.e. in the create() method). Whether a dataset or a managed folder is created depends on the recipe type.
- Parameters:
name (string) – name of the dataset or identifier of the managed folder
connection (string) – name of the connection to create the dataset or managed folder on
type (string) – sub-type of dataset or managed folder, for connections where the type could be ambiguous. Typically applies to SSH connections, where sub-types can be SCP or SFTP
format (string) – name of a format preset relevant for the dataset type. Possible values are: CSV_ESCAPING_NOGZIP_FORHIVE, CSV_UNIX_GZIP, CSV_EXCEL_GZIP, CSV_EXCEL_GZIP_BIGQUERY, CSV_NOQUOTING_NOGZIP_FORPIG, PARQUET_HIVE, AVRO, ORC
override_sql_schema (boolean) – schema to force on the dataset, for SQL datasets. If left empty, it is autodetected
partitioning_option_id (string) – to copy the partitioning schema of an existing dataset ‘foo’, pass a value of ‘copy:dataset:foo’. If unset, then the output will be non-partitioned
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
object_type (string) – DATASET or MANAGED_FOLDER
overwrite (boolean) – If the dataset being created already exists, overwrite it (and delete data)
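Putting the creator methods together, creating a single-output recipe could look like the sketch below. It assumes that DSSProject.new_recipe() takes the recipe type as its first argument and returns a SingleOutputRecipeCreator for the "sync" type; the helper name create_sync_recipe and its arguments are hypothetical.

```python
def create_sync_recipe(project, input_dataset, output_dataset, connection):
    """Create a sync recipe reading input_dataset and writing a new output dataset."""
    creator = project.new_recipe("sync")
    creator.with_input(input_dataset)
    # The output dataset is only created when create() is called
    creator.with_new_output(output_dataset, connection)
    return creator.create()
```

The methods are called one by one here instead of being chained, so the sketch does not depend on what each with_* method returns.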
- with_output(output_id, append=False)#
Add an existing object as output to the recipe-to-be-created.
Note
Alias of
with_existing_output()
- class dataikuapi.dss.recipe.VirtualInputsSingleOutputRecipeCreator(type, name, project)#
Create a recipe that has a single output and several inputs.
- with_input(input_id, project_key=None)#
Add an existing object as input to the recipe-to-be-created.
- Parameters:
input_id (string) – name of the dataset
project_key (string) – project containing the object, if different from the one where the recipe is created
- class dataikuapi.dss.recipe.CodeRecipeCreator(name, type, project)#
Create a recipe running a script.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
- with_script(script)#
Set the code of the recipe-to-be-created.
- Parameters:
script (string) – code of the recipe
- with_new_output_dataset(name, connection, type=None, format=None, copy_partitioning_from='FIRST_INPUT', append=False, overwrite=False, **kwargs)#
Create a new managed dataset as output to the recipe-to-be-created.
The dataset is created immediately.
- Parameters:
name (string) – name of the dataset
connection (string) – name of the connection to create the dataset on
type (string) – sub-type of dataset or managed folder, for connections where the type could be ambiguous. Typically applies to SSH connections, where sub-types can be SCP or SFTP
format (string) – name of a format preset relevant for the dataset type. Possible values are: CSV_ESCAPING_NOGZIP_FORHIVE, CSV_UNIX_GZIP, CSV_EXCEL_GZIP, CSV_EXCEL_GZIP_BIGQUERY, CSV_NOQUOTING_NOGZIP_FORPIG, PARQUET_HIVE, AVRO, ORC
partitioning_option_id (string) – to copy the partitioning schema of an existing dataset ‘foo’, pass a value of ‘copy:dataset:foo’. If unset, then the output will be non-partitioned
append (boolean) – whether the recipe should append or overwrite the output when running (note: not available for all dataset types)
overwrite (boolean) – If the dataset being created already exists, overwrite it (and delete data)
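The creator methods above can be combined as in the following sketch, which creates a code recipe from a script held in a string. The function name, the dataset names and the "python" recipe type are assumptions made for illustration; project is assumed to be a DSSProject handle obtained elsewhere.

```python
# Code of the recipe-to-be-created. It runs inside DSS, not here.
SCRIPT = """\
import dataiku

# Read the input dataset and write it unchanged to the output.
df = dataiku.Dataset("input_ds").get_dataframe()
dataiku.Dataset("output_ds").write_with_schema(df)
"""


def create_python_recipe(project, recipe_name, input_name, output_name,
                         connection, script):
    """Create a code recipe with one input and one new managed output.

    Hypothetical helper; "python" as the recipe type is an assumption.
    """
    builder = project.new_recipe("python", recipe_name)
    builder.with_input(input_name)
    # The managed output dataset is declared on the creator before create().
    builder.with_new_output_dataset(output_name, connection)
    builder.with_script(script)
    return builder.create()
```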
- with_new_output_streaming_endpoint(name, connection, format=None, overwrite=False, **kwargs)#
Create a new managed streaming endpoint as output to the recipe-to-be-created.
The streaming endpoint is created immediately.
- Parameters:
name (str) – name of the streaming endpoint to create
connection (str) – name of the connection to create the streaming endpoint on
format (str) – name of a format preset relevant for the streaming endpoint type. Possible values are: json, avro, single (kafka endpoints) or json, string (SQS endpoints). If None, uses the default
overwrite (boolean) – If the streaming endpoint being created already exists, overwrite it
- class dataikuapi.dss.recipe.PythonRecipeCreator(name, project)#
Create a Python recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
A Python recipe can be defined either by its complete code, like a normal Python recipe, or by a function signature.
- with_function_name(module_name, function_name, custom_template=None, **function_args)#
Define this recipe as being a functional recipe calling a function.
With the default template, the function must take as arguments:
A list of dataframes corresponding to the dataframes of the input datasets. If there is only one input, then a single dataframe
Optional named arguments corresponding to arguments passed to the creator as kwargs
The function should then return a list of dataframes, one per recipe output. If there is a single output, it is possible to return a single dataframe rather than a list.
- Parameters:
module_name (string) – name of the module where the function is defined
function_name (string) – name of the function
function_args (kwargs) – additional parameters to the function.
custom_template (string) – template to use to create the code of the recipe. The template is formatted with ‘{fname}’ (function name), ‘{module_name}’ (module name) and ‘{params_json}’ (JSON representation of function_args)
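To illustrate the three placeholders of custom_template, here is a sketch that renders a template locally. It assumes the substitution behaves like Python's str.format; the template text, module name and function name are hypothetical, and the actual rendering is done by the creator, not by this code.

```python
import json

# Hypothetical template: only the three documented placeholders are used.
TEMPLATE = (
    "import json\n"
    "from {module_name} import {fname}\n"
    "\n"
    "params = json.loads('{params_json}')\n"
    "result = {fname}(**params)\n"
)


def render_template(template, module_name, function_name, **function_args):
    """Fill the documented placeholders (assumed str.format-style)."""
    return template.format(
        fname=function_name,
        module_name=module_name,
        # function_args is serialized to JSON before substitution.
        params_json=json.dumps(function_args),
    )


code = render_template(TEMPLATE, "my_lib.cleaning", "drop_outliers",
                       column="price", factor=3)
```

Embedding the JSON between single quotes, as this template does, only works while the arguments contain no single quotes or backslashes; a real template would need a more robust encoding.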
- with_function(fn, custom_template=None, **function_args)#
Define this recipe as being a functional recipe calling a function.
With the default template, the function must take as arguments:
A list of dataframes corresponding to the dataframes of the input datasets. If there is only one input, then a single dataframe
Optional named arguments corresponding to arguments passed to the creator as kwargs
The function should then return a list of dataframes, one per recipe output. If there is a single output, it is possible to return a single dataframe rather than a list.
- Parameters:
fn (string) – function to call
function_args (kwargs) – additional parameters to the function.
custom_template (string) – template to use to create the code of the recipe. The template is formatted with ‘{fname}’ (function name), ‘{module_name}’ (module name) and ‘{params_json}’ (JSON representation of function_args)
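The argument and return conventions of the default template can be sketched with two hypothetical functions. Plain row slicing is used so that the same code runs on a pandas DataFrame and, for a quick local check, on a list standing in for one.

```python
def first_rows(df, n=100):
    """Single input, single output: receive one dataframe, return one."""
    # Row slicing works the same way on a pandas DataFrame and on a list.
    return df[:n]


def split_head_tail(dfs, n=100):
    """Several inputs, two outputs: `dfs` is a list of dataframes."""
    first = dfs[0]
    # One returned item per recipe output.
    return [first[:n], first[n:]]
```

Under these assumptions, a creator call such as builder.with_function(first_rows, n=10) would pass n as a named argument of the function.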
- class dataikuapi.dss.recipe.SQLQueryRecipeCreator(name, project)#
Create a SQL query recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.PrepareRecipeCreator(name, project)#
Create a Prepare recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SyncRecipeCreator(name, project)#
Create a Sync recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SamplingRecipeCreator(name, project)#
Create a Sample/Filter recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.DistinctRecipeCreator(name, project)#
Create a Distinct recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.GroupingRecipeCreator(name, project)#
Create a Group recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- with_group_key(group_key)#
Set a column as the first grouping key.
Only a single grouping key may be set at recipe creation time. To add more grouping keys, get the recipe settings and use GroupingRecipeSettings.add_grouping_key(). To have no grouping keys at all, get the recipe settings and use GroupingRecipeSettings.clear_grouping_keys().
- Parameters:
group_key (string) – name of a column in the input dataset
- Returns:
self
- Return type:
GroupingRecipeCreator
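A minimal sketch of creating a Group recipe with its first grouping key. The function name, its arguments and the "grouping" recipe type are assumptions for illustration; project is assumed to be a DSSProject handle.

```python
def create_grouping_recipe(project, recipe_name, input_name, output_name,
                           connection, key_column):
    """Create a Group recipe keyed on a single column (hypothetical helper)."""
    # "grouping" as the recipe type is an assumption.
    builder = project.new_recipe("grouping", recipe_name)
    builder.with_input(input_name)
    builder.with_new_output(output_name, connection)
    # Only one grouping key can be set at creation time.
    builder.with_group_key(key_column)
    return builder.create()
```

Additional keys would then be added through the recipe settings, as described above.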
- class dataikuapi.dss.recipe.PivotRecipeCreator(name, project)#
Create a Pivot recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SortRecipeCreator(name, project)#
Create a Sort recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.TopNRecipeCreator(name, project)#
Create a TopN recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.WindowRecipeCreator(name, project)#
Create a Window recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.JoinRecipeCreator(name, project)#
Create a Join recipe.
The recipe is created with default joins guessed by matching column names in the inputs.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.FuzzyJoinRecipeCreator(name, project)#
Create a FuzzyJoin recipe
The recipe is created with default joins guessed by matching column names in the inputs.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.GeoJoinRecipeCreator(name, project)#
Create a GeoJoin recipe
The recipe is created with default joins guessed by matching column names in the inputs.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.SplitRecipeCreator(name, project)#
Create a Split recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.StackRecipeCreator(name, project)#
Create a Stack recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.DownloadRecipeCreator(name, project)#
Create a Download recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
- class dataikuapi.dss.recipe.PredictionScoringRecipeCreator(name, project)#
Create a new Prediction scoring recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new prediction scoring recipe outputting to a new dataset
project = client.get_project("MYPROJECT")
builder = project.new_recipe("prediction_scoring", "my_scoring_recipe")
builder.with_input_model("saved_model_id")
builder.with_input("dataset_to_score")
builder.with_new_output("my_output_dataset", "myconnection")
# Or for a filesystem output connection
# builder.with_new_output("my_output_dataset", "filesystem_managed", format="CSV_EXCEL_GZIP")
new_recipe = builder.build()
- with_input_model(model_id)#
Set the input model.
- Parameters:
model_id (string) – identifier of a saved model
- class dataikuapi.dss.recipe.ClusteringScoringRecipeCreator(name, project)#
Create a new Clustering scoring recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new clustering scoring recipe outputting to a new dataset
project = client.get_project("MYPROJECT")
builder = project.new_recipe("clustering_scoring", "my_scoring_recipe")
builder.with_input_model("saved_model_id")
builder.with_input("dataset_to_score")
builder.with_new_output("my_output_dataset", "myconnection")
# Or for a filesystem output connection
# builder.with_new_output("my_output_dataset", "filesystem_managed", format="CSV_EXCEL_GZIP")
new_recipe = builder.build()
- with_input_model(model_id)#
Set the input model.
- Parameters:
model_id (string) – identifier of a saved model
- class dataikuapi.dss.recipe.EvaluationRecipeCreator(name, project)#
Create a new Evaluate recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new evaluation recipe outputting to a new dataset, to a metrics dataset and/or to a model evaluation store
project = client.get_project("MYPROJECT")
builder = project.new_recipe("evaluation")
builder.with_input_model(saved_model_id)
builder.with_input("dataset_to_evaluate")
builder.with_output("output_scored")
builder.with_output_metrics("output_metrics")
builder.with_output_evaluation_store(evaluation_store_id)
new_recipe = builder.build()
# Access the settings
er_settings = new_recipe.get_settings()
payload = er_settings.obj_payload
# Change the settings
payload['dontComputePerformance'] = True
payload['outputProbabilities'] = False
payload['metrics'] = ["precision", "recall", "auc", "f1", "costMatrixGain"]
# Manage evaluation labels
payload['labels'] = [dict(key="label_1", value="value_1"), dict(key="label_2", value="value_2")]
# Save the settings and run the recipe
er_settings.save()
new_recipe.run()
Outputs must exist. They can be created using the following:
# Dataset receiving the scored input
builder = project.new_managed_dataset("output_scored")
builder.with_store_into(connection)
dataset = builder.create()
# Dataset receiving the metrics
builder = project.new_managed_dataset("output_metrics")
builder.with_store_into(connection)
dataset = builder.create()
# Model evaluation store
evaluation_store_id = project.create_model_evaluation_store("output_model_evaluation").mes_id
- with_input_model(model_id)#
Set the input model.
- Parameters:
model_id (string) – identifier of a saved model
- with_output(output_id)#
Set the output dataset containing the scored input.
- Parameters:
output_id (string) – name of the dataset, or identifier of the managed folder or identifier of the saved model
- with_output_metrics(name)#
Set the output dataset containing the metrics.
- Parameters:
name (string) – name of an existing dataset
- with_output_evaluation_store(mes_id)#
Set the output model evaluation store.
- Parameters:
mes_id (string) – identifier of a model evaluation store
- class dataikuapi.dss.recipe.StandaloneEvaluationRecipeCreator(name, project)#
Create a new Standalone Evaluate recipe.
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Usage example:
# Create a new standalone evaluation of a scored dataset
project = client.get_project("MYPROJECT")
builder = project.new_recipe("standalone_evaluation")
builder.with_input("scored_dataset_to_evaluate")
builder.with_output_evaluation_store(evaluation_store_id)
# Add a reference dataset (optional) to compute data drift
builder.with_reference_dataset("reference_dataset")
# Finish creation of the recipe
new_recipe = builder.create()
# Modify the model parameters in the SER settings
ser_settings = new_recipe.get_settings()
payload = ser_settings.obj_payload
payload['predictionType'] = "BINARY_CLASSIFICATION"
payload['targetVariable'] = "Survived"
payload['predictionVariable'] = "prediction"
payload['isProbaAware'] = True
payload['dontComputePerformance'] = False
# For a classification model with probabilities, the 'probas' section can be filled with the mapping of the class and the probability column
# e.g. for a binary classification model with 2 columns: proba_0 and proba_1
class_0 = dict(key=0, value="proba_0")
class_1 = dict(key=1, value="proba_1")
payload['probas'] = [class_0, class_1]
# Change the 'features' settings for this standalone evaluation
# e.g. reject the features that you do not want to use in the evaluation
feature_passengerid = dict(name="Passenger_Id", role="REJECT", type="TEXT")
feature_ticket = dict(name="Ticket", role="REJECT", type="TEXT")
feature_cabin = dict(name="Cabin", role="REJECT", type="TEXT")
payload['features'] = [feature_passengerid, feature_ticket, feature_cabin]
# To set the cost matrix properly, access the 'metricParams' section of the payload and set the cost matrix weights:
payload['metricParams'] = dict(costMatrixWeights=dict(tpGain=0.4, fpGain=-1.0, tnGain=0.2, fnGain=-0.5))
# Save the recipe and run the recipe
# Note that with this method, all the settings that were not explicitly set are instead set to their default value.
ser_settings.save()
new_recipe.run()
Output model evaluation store must exist. It can be created using the following:
evaluation_store_id = project.create_model_evaluation_store("output_model_evaluation").mes_id
- with_output_evaluation_store(mes_id)#
Set the output model evaluation store.
- Parameters:
mes_id (string) – identifier of a model evaluation store
- with_reference_dataset(dataset_name)#
Set the dataset to use as a reference in data drift computation.
- Parameters:
dataset_name (string) – name of a dataset
- class dataikuapi.dss.recipe.ContinuousSyncRecipeCreator(name, project)#
Create a continuous Sync recipe
Important
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_recipe()
instead.
Utilities#
- class dataikuapi.dss.utils.DSSComputedColumn#
- static formula(name, formula, type='double')#
Create a computed column with a formula.
- Parameters:
name (string) – a name for the computed column
formula (string) – formula to compute values, using the GREL language
type (string) – name of a DSS type for the values of the column
- Returns:
a computed column as a dict
- Return type:
dict
- class dataikuapi.dss.utils.DSSFilter#
Helper class to build filter objects for use in visual recipes.
- static of_single_condition(column, operator, string=None, num=None, date=None, time=None, date2=None, time2=None, unit=None)#
Create a simple filter on a column.
Which of the ‘string’, ‘num’, ‘date’, ‘time’, ‘date2’ and ‘time2’ parameters holds the literal to filter against depends on the filter operator.
- Parameters:
column (string) – name of a column to filter (left operand)
operator (string) – type of filter applied to the column, one of the values in the DSSFilterOperator enum
string (string) – string literal for the right operand
num (string) – numeric literal for the right operand
date (string) – date part literal for the right operand
time (string) – time part literal for the right operand
date2 (string) – date part literal for the right operand of BETWEEN_DATE
time2 (string) – time part literal for the right operand of BETWEEN_DATE
unit (string) – date/time rounding for date operations. Possible values are YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND
- static of_and_conditions(conditions)#
Create a filter as an intersection of conditions.
The resulting filter keeps rows that match all the conditions in the list. Conditions are, for example, the output of condition().
- Parameters:
conditions (list) – a list of conditions
- Returns:
a filter, as a dict
- Return type:
dict
- static of_or_conditions(conditions)#
Create a filter as a union of conditions.
The resulting filter keeps rows that match any of the conditions in the list. Conditions are, for example, the output of condition().
- Parameters:
conditions (list) – a list of conditions
- Returns:
a filter, as a dict
- Return type:
dict
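The difference between the two combinators can be illustrated with plain Python predicates standing in for DSS conditions. This sketch does not call dataikuapi and does not reflect how DSS evaluates filters internally; the rows and column names are made up.

```python
def keep_all(row, predicates):
    """Intersection: the row must satisfy every predicate."""
    return all(p(row) for p in predicates)


def keep_any(row, predicates):
    """Union: the row must satisfy at least one predicate."""
    return any(p(row) for p in predicates)


rows = [
    {"country": "FR", "amount": 120},
    {"country": "FR", "amount": 40},
    {"country": "US", "amount": 300},
    {"country": "US", "amount": 10},
]
predicates = [
    lambda r: r["country"] == "FR",
    lambda r: r["amount"] > 100,
]

and_rows = [r for r in rows if keep_all(r, predicates)]  # keeps the first row only
or_rows = [r for r in rows if keep_any(r, predicates)]   # keeps the first three rows
```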
- static of_formula(formula)#
Create a filter that applies a GREL formula.
The resulting filter evaluates the formula and keeps rows for which the formula returns a True value.
- Parameters:
formula (string) – a GREL formula
- Returns:
a filter, as a dict
- Return type:
dict
- static of_sql_expression(sql_expression)#
Create a filter that applies a SQL expression.
The resulting filter evaluates the SQL expression and keeps rows for which the SQL expression returns a True value.
- Parameters:
sql_expression (string) – a SQL expression
- Returns:
a filter, as a dict
- Return type:
dict
- static condition(column, operator, string=None, num=None, date=None, time=None, date2=None, time2=None, unit=None)#
Create a condition on a column for a filter.
Which of the ‘string’, ‘num’, ‘date’, ‘time’, ‘date2’ and ‘time2’ parameters holds the literal to filter against depends on the filter operator.
- Parameters:
column (string) – name of a column to filter (left operand)
operator (string) – type of filter applied to the column, one of the values in the DSSFilterOperator enum
string (string) – string literal for the right operand
num (string) – numeric literal for the right operand
date (string) – date part literal for the right operand
time (string) – time part literal for the right operand
date2 (string) – date part literal for the right operand of BETWEEN_DATE
time2 (string) – time part literal for the right operand of BETWEEN_DATE
unit (string) – date/time rounding for date operations. Possible values are YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND
- class dataikuapi.dss.utils.DSSFilterOperator(value)#
An enumeration.
- EMPTY_ARRAY = 'empty array'#
Test if an array is empty.
- NOT_EMPTY_ARRAY = 'not empty array'#
Test if an array is not empty.
- CONTAINS_ARRAY = 'array contains'#
Test if an array contains a value.
- NOT_EMPTY = 'not empty'#
Test if a value is not empty and not null.
- EMPTY = 'is empty'#
Test if a value is empty or null.
- NOT_EMPTY_STRING = 'not empty string'#
Test if a string is not empty.
- EMPTY_STRING = 'empty string'#
Test if a string is empty.
- IS_TRUE = 'true'#
Test if a boolean is true.
- IS_FALSE = 'false'#
Test if a boolean is false.
- EQUALS_STRING = '== [string]'#
Test if a string is equal to a given value.
- EQUALS_CASE_INSENSITIVE_STRING = '== [string]i'#
Test if a string is equal to a given value, ignoring case.
- NOT_EQUALS_STRING = '!= [string]'#
Test if a string is not equal to a given value.
- SAME = '== [NaNcolumn]'#
Test if two columns have the same value when formatted to string.
- DIFFERENT = '!= [NaNcolumn]'#
Test if two columns have different values when formatted to string.
- EQUALS_NUMBER = '== [number]'#
Test if a number is equal to a given value.
- NOT_EQUALS_NUMBER = '!= [number]'#
Test if a number is not equal to a given value.
- GREATER_NUMBER = '> [number]'#
Test if a number is greater than a given value.
- LESS_NUMBER = '< [number]'#
Test if a number is less than a given value.
- GREATER_OR_EQUAL_NUMBER = '>= [number]'#
Test if a number is greater than or equal to a given value.
- LESS_OR_EQUAL_NUMBER = '<= [number]'#
Test if a number is less than or equal to a given value.
- EQUALS_DATE = '== [date]'#
Test if a date/time is equal to a given date/time (rounded).
- GREATER_DATE = '> [date]'#
Test if a date/time is greater than a given date/time.
- GREATER_OR_EQUAL_DATE = '>= [date]'#
Test if a date/time is greater than or equal to a given date/time.
- LESS_DATE = '< [date]'#
Test if a date/time is less than a given date/time.
- LESS_OR_EQUAL_DATE = '<= [date]'#
Test if a date/time is less than or equal to a given date/time.
- BETWEEN_DATE = '>< [date]'#
Test if a date/time is between two given date/times.
- EQUALS_COL = '== [column]'#
Test if two columns have the same (typed) value.
- NOT_EQUALS_COL = '!= [column]'#
Test if two columns have different (typed) values.
- GREATER_COL = '> [column]'#
Test if one column is greater than another.
- LESS_COL = '< [column]'#
Test if one column is less than another.
- GREATER_OR_EQUAL_COL = '>= [column]'#
Test if one column is greater than or equal to another.
- LESS_OR_EQUAL_COL = '<= [column]'#
Test if one column is less than or equal to another.
- CONTAINS_STRING = 'contains'#
Test if a column contains a given string.
- REGEX = 'regex'#
Test if a column matches a regular expression.
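Many operator values carry a bracketed tag such as [string], [number], [date] or [column]. The sketch below uses a local stand-in enum, copying a few of the values listed above so that it runs without dataikuapi, and extracts that tag. Reading the tag as a hint for which literal parameter to pass (string, num, date and so on) is an assumption, not something the reference states.

```python
import re
from enum import Enum


class FilterOperatorSubset(Enum):
    """Local stand-in copying a few values listed above."""
    NOT_EMPTY = "not empty"
    EQUALS_STRING = "== [string]"
    GREATER_NUMBER = "> [number]"
    BETWEEN_DATE = ">< [date]"
    EQUALS_COL = "== [column]"
    CONTAINS_STRING = "contains"


def operand_tag(operator):
    """Return the bracketed tag of an operator value, or None if there is none."""
    match = re.search(r"\[(\w+)\]", operator.value)
    return match.group(1) if match else None


# Standard Enum lookup by value.
contains = FilterOperatorSubset("contains")
```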