Creating a plugin Recipe component#

Prerequisites#

  • Dataiku >= 12.0

  • Access to a Dataiku instance with the “Develop plugins” permission

  • Access to an existing project with the following permissions:
    • “Read project content”

    • “Write project content”

Introduction#

Creating a custom recipe component is an easy way to add a new recipe to Dataiku via plugins. Doing so will make your recipe easily accessible alongside standard ones in the Plugin recipes section.

All recipes in Dataiku share the same essential function: taking data as input and producing new data as output. Recipes can operate on three types of data: datasets, managed folders, or saved models, and a single recipe can even handle several types at once. Beyond this, recipes can be further customized to suit specific needs by requesting user input before the recipe runs. A plugin recipe is made of two files: a code file and a configuration file.

Recipe creation#

You will create a recipe for this tutorial and convert it to a plugin component. As this tutorial focuses only on recipe creation, let’s re-use the default code provided by Dataiku when you create a Python recipe.

  1. Create the cars dataset by uploading this CSV file (or you can use any existing dataset).

  2. Select the Python recipe from the Code recipes action panel.

  3. Create a new output dataset named cars_copy.

  4. Then click on the Create Recipe button.

Let’s create a plugin from this new recipe:

  1. Still in the recipe code editor, from the Actions panel, click Convert to plugin.

  2. Choose the most suitable option for you (either New plugin or Existing dev plugin).

  3. Enter plugin-datasets-copy in the Plugin id field.

  4. Enter datasets-copy in the New plugin recipe id field.

    Note

    Several recipes can be nested in one plugin. Choose a name that clearly identifies this recipe among the others.

  5. Click on the Convert button.

Configuration File: recipe.json#

Code 1 is the default configuration file generated by Dataiku DSS for the custom recipe. This file is organized into four main sections: meta, inputRoles, outputRoles, and params.

  • The meta section holds the global description of the recipe (label, description, icon).

  • The inputRoles section declares the inputs the recipe accepts.

  • The outputRoles section declares the outputs the recipe produces.

  • The params section defines the parameters the user can set before running the recipe.

Code 1: recipe.json#
// This file is the descriptor for the Custom code recipe datasets-copy
{
  // Meta data for display purposes
  "meta": {
    // label: name of the recipe as displayed, should be short
    "label": "Datasets copy",
    // description: longer string to help end users understand what this recipe does
    "description": "",
    // icon: must be one of the FontAwesome 3.2.1 icons, complete list here at https://fontawesome.com/v3.2.1/icons/
    "icon": "icon-puzzle-piece"
  },
  "kind": "PYTHON",
  // Inputs and outputs are defined by roles. In the recipe's I/O tab, the user can associate one
  // or more dataset to each input and output role.

  // The "arity" field indicates whether the user can associate several datasets to the role ('NARY')
  // or at most one ('UNARY'). The "required" field indicates whether the user is allowed to
  // associate no dataset with the role.

  "inputRoles": [
    {
      "name": "input_A_role",
      "label": "input A displayed name",
      "description": "what input A means",
      "arity": "UNARY",
      "required": true,
      "acceptsDataset": true
    },
    {
      "name": "input_B_role",
      "label": "input B displayed name",
      "description": "what input B means",
      "arity": "NARY",
      "required": false,
      "acceptsDataset": true
      // ,'mustBeSQL': true
      // ,'mustBeStrictlyType':'HDFS'
    }
    // ...
  ],
  "outputRoles": [
    {
      "name": "main_output",
      "label": "main output displayed name",
      "description": "what main output means",
      "arity": "UNARY",
      "required": false,
      "acceptsDataset": true
    },
    {
      "name": "errors_output",
      "label": "errors output displayed name",
      "description": "what errors output means",
      "arity": "UNARY",
      "required": false,
      "acceptsDataset": true
    }
    // ...
  ],
  /* The field "params" holds a list of all the params
     for which the user will be prompted for values in the Settings tab of the recipe.

     The available parameter types include:
     STRING, STRINGS, INT, DOUBLE, BOOLEAN, SELECT, MULTISELECT, MAP, TEXTAREA, PRESET, COLUMN, COLUMNS

     For the full list and for more details, see the documentation: https://doc.dataiku.com/dss/latest/plugins/reference/params.html
  */

  "params": [
    {
      "name": "parameter1",
      "label": "User-readable label",
      "type": "STRING",
      "description": "Some documentation for parameter1",
      "mandatory": true
    },
    {
      "name": "parameter2",
      "type": "INT",
      "defaultValue": 42
      /* Note that standard json parsing will return it as a double in Python (instead of an int), so you need to write
         int(get_recipe_config()['parameter2'])
      */
    },
    // A "SELECT" parameter is a multi-choice selector. Choices are specified using the selectChoice field
    {
      "name": "parameter3",
      "type": "SELECT",
      "selectChoices": [
        {
          "value": "val_x",
          "label": "display name for val_x"
        },
        {
          "value": "val_y",
          "label": "display name for val_y"
        }
      ]
    },
    // A 'COLUMN' parameter is a string, whose value is a column name from an input schema.
    // To specify the input schema whose column names are used, use the "columnRole" field like below.
    // The column names will come from the schema of the first dataset associated to that role.
    {
      "name": "parameter4",
      "type": "COLUMN",
      "columnRole": "input_B_role"
    }

    // The 'COLUMNS' type works in the same way, except that it is a list of strings.
  ],
  // The field "resourceKeys" holds a list of keys that allows to limit the number
  // of concurrent executions and activities triggered by this recipe.
  //
  // Administrators can configure the limit per resource key in the Administration > Settings > Flow build
  // screen.

  "resourceKeys": []
}

Meta configuration#

In the "meta" section, you will define the following:

  • "description": the recipe’s purpose.

  • "icon": the icon to represent it.

  • "label": the name of the recipe

Using the definition shown in Code 2,

Code 2: recipe.json, "meta" section.#
    "meta": {
        "label": "Datasets copy",
        "description": "Duplicate a dataset",
        "icon": "icon-copy"
    },

Input and output configuration#

Each recipe can take one or more inputs; the "inputRoles" section is where you define them. "inputRoles" is an array, and each object in this array describes one input.

For each object, you have to define the following:

  • "name": the name of the variable in the associated code.

  • "label": title for the input in the UI.

  • "description": description of what this input is for (displayed in the UI when the user requires it or in the “Run” screen).

  • "arity": choice between two values: "UNARY" or "NARY", meaning that this input is composed of one or multiple inputs.

  • "required": whether this input is required or not.

  • "acceptsDataset": whether this input takes a dataset as an input (optional, true by default).

  • "acceptsManagedFolder": whether this input takes a managed folder as an input (optional, false by default).

  • "acceptsSavedModel": whether a saved model can be used for this input (optional, false by default).

For example, if you need only one input, as is the case here, you can define it as shown in Code 3.

Code 3: recipe.json, "inputRoles"" section.#
    "inputRoles": [
        {
            "name": "dataset_to_copy",
            "label": "Dataset to copy",
            "description": "Which dataset you want to copy",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],

The "outputRoles" section serves the same purpose as the "inputRoles" section, except it is dedicated to defining the recipe outputs.

Params configuration#

In this section, you define all the parameters your recipe needs to run. Refer to this tutorial, which provides more information on the params section.
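
This tutorial’s recipe does not need any parameter, so the "params" section stays empty in Code 4. As an illustration only, a hypothetical optional parameter limiting the number of copied rows could be declared as sketched below (the name nb_rows_to_copy is purely illustrative; the fields are the same as those used in Code 1):

    "params": [
        {
            "name": "nb_rows_to_copy",
            "label": "Number of rows to copy",
            "type": "INT",
            "description": "Copy only the first N rows of the input dataset",
            "defaultValue": 1000,
            "mandatory": false
        }
    ],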

Complete code#

Code 4 shows the complete code of the configuration file for the recipe.

Code 4: recipe.json, "inputRoles"" section.#
{
    "meta": {
        "label": "Datasets copy",
        "description": "Duplicate a dataset",
        "icon": "icon-copy"
    },
    "inputRoles": [
        {
            "name": "dataset_to_copy",
            "label": "Dataset to copy",
            "description": "Which dataset you want to copy",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],
    "selectableFromDataset": "dataset_to_copy",
    "outputRoles": [
        {
            "name": "copied_dataset",
            "label": "Result dataset",
            "description": "This is where the dataset is copied.",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],
    "kind": "PYTHON",
    "params": [
    ],
    "resourceKeys": []

}

The highlighted "selectableFromDataset" parameter allows the recipe to appear in the plugin list, in the action panel, when selecting a dataset.

Warning

If you omit it, the user has to go into the +Recipe button menu to find your plugin, select it, and then select the custom recipe.

If you want your recipe to appear when selecting a managed folder or a saved model, you should use "selectableFromFolder" or "selectableFromSavedModel", respectively.

Note

If your recipe should take a managed folder or a saved model instead of a dataset, replace the "acceptsDataset": true parameter with the corresponding "acceptsManagedFolder" or "acceptsSavedModel" flag in the "inputRoles" section.
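
For instance, a hypothetical recipe copying the contents of a managed folder rather than a dataset could declare its input along these lines (a sketch only; the names are illustrative):

    "inputRoles": [
        {
            "name": "folder_to_copy",
            "label": "Folder to copy",
            "description": "Which managed folder you want to copy",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": false,
            "acceptsManagedFolder": true
        }
    ],
    "selectableFromFolder": "folder_to_copy",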

Code File: recipe.py#

If you look at the generated code (Code 5), you will see two blocks. One is the starter code generated by Dataiku DSS, and the other block is your original recipe (the highlighted lines). You need to adapt your code to fit the custom recipe pattern.

Code 5: generated Python code#
# Code for custom code recipe datasets-copy (imported from a Python recipe)

# To finish creating your custom recipe from your original Python recipe, you need to:
#  - Declare the input and output roles in recipe.json
#  - Replace the dataset names by role access in your code
#  - Declare, if any, the params of your custom recipe in recipe.json
#  - Replace the hardcoded params values by access to the configuration map

# See sample code below for how to do that.
# The code of your original recipe is included afterwards for convenience.
# Please also see the "recipe.json" file for more information.

# import the classes for accessing DSS objects from the recipe
import dataiku
# Import the helpers for custom recipes
from dataiku.customrecipe import get_input_names_for_role
from dataiku.customrecipe import get_output_names_for_role
from dataiku.customrecipe import get_recipe_config

# Inputs and outputs are defined by roles. In the recipe's I/O tab, the user can associate one
# or more dataset to each input and output role.
# Roles need to be defined in recipe.json, in the inputRoles and outputRoles fields.

# To retrieve the datasets of an input role named 'input_A' as an array of dataset names:
input_A_names = get_input_names_for_role('input_A_role')
# The dataset objects themselves can then be created like this:
input_A_datasets = [dataiku.Dataset(name) for name in input_A_names]

# For outputs, the process is the same:
output_A_names = get_output_names_for_role('main_output')
output_A_datasets = [dataiku.Dataset(name) for name in output_A_names]

# The configuration consists of the parameters set up by the user in the recipe Settings tab.

# Parameters must be added to the recipe.json file so that DSS can prompt the user for values in
# the Settings tab of the recipe. The field "params" holds a list of all the params for which the
# user will be prompted for values.

# The configuration is simply a map of parameters, and retrieving the value of one of them is simply:
my_variable = get_recipe_config()['parameter_name']

# For optional parameters, you should provide a default value in case the parameter is not present:
my_variable = get_recipe_config().get('parameter_name', None)

# Note about typing:
# The configuration of the recipe is passed through a JSON object
# As such, INT parameters of the recipe are received in the get_recipe_config() dict as a Python float.
# If you absolutely require a Python int, use int(get_recipe_config()["my_int_param"])


#############################
# Your original recipe
#############################

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
cars = dataiku.Dataset("cars")
cars_df = cars.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.

cars_copy_df = cars_df  # For this sample code, simply copy input to output

# Write recipe outputs
cars_copy = dataiku.Dataset("cars_copy")
cars_copy.write_with_schema(cars_copy_df)

The first thing you need to do is change your recipe’s input and output. Previously, your recipe acted on a known, hardcoded dataset. As you turn it into a custom recipe, you should instead retrieve the dataset the user associated with the role defined in the "inputRoles" section of the recipe.json file.

In this file, you have defined dataset_to_copy as the name of the input role, so you should now retrieve the name of the dataset with the code shown in Code 6.

Code 6: How to access the dataset specified by the user.#
# To retrieve the datasets of an input role named 'dataset_to_copy' as an array of dataset names:
datasets_to_copy = get_input_names_for_role('dataset_to_copy')
# The two lines below show two different ways to access the wanted dataset
# dataset_to_copy = [dataiku.Dataset(name) for name in datasets_to_copy][0]
dataset_to_copy = dataiku.Dataset(datasets_to_copy[0])

The function dataiku.customrecipe.get_input_names_for_role() returns an array of the names the user selected when filling the form (L. 2), because an input role can be defined as "NARY". As your input role (dataset_to_copy) is "UNARY", you can assume its response contains a single name (L. 5), so you can directly access the dataset.

The same process can be applied to the “outputRoles”.

Once you have your two datasets (input and output), you must adapt your original code, as shown in Code 7.

Code 7: Complete code of the custom recipe.#
# import the classes for accessing DSS objects from the recipe
import dataiku
# Import the helpers for custom recipes
from dataiku.customrecipe import get_input_names_for_role
from dataiku.customrecipe import get_output_names_for_role
from dataiku.customrecipe import get_recipe_config

# To retrieve the datasets of an input role named 'dataset_to_copy' as an array of dataset names:
datasets_to_copy = get_input_names_for_role('dataset_to_copy')
# The two lines below show two different ways to access the wanted dataset
# dataset_to_copy = [dataiku.Dataset(name) for name in datasets_to_copy][0]
dataset_to_copy = dataiku.Dataset(datasets_to_copy[0])

# For outputs, the process is the same:
copied_datasets = get_output_names_for_role('copied_dataset')
copied_dataset = [dataiku.Dataset(name) for name in copied_datasets][0]

# Using the input dataset
dataset_to_copy_df = dataset_to_copy.get_dataframe()

# Your algorithm
copied_dataset_df = dataset_to_copy_df

# Using the output dataset
copied_dataset.write_with_schema(copied_dataset_df)

Wrapping up#

You have completed this tutorial and built your first custom recipe. Understanding all these basic concepts allows you to create more complex custom recipes.

For instance, you could add a parameter to your recipe to copy only a subset of the initial dataset; a minimal sketch of this idea follows. This tutorial could also give you helpful ideas, as it walks through a more complex custom recipe dealing with parameters and a more substantial algorithm.
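
Here is a minimal sketch of what that could look like, assuming the hypothetical nb_rows_to_copy parameter sketched in the Params configuration section; everything else reuses the calls already shown in Code 5 and Code 7:

# Minimal sketch: copy only the first N rows of the input dataset.
# Assumes a hypothetical optional INT parameter named "nb_rows_to_copy" in recipe.json.
import dataiku
from dataiku.customrecipe import get_input_names_for_role
from dataiku.customrecipe import get_output_names_for_role
from dataiku.customrecipe import get_recipe_config

# Retrieve the input and output datasets, exactly as in Code 7
dataset_to_copy = dataiku.Dataset(get_input_names_for_role('dataset_to_copy')[0])
copied_dataset = dataiku.Dataset(get_output_names_for_role('copied_dataset')[0])

# Optional parameter: fall back to copying everything when it is not set
nb_rows_to_copy = get_recipe_config().get('nb_rows_to_copy', None)

dataset_to_copy_df = dataset_to_copy.get_dataframe()
if nb_rows_to_copy is not None:
    # The value may come through as a float (see the typing note in Code 5), so cast it
    dataset_to_copy_df = dataset_to_copy_df.head(int(nb_rows_to_copy))

copied_dataset.write_with_schema(dataset_to_copy_df)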

Here is the complete version of the code presented in this tutorial:

recipe.json
{
    "meta": {
        "label": "Datasets copy",
        "description": "Duplicate a dataset",
        "icon": "icon-copy"
    },
    "inputRoles": [
        {
            "name": "dataset_to_copy",
            "label": "Dataset to copy",
            "description": "Which dataset you want to copy",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],
    "selectableFromDataset": "dataset_to_copy",
    "outputRoles": [
        {
            "name": "copied_dataset",
            "label": "Result dataset",
            "description": "This is where the dataset is copied.",
            "arity": "UNARY",
            "required": true,
            "acceptsDataset": true
        }
    ],
    "kind": "PYTHON",
    "params": [
    ],
    "resourceKeys": []

}
recipe.py
# import the classes for accessing DSS objects from the recipe
import dataiku
# Import the helpers for custom recipes
from dataiku.customrecipe import get_input_names_for_role
from dataiku.customrecipe import get_output_names_for_role
from dataiku.customrecipe import get_recipe_config

# To retrieve the datasets of an input role named 'dataset_to_copy' as an array of dataset names:
datasets_to_copy = get_input_names_for_role('dataset_to_copy')
# The two lines below show two different ways to access the wanted dataset
# dataset_to_copy = [dataiku.Dataset(name) for name in datasets_to_copy][0]
dataset_to_copy = dataiku.Dataset(datasets_to_copy[0])

# For outputs, the process is the same:
copied_datasets = get_output_names_for_role('copied_dataset')
copied_dataset = [dataiku.Dataset(name) for name in copied_datasets][0]

# Using the input dataset
dataset_to_copy_df = dataset_to_copy.get_dataframe()

# Your algorithm
copied_dataset_df = dataset_to_copy_df

# Using the output dataset
copied_dataset.write_with_schema(copied_dataset_df)