Creating a plugin Processor component#

In this tutorial, you will learn how to create your own plugins by developing a preparation processor (or Prepare recipe step) that hides certain words in a dataset’s column.

Prerequisites#

This lesson assumes that you have:

  • A basic understanding of coding in Dataiku.

  • Dataiku version 12.0 or above (the free edition is compatible).

  • A Python code environment that includes the matplotlib package.

    Note

    This tutorial was tested using a Python 3.9 code environment; other Python versions may be compatible.

Creating a plugin preparation processor component#

A preparation processor provides a visual user interface for implementing a Python function as a Prepare recipe step. The processor built in this tutorial hides certain words that appear in a dataset’s column.

Tip

Preparation processors work on each row independently; therefore, only Python functions that perform row-wise operations are valid. A function that computes a column aggregate, for example, needs every row at once and is not a suitable preparation processor.

Also, preparation in Dataiku should be interactive. Thus, the code that is executed should be fast.
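To make the distinction in this tip concrete, here is a minimal sketch in plain Python (outside Dataiku). It assumes a row is a dictionary keyed by column name; the function names and the price column are hypothetical and only serve the illustration.

```python
# Row-wise: the result depends only on the single row passed in,
# so this kind of function fits a preparation processor.
def mask_word(row):
    return row["name"].replace("dodge", "****")


# Column aggregate: needs all rows at once,
# so it cannot be expressed as a preparation processor.
def mean_price(rows):
    return sum(r["price"] for r in rows) / len(rows)


print(mask_word({"name": "dodge dart", "price": 10}))
print(mean_price([{"price": 10}, {"price": 20}]))
```

The first function can be evaluated for any row without knowing the others, which is what lets the Prepare recipe stay interactive.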

To create a plugin preparation processor component:

  1. From the Application menu, select Plugins.

  2. Select Add Plugin > Write your own.

  3. Give the new plugin a name like hiding-words-processor and click Create.

  4. Click +Create Your First Component and select Preparation Processor.

  5. Give the new component an identifier and click Add.

The preparation processor plugin component comprises two files: a configuration file (processor.json) and a code file (processor.py). Both are generated with sample code that you need to modify.

Editing the JSON descriptor#

The JSON file contains the metadata ("meta") and the parameters ("params") that the user must specify for the plugin. For our example, the JSON can be modified as follows:

processor.json#
/* This file is the descriptor for the Custom Python step: Hide Text */
{
    
    "meta" : {
        // label: name of the data prep step as displayed, should be short
        "label": "Hide text",

        // description: longer string to help end users understand what this data prep step does
        "description": "Hides words that appear in text.",

        // icon: must be one of the FontAwesome 3.2.1 icons, complete list here at https://fontawesome.com/v3.2.1/icons/
        "icon": "icon-asterisk"
    },

    /*
     * the processor mode, dictating what output is expected:
     * - CELL : the code outputs a value
     * - ROW : the code outputs a row
     * - ROWS : the code outputs an array of rows
     */
    "mode": "CELL",

    /* params:
    Dataiku will generate a form from this list of requested parameters.
    Your component code can then access the value provided by users using the "name" field of each parameter.

    Available parameter types include:
    STRING, INT, DOUBLE, BOOLEAN, DATE, SELECT, TEXTAREA, MAP, PRESET and others.

    For the full list and for more details, see the documentation: https://doc.dataiku.com/dss/latest/plugins/reference/params.html
    */
    "params": [
        {
            "name": "input_column",
            "label": "Input column",
            "type": "COLUMN",
            "description": "Column containing the text to be processed.",
            "columnRole": "main",
            "mandatory": true
        }
    ],
    
    "useKernel" : true
}

For more information on the customizations in the JSON file, see Component: Preparation Processor in the reference documentation.
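The three values of "mode" described in the descriptor differ only in what the Python function returns. The sketch below puts them side by side in plain Python. It assumes a row is a dictionary keyed by column name; the function names process_cell, process_row, and process_rows are hypothetical and used only so the variants can coexist in one file (in a real processor the function is named process).

```python
HIDDEN = {"dodge", "peugeot", "volkswagen"}


def hide(text):
    # Replace each listed word with "****", ignoring case
    return " ".join("****" if w.casefold() in HIDDEN else w for w in text.split(" "))


def process_cell(row):
    # CELL: return a single value, which goes into the step's output column
    return hide(row["name"])


def process_row(row):
    # ROW: return the whole row, here a modified copy
    new_row = dict(row)
    new_row["name"] = hide(row["name"])
    return new_row


def process_rows(row):
    # ROWS: return a list of rows, here the original and the masked version
    return [row, process_row(row)]


sample = {"name": "Dodge Dart", "year": 1970}
print(process_cell(sample))
print(process_row(sample))
print(len(process_rows(sample)))
```

This tutorial uses CELL because the step only needs to produce one new value per row.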

Editing the Python code#

We want the processor to hide certain words that appear in our dataset in a case-insensitive manner. To do that, we can use a Python function that takes each row of the dataset as a parameter and returns the modified value.

processor.py#
# Note: this processor hides specific words appearing in text in a case-insensitive manner.
# Fill the text_to_hide list with the words you want to hide.

def process(row):

    # List of words to hide (keep them lowercase so the case-insensitive comparison works)
    text_to_hide = ["dodge", "peugeot", "volkswagen"]
    
    # Retrieve the user-defined input column
    text_column = params["input_column"]

    # Leave empty cells untouched (assumption: they arrive as None or "")
    text = row[text_column]
    if not text:
        return text

    # Replace each listed word with "****"
    text_list = text.split(" ")
    text_list_hide = [w if w.casefold() not in text_to_hide else "****" for w in text_list]
    
    return " ".join(text_list_hide)

The text_to_hide list defines the words that the plugin will hide. Adapt it to your needs.
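Before installing the step in a recipe, the function can be tried on a few rows in a plain Python session. The sketch below is one way to do that, under two assumptions: params is a module-level dictionary (here a stand-in for the one Dataiku supplies at runtime), and each row is a dictionary keyed by column name. The sample rows are invented for illustration and are not taken from the cars dataset.

```python
# Stand-in for the params object that Dataiku provides to the processor
params = {"input_column": "name"}


def process(row):
    text_to_hide = ["dodge", "peugeot", "volkswagen"]
    text_column = params["input_column"]
    text_list = row[text_column].split(" ")
    text_list_hide = [w if w.casefold() not in text_to_hide else "****" for w in text_list]
    return " ".join(text_list_hide)


# Invented sample rows, mixing upper and lower case
sample_rows = [
    {"name": "Dodge Challenger"},
    {"name": "peugeot 504"},
    {"name": "ford torino"},
]

results = [process(r) for r in sample_rows]
print(results)
# ['**** Challenger', '**** 504', 'ford torino']
```

Because the comparison uses casefold(), "Dodge" and "dodge" are both hidden, while words outside the list pass through unchanged.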

Testing the preparation processor#

First, prepare your dataset and the recipe:

  1. Create the cars dataset by uploading this CSV file.

  2. In the Flow, select the cars dataset, and click the Actions icon (+) in the right panel to open the Actions tab.

  3. Under the Visual Recipes section, click on Prepare. The New data preparation recipe window opens.

  4. Keep cars as the input dataset.

  5. Name the output dataset cars_prepared.

  6. Click Create Recipe.

Then, you can test the processor component from the Prepare recipe in the Flow.

  1. In the Prepare recipe, click +Add a New Step.

  2. Search for Hide text and select it.

  3. Configure the processor as follows:

    • Output column: name_with_hidden_words

    • Input column: name

  4. Scroll across the Preview page to view the output column name_with_hidden_words with text hidden (that is, replaced with “****”).

  5. Run the Prepare recipe.
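When reviewing the output, keep in mind that the processor splits text on single spaces, so a listed word attached to punctuation (for example "Dodge,") would not be hidden. If that matters for your data, one possible variant uses a regular expression with word boundaries. This is a sketch of an alternative technique, not the tutorial's method; the hide_words name is hypothetical.

```python
import re

text_to_hide = ["dodge", "peugeot", "volkswagen"]

# \b word boundaries keep "dodge" from matching inside a longer word,
# and re.escape protects any special characters in the listed words.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, text_to_hide)) + r")\b",
    re.IGNORECASE,
)


def hide_words(text):
    # Replace every whole-word match with "****", leaving punctuation in place
    return pattern.sub("****", text)


print(hide_words("Dodge, Peugeot 504"))
print(hide_words("dodgeball"))
```

Inside a processor, the body of process would call hide_words on the value of the input column instead of splitting on spaces.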

What’s next?#

In this lesson, you have learned how to create a custom preparation processor for Prepare recipes. You can check other tutorials on plugins to see how to package and share more components.
