Creating a plugin Processor component#

In this tutorial, you will learn how to create your own plugins by developing a preparation processor (or Prepare recipe step) that hides certain words in a dataset’s column.

Prerequisites#

This lesson assumes that you have:

  • A basic understanding of coding in Dataiku.

  • Dataiku version 12.0 or above (the free edition is compatible).

  • A Python code environment that includes the matplotlib package.

    Note

    This tutorial was tested using a Python 3.8 code environment; other Python versions may be compatible.

Creating a plugin preparation processor component#

A preparation processor provides a visual user interface for implementing a Python function as a Prepare recipe step. The processor built in this tutorial hides certain words that appear in a dataset’s column.

Tip

Preparation processors work on rows independently; therefore, only Python functions that perform row-wise operations are valid. If your Python function computes column aggregates, for example, it is not suitable as a preparation processor.

Also, preparation in Dataiku should be interactive. Thus, the code that is executed should be fast.
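
To make the distinction concrete, here is a minimal sketch in plain Python, run outside Dataiku. The sample rows and the function names uppercase_text and price_minus_mean are hypothetical and only serve as illustration. The first function depends solely on the row it receives, so it fits the row-wise model; the second needs every row to compute a mean, so it does not.

```python
# Hypothetical sample rows; each row is represented as a plain dictionary.
rows = [
    {"text": "red apple", "price": 2.0},
    {"text": "green pear", "price": 4.0},
]

def uppercase_text(row):
    """Row-wise: the result depends only on the row passed in."""
    return row["text"].upper()

def price_minus_mean(row, all_rows):
    """Not row-wise: every row is needed to compute the column mean."""
    mean_price = sum(r["price"] for r in all_rows) / len(all_rows)
    return row["price"] - mean_price

# The first function can be applied to each row in isolation.
row_wise = [uppercase_text(r) for r in rows]

# The second one cannot: it has to see the whole dataset for each row.
needs_all = [price_minus_mean(r, rows) for r in rows]
```

Logic of the second kind is usually better placed in a recipe that has access to the whole dataset.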

To create a plugin preparation processor component:

  1. From the Application menu, select Plugins.

  2. Select Add Plugin > Write your own.

  3. Give the new plugin a name like hiding-words-processor and click Create.

  4. Click +Create Your First Component and select Preparation Processor.

  5. Give the new component an identifier and click Add.

The preparation processor plugin component comprises two files: a configuration file (processor.json) and a code file (processor.py). Both are generated with sample code that you need to modify.

Editing the JSON descriptor#

The JSON file contains the metadata ("meta") and the parameters ("params") that the user specifies when configuring the plugin. For our example, the JSON can be modified as follows:

processor.json#
/* This file is the descriptor for the Custom Python step: Hide Text */
{
    
    "meta" : {
        // label: name of the data prep step as displayed, should be short
        "label": "Hide text",

        // description: longer string to help end users understand what this data prep step does
        "description": "Hides words that appear in text.",

        // icon: must be one of the FontAwesome 3.2.1 icons, complete list here at https://fontawesome.com/v3.2.1/icons/
        "icon": "icon-asterisk"
    },

    /*
     * the processor mode, dictating what output is expected:
     * - CELL : the code outputs a value
     * - ROW : the code outputs a row
     * - ROWS : the code outputs an array of rows
     */
    "mode": "CELL",

    /* params:
    DSS will generate a form from this list of requested parameters.
    Your component code can then access the value provided by users using the "name" field of each parameter.

    Available parameter types include:
    STRING, INT, DOUBLE, BOOLEAN, DATE, SELECT, TEXTAREA, MAP, PRESET and others.

    For the full list and for more details, see the documentation: https://doc.dataiku.com/dss/latest/plugins/reference/params.html
    */
    "params": [
        {
            "name": "input_column",
            "label": "Input column",
            "type": "COLUMN",
            "description": "Column containing the text to be processed.",
            "columnRole": "main",
            "mandatory": true
        }
    ],
    
    "useKernel" : true
}

For more information on the customizations in the JSON file, see Component: Preparation Processor in the reference documentation.
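
The "mode" field determines what the process function is expected to return. The sketch below illustrates the three output shapes in plain Python. It is based only on the comments in the descriptor above; the function names, the sample row, and the exact row format are assumptions rather than Dataiku API code.

```python
# Hypothetical illustration of the three output shapes; not Dataiku API code.

def process_cell(row):
    """CELL: return a single value, here the length of the text."""
    return len(row["text"])

def process_row(row):
    """ROW: return one row, here a copy of the input with an extra column."""
    new_row = dict(row)
    new_row["text_length"] = len(row["text"])
    return new_row

def process_rows(row):
    """ROWS: return a list of rows, here one row per word."""
    return [{"word": w} for w in row["text"].split(" ")]

# Assumed representation of a row: a dictionary keyed by column name.
sample = {"text": "hello big world"}

cell_value = process_cell(sample)
single_row = process_row(sample)
many_rows = process_rows(sample)
```

This tutorial uses CELL because the processor produces a single value, the text with hidden words, for each input row.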

Editing the Python code#

We want the processor to hide certain words that appear in our dataset in a case-insensitive manner. To do that, we can use a Python function that takes each row of the dataset as a parameter and, because the descriptor sets the mode to CELL, returns the text with the words hidden as a single value.

processor.py#
# Note: this processor hides specific words appearing in text in a case-insensitive manner.
# You can fill the text_to_hide list with the words you want to hide.

def process(row):

    # List of words to hide (casefolded so that matching is case-insensitive)
    text_to_hide = ["example_1", "example_2", "to_fill"]
    text_to_hide = [w.casefold() for w in text_to_hide]

    # Retrieve the user-defined input column
    # (params is not defined in this file; it is expected to be supplied by Dataiku at runtime)
    text_column = params["input_column"]

    # Replace each word found in the list with "****"
    text_list = row[text_column].split(" ")
    text_list_hide = [w if w.casefold() not in text_to_hide else "****" for w in text_list]

    # In CELL mode, the returned value is the content of the output cell
    return " ".join(text_list_hide)
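
Before testing in Dataiku, it can help to check the word-hiding logic locally. The following is a minimal sketch under stated assumptions: the params dictionary is a stand-in for the value Dataiku supplies at runtime, the column name comment and the sample rows are hypothetical, and each row is assumed to behave like a dictionary keyed by column name.

```python
# Stand-in for the parameters Dataiku would supply at runtime (assumption).
params = {"input_column": "comment"}

def process(row):
    # Words to hide, casefolded so that matching is case-insensitive
    text_to_hide = ["example_1", "example_2", "to_fill"]
    text_to_hide = [w.casefold() for w in text_to_hide]

    # Column selected by the user in the processor form
    text_column = params["input_column"]

    # Split on single spaces and mask every word found in the list
    text_list = row[text_column].split(" ")
    text_list_hide = [w if w.casefold() not in text_to_hide else "****" for w in text_list]

    return " ".join(text_list_hide)

# Hypothetical sample rows
sample_rows = [
    {"comment": "keep Example_1 and EXAMPLE_2"},
    {"comment": "nothing to hide here"},
]
results = [process(r) for r in sample_rows]

# A word followed by punctuation is a different token and is not hidden.
with_punctuation = process({"comment": "please to_fill."})
```

Because the text is split on spaces only, a listed word attached to punctuation (such as "to_fill.") is left unchanged; this is a limitation of the simple matching used here.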

Test the preparation processor#

You can test the processor component from the Prepare recipe in the Flow.

  1. Refresh the Flow of your project.

  2. Open the Prepare recipe and click +Add a New Step.

  3. Search for Hide text and select it.

  4. Configure the processor as follows:

    • Output column: Hidden_Text_Column

    • Input column: Column_With_Text_To_Hide

  5. Scroll across the Preview page to view the output column Hidden_Text_Column with text hidden (that is, replaced with “****”).

  6. Run the Prepare recipe and Update Schema.

What’s next?#

In this lesson, you have learned how to create a custom preparation processor for Prepare recipes. You can check the other tutorials on plugins to see how to package and share more components.
