Creating a plugin Processor component#
In this tutorial, you will learn how to create your own plugins by developing a preparation processor (or Prepare recipe step) that hides certain words in a dataset’s column.
Prerequisites#
This lesson assumes that you have:
A basic understanding of coding in Dataiku.
Dataiku version 12.0 or above (the free edition is compatible).
A Python code environment that includes the matplotlib package.
Note
This tutorial was tested using a Python 3.8 code environment; other Python versions may be compatible.
Creating a plugin preparation processor component#
A preparation processor provides a visual user interface for implementing a Python function as a Prepare recipe step. The processor will hide certain words that appear in a dataset’s column.
Tip
Preparation processors work on rows independently; therefore, only Python functions that perform row-wise operations are valid. If your Python function computes column aggregates, for example, it won’t work as a preparation processor.
Also, preparation in Dataiku should be interactive. Thus, the code that is executed should be fast.
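As a rough illustration of this distinction, the sketch below contrasts a row-wise function, which only needs the current row, with a column aggregate, which needs every row at once. The function names and the sample rows are hypothetical and are not part of the Dataiku API.

```python
# Hypothetical sample rows, each represented as a dictionary of column values.
rows = [
    {"text": "red apple", "price": "2"},
    {"text": "green pear", "price": "4"},
]


def upper_text(row):
    """Row-wise: the result depends only on the row passed in."""
    return row["text"].upper()


def mean_price(all_rows):
    """Column aggregate: the result depends on every row in the dataset."""
    return sum(float(r["price"]) for r in all_rows) / len(all_rows)


# The row-wise function can be applied to each row independently.
row_results = [upper_text(r) for r in rows]

# The aggregate cannot be computed from a single row,
# so it does not fit the preparation processor model.
aggregate_result = mean_price(rows)
```

Only the first kind of function maps onto a preparation processor, because the processor sees one row at a time.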
To create a plugin preparation processor component:
From the Application menu, select Plugins.
Select Add Plugin > Write your own.
Give the new plugin a name like hiding-words-processor and click Create.
Click +Create Your First Component and select Preparation Processor.
Give the new component an identifier and click Add.
The preparation processor plugin component comprises two files: a configuration file (processor.json) and a code file (processor.py).
Both are pre-filled with sample code that you need to modify.
Editing the JSON descriptor#
The JSON file contains the metadata ("meta") and the parameters ("params") that the user will set for the plugin.
For our example, the JSON can be modified as follows:
/* This file is the descriptor for the Custom Python step: Hide Text */
{
    "meta" : {
        // label: name of the data prep step as displayed, should be short
        "label": "Hide text",
        // description: longer string to help end users understand what this data prep step does
        "description": "Hides words that appear in text.",
        // icon: must be one of the FontAwesome 3.2.1 icons, complete list here at https://fontawesome.com/v3.2.1/icons/
        "icon": "icon-asterisk"
    },
    /*
     * the processor mode, dictating what output is expected:
     * - CELL : the code outputs a value
     * - ROW : the code outputs a row
     * - ROWS : the code outputs an array of rows
     */
    "mode": "CELL",
    /* params:
       DSS will generate a form from this list of requested parameters.
       Your component code can then access the value provided by users using the "name" field of each parameter.
       Available parameter types include:
       STRING, INT, DOUBLE, BOOLEAN, DATE, SELECT, TEXTAREA, MAP, PRESET and others.
       For the full list and for more details, see the documentation: https://doc.dataiku.com/dss/latest/plugins/reference/params.html
    */
    "params": [
        {
            "name": "input_column",
            "label": "Input column",
            "type": "COLUMN",
            "description": "Column containing the text to be processed.",
            "columnRole": "main",
            "mandatory": true
        }
    ],
    "useKernel" : true
}
For more information on the customizations in the JSON file, see Component: Preparation Processor in the reference documentation.
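To make the three modes listed in the descriptor more concrete, here is a minimal sketch of what a function might return in each case. It assumes that a row behaves like a dictionary of column values. In processor.py the function is named process; the three names below, the text column, and the text_length column are hypothetical and only serve to show the variants side by side.

```python
def process_cell(row):
    # CELL mode: return a single value for the row.
    return len(row["text"])


def process_row(row):
    # ROW mode: return the row itself, possibly modified.
    row["text_length"] = len(row["text"])
    return row


def process_rows(row):
    # ROWS mode: return a list of rows, here one row per word.
    return [dict(row, text=word) for word in row["text"].split(" ")]
```

This tutorial uses the CELL mode, so its function returns one string per row.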
Editing the Python code#
We want the processor to hide certain words that appear in our dataset in a case-insensitive manner. To do that, we can use a Python function that takes each row of the dataset as a parameter and returns the modified value for that row.
# Note: this processor hides specific words appearing in text in a case-insensitive manner.
# You can fill the text_to_hide list with the words you want to hide.
def process(row):
    # List of words to hide (casefolded so that matching is case-insensitive)
    text_to_hide = ["example_1", "example_2", "to_fill"]
    text_to_hide = [w.casefold() for w in text_to_hide]
    # Retrieve the user-defined input column
    # (params is not defined in this file; it is used as a global holding the user's settings)
    text_column = params["input_column"]
    # Replace each word found in the list with "****"
    text_list = row[text_column].split(" ")
    text_list_hide = [w if w.casefold() not in text_to_hide else "****" for w in text_list]
    return " ".join(text_list_hide)
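To try this logic outside Dataiku, a minimal sketch such as the following can help. It assumes that params behaves like a dictionary keyed by the parameter names from the JSON descriptor and that each row behaves like a dictionary of column values; both are stubbed by hand here. The column name review and the word list are hypothetical.

```python
# Stand-in for the parameters that Dataiku is assumed to supply at runtime.
params = {"input_column": "review"}


def process(row):
    # Hypothetical words to hide, casefolded for case-insensitive matching.
    text_to_hide = [w.casefold() for w in ["secret", "password"]]
    # Look up the column chosen by the user in the step settings.
    text_column = params["input_column"]
    # Split on single spaces and mask every word found in the list.
    text_list = row[text_column].split(" ")
    text_list_hide = [w if w.casefold() not in text_to_hide else "****" for w in text_list]
    return " ".join(text_list_hide)


sample_row = {"review": "My Password is SECRET today"}
result = process(sample_row)
print(result)  # My **** is **** today
```

Because the text is split on single spaces, a word attached to punctuation (for example "secret,") is not matched and stays visible.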
Test the preparation processor#
You can test the processor component from the Prepare recipe in the Flow.
Refresh the Flow of your project.
Open the Prepare recipe and click +Add a New Step.
Begin to search for Hide text and select Hide text.
Configure the processor as follows:
Output column: Hidden_Text_Column
Input column: Column_With_Text_To_Hide
Scroll across the Preview page to view the output column Hidden_Text_Column with text hidden (that is, replaced with “****”).
Run the Prepare recipe and Update Schema.
What’s next?#
In this lesson, you have learned how to create a custom preparation processor for Prepare recipes. You can check the other tutorials on plugins to see how to package more components.