Creating a plugin Prediction Algorithm component#

Creating a prediction algorithm component in a plugin allows you to extend the list of algorithms available in Dataiku’s Visual ML tool.

In this tutorial, you will package the scikit-learn Linear Discriminant Analysis classifier as a plugin component.

Creating the plugin component#

From the Application menu, select Plugins.
Select Add Plugin > Write your own.
Give the new plugin a name like discriminant-analysis and click Create.
Click +Create Your First Component and select Prediction Algorithm.
Give the new component an identifier and click Add.

The Prediction Algorithm plugin component is composed of two files: a configuration file (algo.json) and a code file (algo.py).

Editing `algo.json`#

First, look at the algo.json file. As with every plugin component, the first element of the JSON file is "meta", which holds the component’s metadata. Editing it makes the algorithm easier to identify in the Visual ML tool.

"meta" : {

    "label": "Linear Discriminant Analysis",

    "description": "A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.",

    "icon": "fas fa-puzzle-piece"
},

Set the "predictionTypes" element to the prediction types your algorithm supports. Linear discriminant analysis is a classifier, so this example enables binary and multiclass classification:

"predictionTypes": ["BINARY_CLASSIFICATION", "MULTICLASS"],

Dataiku uses grid search to select hyperparameters. The "gridSearchMode" element determines whether Dataiku builds the grid from the parameters you declare ("MANAGED") or you define a custom search strategy ("CUSTOM"); "NONE" is also an accepted value.

"gridSearchMode": "MANAGED",

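To make the idea of a grid concrete, here is a minimal, standalone sketch of what a grid search enumerates: every combination of the candidate values supplied for each parameter. It uses only the Python standard library; the `expand_grid` helper and the candidate values are illustrative assumptions, not part of Dataiku's API or a description of its internals.

```python
from itertools import product

# Candidate values as a user might enter them in the form (illustrative only).
grid_params = {
    "solver": ["svd", "lsqr"],
    "tol": [0.0001, 0.001],
}


def expand_grid(params):
    """Return every combination of the candidate values as a list of dicts."""
    names = list(params)
    value_lists = [params[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(grid_params)
for combination in combinations:
    print(combination)
```

Two candidate values for each of two parameters give four combinations, so the number of models to train grows quickly as values are added.
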
The last important section of the JSON is "params". The parameters below mirror arguments of the scikit-learn implementation of discriminant analysis. Setting "gridParam" to true on a parameter makes it part of the grid search.

"params": [
    {
        "name": "solver",
        "label": "Solver",
        "description": "Solver to use.",
        "type": "MULTISELECT",
        "defaultValue": ["svd"],
        "selectChoices": [
            {
                "value":"svd",
                "label":"Singular value decomposition"
            },
            {
                "value":"lsqr",
                "label":"Least squares"
            },
            {
                "value":"eigen",
                "label": "Eigenvalue decomposition"
            }
        ],
        "gridParam": true
    },
    {
        "name": "n_components",
        "label": "Number of components",
        "description":"Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features).",
        "type": "DOUBLES",
        "defaultValue": [1],
        "allowDuplicates": false,
        "gridParam": true
    },
    {
        "name": "tol",
        "label": "Tolerance",
        "description": "Threshold for rank estimation in SVD solver.",
        "type": "DOUBLES",
        "defaultValue": [0.0001],
        "allowDuplicates": false,
        "gridParam": true
    }
]

Users can then set these parameters when creating a new model based on this plugin.
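The "name" fields above match argument names of the scikit-learn estimator. As a quick check outside Dataiku, the following sketch compares the declared names with the constructor signature of `LinearDiscriminantAnalysis`. It assumes scikit-learn is installed locally, and the `declared_names` list is copied by hand from the JSON rather than read from the plugin.

```python
import inspect

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Names declared in the "params" list of algo.json (copied by hand).
declared_names = ["solver", "n_components", "tol"]

# Argument names accepted by the scikit-learn constructor.
accepted = set(inspect.signature(LinearDiscriminantAnalysis).parameters)

# Any name listed here would not be understood by the estimator.
unknown = [name for name in declared_names if name not in accepted]
print("Unknown parameter names:", unknown)
```

An empty list means every declared parameter corresponds to a real constructor argument.
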

Editing `algo.py`#

Now, let’s edit algo.py. The default contents include example code for the AdaBoostRegressor algorithm. The code defines a Python class with two methods: __init__() and get_clf().

Import the algorithm you want and instantiate it in __init__(); get_clf() then returns that estimator. For a linear discriminant analysis, you can do as follows:

from dataiku.doctor.plugins.custom_prediction_algorithm import BaseCustomPredictionAlgorithm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

class CustomPredictionAlgorithm(BaseCustomPredictionAlgorithm):
    def __init__(self, prediction_type=None, params=None):
        # Instantiate the scikit-learn estimator that Visual ML will train.
        self.clf = LinearDiscriminantAnalysis()
        super(CustomPredictionAlgorithm, self).__init__(prediction_type, params)

    def get_clf(self):
        # Return the estimator built in __init__().
        return self.clf
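Before training in Visual ML, it can help to confirm outside Dataiku that the estimator behaves as expected with the default values declared in algo.json. The sketch below is a standalone check, not part of the plugin: it assumes scikit-learn is installed locally and uses its bundled iris dataset purely as sample data.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Sample data: 150 rows, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Same default values as in algo.json. n_components must be an integer
# no larger than min(n_classes - 1, n_features), which is 2 for this data.
clf = LinearDiscriminantAnalysis(solver="svd", n_components=1, tol=0.0001)
clf.fit(X, y)

predictions = clf.predict(X)          # one predicted class per row
probabilities = clf.predict_proba(X)  # one column per class
accuracy = clf.score(X, y)            # accuracy on the training data

print(predictions[:5], probabilities.shape, accuracy)
```

If this runs cleanly, the same estimator returned by get_clf() should expose the fit and predict methods a scikit-learn classifier is expected to provide.
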

Using the component in a project#

Open a project and select a dataset.
In the right panel, open the Lab and click AutoML Prediction to create a predictive model.
Select the target column you want to predict.
In the Algorithms panel of the model’s Design settings, turn on your plugin algorithm.
Specify the settings and parameters you want, then click Train.

You can explore and deploy the resulting model in the same way you would any other model produced through the Visual ML tool.

Wrapping Up#

You can now create a machine learning algorithm and use it as a plugin in the Visual ML tool in Dataiku.

The complete algo.json file is as follows:

algo.json

/* This file is the descriptor for the Custom Python Prediction algorithm ml-algo-test_linear-test */
{
    "meta" : {

        "label": "Linear Discriminant Analysis",

        "description": "A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.",

        "icon": "fas fa-puzzle-piece"
    },
    
    // List of types of prediction for which the algorithm will be available.
    // Possible values are: ["BINARY_CLASSIFICATION", "MULTICLASS", "REGRESSION"]
    "predictionTypes": ["BINARY_CLASSIFICATION", "MULTICLASS"],

    // Depending on the mode you select, Dataiku will handle or not the building of the grid from the params
    // Possible values are ["NONE", "MANAGED", "CUSTOM"]
    "gridSearchMode": "MANAGED",

    // Whether the model supports sample weights for training.
    // If yes, the clf from `algo.py` must have a `fit(X, y, sample_weights=None)` method
    // If not, sample weights are not applied on this algorithm, but if they are selected
    // for training, they will be applied on scoring metrics and charts.
    // scikit-learn's LinearDiscriminantAnalysis.fit(X, y) takes no sample weight argument, hence false.
    "supportsSampleWeights": false,

    // Whether the model supports sparse matrices for fitting and predicting,
    // i.e. if the `clf` provided in `algo.py` accepts a sparse matrix as argument
    // for its `fit` and `predict` methods
    "acceptsSparseMatrix": false,

    /* params:
    Dataiku will generate a form from this list of requested parameters.
    Your component code can then access the value provided by users using the "name" field of each parameter.

    Available parameter types include:
    STRING, INT, DOUBLE, BOOLEAN, DATE, SELECT, TEXTAREA, PRESET and others.

    Besides, if the parameters are to be used to build the grid search, you must add a `gridParam` field and set it to true.

    For the full list and for more details, see the documentation: https://doc.dataiku.com/dss/latest/plugins/reference/params.html

    Below is an example of parameters for a Linear Discriminant Analysis classifier from scikit-learn.
    */
    "params": [
        {
            "name": "solver",
            "label": "Solver",
            "description": "Solver to use.",
            "type": "MULTISELECT",
            "defaultValue": ["svd"],
            "selectChoices": [
                {
                    "value":"svd",
                    "label":"Singular value decomposition"
                },
                {
                    "value":"lsqr",
                    "label":"Least squares"
                },
                {
                    "value":"eigen",
                    "label": "Eigenvalue decomposition"
                }
            ],
            "gridParam": true
        },
        {
            "name": "n_components",
            "label": "Number of components",
            "description":"Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features).",
            "type": "DOUBLES",
            "defaultValue": [1],
            "allowDuplicates": false,
            "gridParam": true
        },
        {
            "name": "tol",
            "label": "Tolerance",
            "description": "Threshold for rank estimation in SVD solver.",
            "type": "DOUBLES",
            "defaultValue": [0.0001],
            "allowDuplicates": false,
            "gridParam": true
        }
    ]
}