Creating a plugin Prediction Algorithm component#

Creating a prediction algorithm component in a plugin allows you to extend the list of algorithms available in Dataiku’s Visual ML tool.

In this tutorial, you will create a Linear Discriminant Analysis algorithm as a plugin component.

Creating the plugin component#

  1. From the Application menu, select Plugins.

  2. Select Add Plugin > Write your own.

  3. Give the new plugin a name like discriminant-analysis and click Create.

  4. Click +Create Your First Component and select Prediction Algorithm.

  5. Give the new component an identifier and click Add.

The ML algorithm plugin component is composed of two files: a configuration file (algo.json) and a code file (algo.py).
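Within the plugin’s folder, these two files typically live in a subfolder named after the component identifier, under a python-prediction-algos directory (Dataiku generates this layout for you when you create the component):

python-prediction-algos/
└── your-component-id/
    ├── algo.json
    └── algo.py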

Editing algo.json#

First, let’s have a look at the algo.json file. As in every plugin, the first element of the JSON file is "meta", which holds the metadata of your plugin. Editing it makes the algorithm easier to identify in the Visual ML tool.

1"meta" : {
2
3    "label": "Linear Discriminant Analysis",
4
5    "description": "A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.",
6
7    "icon": "fas fa-puzzle-piece"
8},

The "predictionTypes" element must be changed according to the type of your problem. Discriminant analysis is the example here and can be used for classification problems. Hence, we choose the parameters as follows:

"predictionTypes": ["BINARY_CLASSIFICATION", "MULTICLASS"],

Dataiku uses grid search to select hyperparameters. With "gridSearchMode", you can either let Dataiku manage the building of the search grid or set a custom search strategy. Possible values are "NONE", "MANAGED", and "CUSTOM".

"gridSearchMode": "MANAGED",

The last important section of the JSON is "params". For example, here are the parameters for the scikit-learn implementation of linear discriminant analysis, which we can expose in the JSON as follows:

 1"params": [
 2    {
 3        "name": "solver",
 4        "label": "Solver",
 5        "description": "Solver to use.",
 6        "type": "MULTISELECT",
 7        "defaultValue": ["svd"],
 8        "selectChoices": [
 9            {
10                "value":"svd",
11                "label":"Singular value decomposition"
12            },
13            {
14                "value":"lsqr",
15                "label":"Least squares"
16            },
17            {
18                "value":"eigen",
19                "label": "Eigenvalue decomposition"
20            }
21        ],
22        "gridParam": true
23    },
24    {
25        "name": "n_components",
26        "label": "Number of components",
27        "description":"Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features).",
28        "type": "DOUBLES",
29        "defaultValue": [1],
30        "allowDuplicates": false,
31        "gridParam": true
32    },
33    {
34        "name": "tol",
35        "label": "Tolerance",
36        "description": "Threshold for rank estimation in SVD solver.",
37        "type": "DOUBLES",
38        "defaultValue": [0.0001],
39        "allowDuplicates": false,
40        "gridParam": true
41    }
42]

These parameters can then be adjusted in the Visual ML tool when creating a new model based on this plugin.
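To make the managed grid search concrete, here is a minimal sketch in plain scikit-learn (outside of Dataiku) of the underlying idea: every parameter marked with "gridParam": true contributes its list of values to a grid, and the classifier is evaluated at each point of that grid. The value lists below stand in for what a user might select in the UI.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ParameterGrid

# Value lists as a user might set them for the params defined above
grid = ParameterGrid({
    "solver": ["svd", "lsqr"],  # from the MULTISELECT "solver" param
    "tol": [0.0001, 0.001],     # from the DOUBLES "tol" param
})

clf = LinearDiscriminantAnalysis()
for point in grid:
    # Conceptually, the managed grid search applies each grid point
    # to the classifier and evaluates it
    clf.set_params(**point)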

Editing algo.py#

Now, let’s edit algo.py. The default contents include example code for the AdaBoostRegressor algorithm. The code defines a Python class with an __init__() method and a get_clf() method.

Remember to import the desired algorithm and instantiate it in the __init__() method. For a linear discriminant analysis, you can do as follows:

from dataiku.doctor.plugins.custom_prediction_algorithm import BaseCustomPredictionAlgorithm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

class CustomPredictionAlgorithm(BaseCustomPredictionAlgorithm):
    def __init__(self, prediction_type=None, params=None):
        self.clf = LinearDiscriminantAnalysis()
        super(CustomPredictionAlgorithm, self).__init__(prediction_type, params)

    def get_clf(self):
        return self.clf
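Note that the params dictionary holds the values users set in the UI, keyed by each parameter’s "name" from algo.json. If you add a parameter that is not part of the search grid, one possible pattern is to read it in __init__(). The variant below is a sketch; store_covariance is a hypothetical extra BOOLEAN param you would add to algo.json for illustration (it maps to a real scikit-learn option of the same name):

    def __init__(self, prediction_type=None, params=None):
        params = params or {}
        # Read the hypothetical non-grid "store_covariance" param by its "name"
        self.clf = LinearDiscriminantAnalysis(
            store_covariance=params.get("store_covariance", False)
        )
        super(CustomPredictionAlgorithm, self).__init__(prediction_type, params)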

Using the component in a project#

  1. Open a project and select a dataset.

  2. In the right panel, open a new Lab and click AutoML Prediction to create a predictive model.

  3. Select the target feature you want to predict.

  4. In the Algorithms panel of the Design of the predictive model, turn your plugin algorithm on.

  5. Specify the settings and parameters you want, then click Train.

You can explore and deploy the resulting model in the same way you would any other model produced through the Visual ML tool.
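If you prefer to script these steps, the same workflow can be driven with Dataiku’s Python API client. The sketch below makes several assumptions: MYPROJECT, mydataset, and target are placeholders, and the internal name of your plugin algorithm must be looked up from the ML task settings rather than guessed.

import dataiku

client = dataiku.api_client()
project = client.get_project("MYPROJECT")  # placeholder project key

# Create an AutoML prediction task on a dataset (placeholder names)
ml_task = project.create_prediction_ml_task(
    input_dataset="mydataset",
    target_variable="target",
)

settings = ml_task.get_settings()
# Plugin algorithms appear among the possible algorithm names;
# print the list to find your component's identifier, then enable it
print(settings.get_all_possible_algorithm_names())
settings.set_algorithm_enabled("<your-plugin-algorithm-name>", True)
settings.save()

ml_task.start_train()
ml_task.wait_train_complete()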

Wrapping Up#

You can now create a machine learning algorithm and use it as a plugin component in the Visual ML tool in Dataiku.

The complete code for both files follows:

algo.json
/* This file is the descriptor for the Custom Python Prediction algorithm ml-algo-test_linear-test */
{
    "meta" : {

        "label": "Linear Discriminant Analysis",

        "description": "A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.",

        "icon": "fas fa-puzzle-piece"
    },
    
    // List of types of prediction for which the algorithm will be available.
    // Possibles values are: ["BINARY_CLASSIFICATION", "MULTICLASS", "REGRESSION"]
    "predictionTypes": ["BINARY_CLASSIFICATION", "MULTICLASS"],

    // Depending on the mode you select, DSS will or will not handle building the grid from the params
    // Possible values are ["NONE", "MANAGED", "CUSTOM"]
    "gridSearchMode": "MANAGED",

    // Whether the model supports sample weights for training. 
    // If yes, the clf from `algo.py` must have a `fit(X, y, sample_weights=None)` method
    // If not, sample weights are not applied to this algorithm, but if they are selected
    // for training, they will be applied to scoring metrics and charts.
    "supportsSampleWeights": true,

    // Whether the model supports sparse matrices for fitting and predicting, 
    // i.e. whether the `clf` provided in `algo.py` accepts a sparse matrix as the argument
    // to its `fit` and `predict` methods
    "acceptsSparseMatrix": false,

    /* params:
    DSS will generate a form from this list of requested parameters.
    Your component code can then access the values provided by users using the "name" field of each parameter.

    Available parameter types include:
    STRING, INT, DOUBLE, BOOLEAN, DATE, SELECT, TEXTAREA, PRESET and others.

    In addition, if a parameter is to be used to build the search grid, you must add a `gridParam` field and set it to true.

    For the full list and for more details, see the documentation: https://doc.dataiku.com/dss/latest/plugins/reference/params.html

    Below are the parameters for the scikit-learn implementation of linear discriminant analysis.
    */
    "params": [
        {
            "name": "solver",
            "label": "Solver",
            "description": "Solver to use.",
            "type": "MULTISELECT",
            "defaultValue": ["svd"],
            "selectChoices": [
                {
                    "value":"svd",
                    "label":"Singular value decomposition"
                },
                {
                    "value":"lsqr",
                    "label":"Least squares"
                },
                {
                    "value":"eigen",
                    "label": "Eigenvalue decomposition"
                }
            ],
            "gridParam": true
        },
        {
            "name": "n_components",
            "label": "Number of components",
            "description":"Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features).",
            "type": "DOUBLES",
            "defaultValue": [1],
            "allowDuplicates": false,
            "gridParam": true
        },
        {
            "name": "tol",
            "label": "Tolerance",
            "description": "Threshold for rank estimation in SVD solver.",
            "type": "DOUBLES",
            "defaultValue": [0.0001],
            "allowDuplicates": false,
            "gridParam": true
        }
    ]
}
algo.py
# This file is the actual code for the custom Python algorithm ml-algo-test_linear-test
from dataiku.doctor.plugins.custom_prediction_algorithm import BaseCustomPredictionAlgorithm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

class CustomPredictionAlgorithm(BaseCustomPredictionAlgorithm):    
    """
        Class defining the behaviour of `ml-algo-test_linear-test` algorithm:
        - how it handles parameters passed to it
        - how the estimator works

        The example here defines a Linear Discriminant Analysis classifier from scikit-learn
        (see https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)

        You need to define at least a `get_clf` method that must return a scikit-learn compatible model

        Args:
            prediction_type (str): type of prediction for which the algorithm is used. Relevant when
                                   the algorithm works for more than one type of prediction.
                                   Possible values are: "BINARY_CLASSIFICATION", "MULTICLASS", "REGRESSION"
            params (dict): dictionary of params set by the user in the UI.
    """
    
    def __init__(self, prediction_type=None, params=None):        
        self.clf = LinearDiscriminantAnalysis()
        super(CustomPredictionAlgorithm, self).__init__(prediction_type, params)
    
    def get_clf(self):
        """
        This method must return a scikit-learn compatible model, i.e. it must:
        - have fit(X, y) and predict(X) methods. If sample weights
          are enabled for this algorithm (in algo.json), the fit method
          must instead have the signature fit(X, y, sample_weight=None)
        - have get_params() and set_params(**params) methods
        """
        return self.clf