Building and deploying machine learning (ML) models is a cornerstone of most data science projects, and Dataiku provides a comprehensive set of features to ease and speed up these operations. While the platform offers a wide range of visual capabilities, it also exposes numerous programmatic elements for anyone who wants to handle their model’s lifecycle using code.
The first step of the machine learning process is to fit a model using training data. It is an experimental phase during which you can test various combinations of pre-processing, algorithms, and parameters. The process of running such trials and logging their results is called experiment tracking; it is implemented natively in Dataiku so that you can use a variety of ML frameworks to train models and log their performance and characteristics.
Under the hood, Dataiku uses MLflow models as a standardized format to package models.
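To make the idea of experiment tracking concrete, here is a minimal, framework-agnostic sketch in plain Python. It does not use the Dataiku or MLflow APIs: `toy_train_and_score`, `run_experiment`, and `best_run` are hypothetical names, and the scoring formula is a made-up stand-in for a real training run. It only illustrates the pattern of looping over parameter combinations and logging each trial's parameters and metrics.

```python
import itertools


def toy_train_and_score(learning_rate, depth):
    """Hypothetical stand-in for a training run; returns a made-up validation score."""
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(depth - 4)


def run_experiment(param_grid):
    """Run one trial per parameter combination and log its params and metrics."""
    keys = sorted(param_grid)
    runs = []
    combinations = itertools.product(*(param_grid[k] for k in keys))
    for run_id, values in enumerate(combinations):
        params = dict(zip(keys, values))
        metrics = {"score": toy_train_and_score(**params)}
        runs.append({"run_id": run_id, "params": params, "metrics": metrics})
    return runs


def best_run(runs, metric="score"):
    """Return the logged run with the highest value for the given metric."""
    return max(runs, key=lambda run: run["metrics"][metric])


grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 4, 8]}
runs = run_experiment(grid)
best = best_run(runs)
print(best["params"], best["metrics"]["score"])
```

In a real project, the list of dictionaries would be replaced by the tracking backend, which stores runs persistently and lets you compare them later.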
In some cases, training a model can require much time and computing resources, so you may prefer to bring in an existing pre-trained model and perform subsequent operations in the Dataiku platform from there.
Several features can help speed up this process. You can either:
Retrieve and cache pre-trained models and embeddings provided by your ML framework of choice using code environment resources
Bring in model artifacts inside your Flow and store them in managed folders
From there, you can either fine-tune your models using experiment tracking or move straight to evaluation and deployment.
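The caching idea behind the first option can be sketched generically: fetch an artifact once, store it in a folder, and reuse the stored copy afterwards. The sketch below uses only the Python standard library and a temporary directory; it is not the Dataiku code environment resources or managed folder API, and `get_cached_artifact` and `fake_download` are hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path


def get_cached_artifact(cache_dir, name, fetch):
    """Return the path of a cached artifact, calling fetch() only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        target.write_bytes(fetch())
    return target


download_calls = []


def fake_download():
    """Hypothetical stand-in for downloading pre-trained weights."""
    download_calls.append(1)
    return b"pretend these bytes are model weights"


with tempfile.TemporaryDirectory() as tmp:
    first = get_cached_artifact(tmp, "model.bin", fake_download)
    second = get_cached_artifact(tmp, "model.bin", fake_download)
    same_path = first == second
    cached_bytes = second.read_bytes()

download_count = len(download_calls)
print(same_path, download_count)
```

The second call finds the file already on disk, so the download function is not invoked again.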
Evaluating a model involves computing a set of metrics to reflect how well it performs against a specific evaluation dataset.
In Dataiku, these metrics encompass the predictive power of the model, its explainability, and drift indicators. The values of those metrics are computed in a buildable Flow item called the “evaluation store” and are accessible either in their raw form using the public API or visually through a set of rich visualizations embedded in the Dataiku web interface.
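As an illustration of the kinds of metrics involved, the sketch below computes one predictive-power metric (accuracy) and one drift indicator (the population stability index, PSI) in plain Python. This is an assumption-laden example: the text does not say which metrics the evaluation store computes or how, and the function names, bin count, and sample values here are chosen purely for illustration.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def population_stability_index(reference, current, n_bins=4, eps=1e-6):
    """PSI between two samples of one feature, using bins built on the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant reference

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped to the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps avoids log(0) when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [v + 0.3 for v in reference]

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

A PSI of zero means the two samples fall into the reference bins in identical proportions; larger values indicate that the evaluation data has moved away from the reference data.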
Deployment and scoring
The final step to make a model operational is to deploy it on a production infrastructure where it will be used to score incoming data. Depending on how the input data is expected to reach the model, Dataiku offers several deployment patterns:
If the model is meant to be queried via HTTP, Dataiku can package it as a REST API endpoint and take advantage of cloud-native infrastructures such as Kubernetes to ensure scalability and high availability.
For cases where larger data batches are expected to be processed and then scored, Dataiku allows the deployment of entire projects to production-ready instance types called Automation nodes.
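The HTTP pattern described above boils down to receiving a JSON payload, scoring it, and returning a JSON response. The sketch below shows that shape using only the Python standard library. It is not how Dataiku packages endpoints: the `predict` function is a hypothetical fixed linear model, and the payload layout (`{"features": {...}}`), handler names, and port are assumptions made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in model: a fixed linear score over two features."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]


def handle_scoring_request(body):
    """Turn a raw JSON request body into a (status_code, response_dict) pair."""
    try:
        payload = json.loads(body)
        prediction = predict(payload["features"])
    except (ValueError, KeyError, TypeError) as exc:
        return 400, {"error": f"invalid request: {exc}"}
    return 200, {"prediction": prediction}


class ScoringHandler(BaseHTTPRequestHandler):
    """Minimal POST handler that delegates to handle_scoring_request."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_scoring_request(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the endpoint locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the endpoint on localhost. Keeping the scoring logic in `handle_scoring_request`, separate from the HTTP handler, makes it easy to test without opening a socket.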
Dataiku also offers flexible choices to pilot the deployment process, which can be executed using the platform’s native “Deployer” features or delegated to an external Continuous Integration/Continuous Delivery (CI/CD) pipeline.
For specific cases where models need to be exported outside of Dataiku, you can generate standalone Python or Java artifacts. For more information, see the related documentation page.