Creating a sample dataset#

When starting a new project, users might have trouble finding or using datasets relevant to your company. To overcome these difficulties, Dataiku introduced a new plugin component, Sample Dataset, in version 14. This component allows you to provide datasets that can be used to start a project quickly.

Once a sample dataset has been created, every user can find it easily, either on an empty flow by clicking “Browse sample data” (Fig. 1), or in the flow view by clicking the +Dataset button and choosing the sample option (Fig. 2).

Either way, you end up in a modal that lets you select a sample dataset (Fig. 3). To provide your users with a new sample dataset, you must develop a sample dataset component in an existing plugin (or a new one).

Figure 3: Browse provided sample dataset.#

This tutorial highlights the different actions needed to develop this component.

Prerequisites#

You have followed the Creating and configuring a plugin tutorial or already know how to develop a plugin.

Dataiku >= 14.0
“Develop plugins” permissions

Creating the plugin environment#

To create a sample dataset, you must first create a plugin. The Creating and configuring a plugin tutorial explains how to create and configure one. Once the plugin is created, click the New component button and choose the Sample dataset component (Fig. 4).

Figure 4: New sample dataset component.#

Fill in the form by providing a unique identifier (pro-customer, for example) and click the Add button. This redirects you to the plugin development environment, in a folder containing the dataset.json configuration file (Code 1) and a data folder containing a sample.csv file. The former configures the sample dataset, and the latter is where you put the data that you want to share.

Code 1: default configuration file – dataset.json#

// This file is the descriptor for the sample dataset template devadv-plugin-test
{
  "meta": {
    // label: name of the app template as displayed, should be short
    "label": "pro-customers",
    // description: longer string to help end users understand what this sample is
    "description": "",
    // icon: must be one of the FontAwesome 5.15.4 icons, complete list here at https://fontawesome.com/v5/docs/
    "icon": "fas fa-flask",
    // rowCount: Number of rows of your sample, optional
    // "rowCount": 100,

    // logo: optional displayed logo when selecting your sample
    // The logo should be located in the root of the plugin folder, inside a "resource" directory
    // For example: my-plugin/resource/my_logo.png
    // The logo image should be 280x200 pixels
    // The logo filename must only contain letters (a-z, A-Z), digits (0-9), dots (.), underscores (_), hyphens (-), and spaces ( )
    // The logo filename extension must be one of the following: ".apng", ".png", ".avif", ".gif", ".jpg", ".jpeg", ".jfif", ".svg", ".webp", ".bmp", ".ico", ".cur"
    // "logo": "my_logo.png",

    // displayOrderRank: number used to sort the various samples by descending order
    "displayOrderRank": 1
  },
  // Your data should be placed in a "data" directory where you can include your CSV files (e.g., "sample.csv") encoded in utf-8.
  // Your cells have to be separated by commas (,).
  // You can use double quotes (") as a quoting character to enclose cells containing the separator, and use backslash (\) as an escape character.
  // Additionally, you can use compressed files with the ".gz" extension for Gzip-compressed files (e.g., "sample.csv.gz")
  // You can also include multiple files or use zipped CSV files for convenience.

  // columns: schema columns of your sample
  // Each column must have:
  // - A unique "name"
  // - A "type" matching one of the available storage types
  //   - Available types include: string, int, double, boolean, dateonly, bigint and others : https://doc.dataiku.com/dss/latest/schemas/definitions.html#storage-types
  // Optional properties for each column:
  // - "comment": Description of the column
  // - "meaning": Can be user-defined or one of the recognized meanings
  //   - Recognized meanings include: Text, DoubleMeaning, LongMeaning, Boolean, DateOnly and others : https://doc.dataiku.com/dss/latest/schemas/meanings-list.html
  //
  // The following example is a sample schema for a heights dataset
  "columns": [
    {
      "name": "id",
      "type": "bigint",
      "comment": "Unique identifier"
    },
    {
      "name": "name",
      "type": "string",
      "comment": "Name of the person"
    },
    {
      "name": "size",
      "type": "double",
      "comment": "Height of the person, in meters",
      "meaning": "DoubleMeaning"
    }
  ]
}

Configuring the sample dataset#

To configure your sample dataset, modify the dataset.json file. This file is divided into two sections:

meta: configuration of the object sample dataset
columns: description of the different columns in your sample

In the meta section, you will find the usual fields (label, description, icon) and three specific optional fields (rowCount, logo, and displayOrderRank).

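logo, label, and description are used to display information about your sample dataset when creating it, as shown in Fig. 5. If you want to provide a logo, create a resource folder at the root of your plugin and upload your image to that folder. As noted in the default file, the logo should be 280x200 pixels, and its filename may only contain letters, digits, dots, underscores, hyphens, and spaces.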

Figure 5: Displaying information of your sample dataset.#
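
The filename rules listed in the comments of the default dataset.json can be checked before uploading a logo. The following is a minimal sketch, not part of Dataiku's API: the helper name is_valid_logo_filename is hypothetical, and matching extensions case-sensitively is an assumption (the default file does not say whether uppercase extensions are accepted).

```python
import re
from pathlib import PurePosixPath

# Extensions listed in the comments of the default dataset.json.
ALLOWED_EXTENSIONS = {
    ".apng", ".png", ".avif", ".gif", ".jpg", ".jpeg",
    ".jfif", ".svg", ".webp", ".bmp", ".ico", ".cur",
}

# Letters, digits, dots, underscores, hyphens, and spaces only.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9._\- ]+")


def is_valid_logo_filename(filename: str) -> bool:
    """Hypothetical helper: check a logo filename against the documented rules."""
    if not ALLOWED_NAME.fullmatch(filename):
        return False
    # Assumption: extensions are compared as written (case-sensitive),
    # which is stricter than Dataiku may actually require.
    return PurePosixPath(filename).suffix in ALLOWED_EXTENSIONS


ok = is_valid_logo_filename("my_logo.png")
rejected = is_valid_logo_filename("logo@2x.png")
```

This check only covers the filename. The image dimensions would still need to be verified separately.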

If you have filled in the rowCount field, it will also be used when the user previews the sample dataset (Fig. 6).

The displayOrderRank determines the position at which your sample dataset is presented to the user: samples are sorted by descending rank.

Finally, the icon value is used when the dataset is visible in the flow (Fig. 7). If you don’t provide an icon, the plugin icon is used. If the plugin has no icon either, the default plugin icon is used.
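
To make the ordering concrete, here is a small sketch of a descending sort on displayOrderRank, as described in the comments of the default descriptor. The sample labels other than pro-customers are invented for illustration, and how Dataiku orders samples with equal ranks is not covered here.

```python
# Hypothetical descriptors: only the "meta" fields relevant to ordering.
samples = [
    {"label": "pro-customers", "displayOrderRank": 1},
    {"label": "web-logs", "displayOrderRank": 10},
    {"label": "sales", "displayOrderRank": 5},
]

# Descending order: the highest rank is listed first.
ordered = sorted(samples, key=lambda s: s["displayOrderRank"], reverse=True)
labels = [s["label"] for s in ordered]
# labels == ["web-logs", "sales", "pro-customers"]
```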

Figure 7: New sample dataset in the flow.#

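Describe your data in the columns section, following the rules stated in the comments of the default file: each column needs a unique name and a type matching one of the available storage types, while comment and meaning are optional.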
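
Because the descriptor contains // comments, a standard JSON parser cannot read it directly. The sketch below shows one possible way to check a descriptor locally before uploading it: it strips line comments that sit outside strings, parses the result, and checks the two column rules from the default file (unique name, type present). The helpers strip_line_comments and check_columns are hypothetical and not part of Dataiku, and the descriptor text is a shortened version of Code 1.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside JSON strings."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Keep the escaped character untouched.
                out.append(text[i + 1])
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def check_columns(descriptor):
    """Return a list of problems found in the "columns" section."""
    problems = []
    seen = set()
    for column in descriptor.get("columns", []):
        name = column.get("name")
        if not name:
            problems.append("column without a name")
        elif name in seen:
            problems.append(f"duplicate column name: {name}")
        seen.add(name)
        if "type" not in column:
            problems.append(f"column {name!r} has no type")
    return problems


descriptor_text = """
{
  "meta": {
    // label: displayed name, should be short
    "label": "pro-customers",
    "icon": "fas fa-flask",
    "displayOrderRank": 1
  },
  // columns: schema columns of your sample
  "columns": [
    {"name": "id", "type": "bigint"},
    {"name": "name", "type": "string"},
    {"name": "size", "type": "double", "meaning": "DoubleMeaning"}
  ]
}
"""

descriptor = json.loads(strip_line_comments(descriptor_text))
problems = check_columns(descriptor)
```

The check does not validate type names against Dataiku's list of storage types. That list is documented at the URL given in the default file.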

Providing sample dataset data#

In the data folder of your sample dataset component, you can provide several CSV files, all respecting the format you have described in the columns section. Dataiku concatenates those files to build a single dataset. Do not include a header line in your CSV files, as the columns are already described in dataset.json.
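
As an illustration, the following sketch uses Python's csv module to produce header-less CSV text in the format described in the default descriptor (comma separator, double quotes for quoting, backslash as escape character). The helper names are hypothetical, the company name "ACME, Inc." is invented to show quoting, and it is an assumption that Dataiku reads escaped characters exactly as Python writes them.

```python
import csv
import io

# Format described in the comments of the default dataset.json.
CSV_FORMAT = dict(delimiter=",", quotechar='"', escapechar="\\", doublequote=False)


def rows_to_csv_text(rows):
    """Serialize rows as CSV text, without a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n", **CSV_FORMAT)
    writer.writerows(rows)
    return buffer.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text written with the same format back into rows."""
    return list(csv.reader(io.StringIO(text), **CSV_FORMAT))


rows = [
    ["tcook", "Tim Cook", "CEO", "Apple"],
    # The comma inside the last cell forces that cell to be quoted.
    ["wcoyote", "Wile E. Coyote", "Business Developer", "ACME, Inc."],
]
text = rows_to_csv_text(rows)
```

The resulting text can then be saved as data/sample.csv with UTF-8 encoding, for example with open(path, "w", encoding="utf-8", newline="").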

Dataiku will use the unique identifier you provided while creating the sample dataset component. For example, if you put the content shown in Code 2 in sample.csv, you will end up with a dataset that can be used as a starting point for the tutorials on agents (Building and using an agent with Dataiku’s LLM Mesh and Langchain, LLM Mesh agentic applications). This data has four cells per row, so the columns section of dataset.json must declare four matching columns instead of the three columns of the default schema.

Code 2: sample of data.#

tcook,Tim Cook,CEO,Apple
snadella,Satya Nadella,CEO,Microsoft
jbezos,Jeff Bezos,CEO,Amazon
fdouetteau,Florian Douetteau,CEO,Dataiku
wcoyote,Wile E. Coyote,Business Developer,ACME
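
Since the data must match the columns section, it can help to verify that every row has as many cells as there are declared columns. Here is a minimal sketch of such a check on the data of Code 2. The column names and types below are assumptions made for illustration, not the schema used by the tutorial, and find_malformed_rows is a hypothetical helper.

```python
import csv
import io

# Hypothetical schema for the four-cell rows of Code 2.
columns = [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "job", "type": "string"},
    {"name": "company", "type": "string"},
]

sample_csv = """\
tcook,Tim Cook,CEO,Apple
snadella,Satya Nadella,CEO,Microsoft
jbezos,Jeff Bezos,CEO,Amazon
fdouetteau,Florian Douetteau,CEO,Dataiku
wcoyote,Wile E. Coyote,Business Developer,ACME
"""


def find_malformed_rows(csv_text, columns):
    """Return 1-based row numbers whose cell count differs from the schema."""
    reader = csv.reader(io.StringIO(csv_text), escapechar="\\", doublequote=False)
    return [
        number
        for number, row in enumerate(reader, start=1)
        if len(row) != len(columns)
    ]


bad_rows = find_malformed_rows(sample_csv, columns)
```

An empty result means that every row has the expected number of cells. It does not check that the values fit the declared types.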

Wrapping up#

Creating a sample dataset in Dataiku is a simple process that improves the user experience for new projects. By following this tutorial’s steps, you can build a sample dataset component tailored to your organization’s specific requirements. This gives users quick access to relevant data, so they can begin their analyses with the resources they need.

Here is the complete code of this tutorial: