Datasets#

Please see Datasets for an introduction to interacting with datasets in the Dataiku Python API.

The dataiku.Dataset class#

class dataiku.Dataset(name, project_key=None, ignore_flow=False)#

Provides a handle to obtain readers and writers on a Dataiku dataset. From this Dataset class, you can:

  • Read a dataset as a Pandas dataframe

  • Read a dataset as a chunked Pandas dataframe

  • Read a dataset row-by-row

  • Write a pandas dataframe to a dataset

  • Write a series of chunked Pandas dataframes to a dataset

  • Write to a dataset row-by-row

  • Edit the schema of a dataset

Parameters:
  • name (str) – The name of the dataset.

  • project_key (str) – The key of the project in which the dataset is located (current project key if none is specified)

  • ignore_flow (boolean) – this parameter is only relevant for recipes, not for notebooks or code in metrics or scenario steps. In a recipe, if it is left at False, DSS also checks whether the dataset is part of the inputs or outputs of the recipe and raises an error if it is not. Defaults to False

Returns:

a handle to interact with the dataset

Return type:

Dataset

static list(project_key=None)#

List the names of datasets of a given project.

Usage example:

import dataiku

# current project datasets
current_project_datasets = dataiku.Dataset.list()

# given project datasets
my_project_datasets = dataiku.Dataset.list("my_project")
Parameters:

project_key (str) – the optional key of the project to retrieve the datasets from, defaults to current project

Returns:

a list of dataset names

Return type:

list[str]

property full_name#

Get the fully-qualified identifier of the dataset on the DSS instance.

Returns:

a fully qualified identifier for the dataset in the form “project_key.dataset_name”

Return type:

str
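The identifier can be split back into its two parts. The sketch below is only an illustration: `split_full_name` is a hypothetical helper, not part of the Dataiku API, and it assumes that project keys never contain a dot.

```python
def split_full_name(full_name):
    """Split "project_key.dataset_name" into (project_key, dataset_name)."""
    # Assumption: the project key contains no dot, so the first dot is the separator.
    project_key, _, dataset_name = full_name.partition(".")
    if not project_key or not dataset_name:
        raise ValueError("expected 'project_key.dataset_name', got %r" % full_name)
    return project_key, dataset_name
```

With a dataset handle, this would be called as `split_full_name(dataset.full_name)`.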

get_location_info(sensitive_info=False)#

Retrieve the location information of the dataset.

Usage example

# save a dataframe to CSV with a fixed name on S3
import re
from io import StringIO

import boto3
import dataiku

dataset = dataiku.Dataset("my_target_dataset")
location_info = dataset.get_location_info(True)

s3_folder = location_info["info"]["path"]  # URI of the dataset
# extract the bucket and the path inside the bucket from the URI
match = re.match(r"^s3://([^/]+)/(.*)$", s3_folder)
s3_bucket, s3_path_in_bucket = match.group(1), match.group(2)

# save to S3 using boto3; df is assumed to be an existing pandas dataframe
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource("s3")
s3_resource.Object(s3_bucket, s3_path_in_bucket + "/myfile.csv").put(Body=csv_buffer.getvalue())
Parameters:

sensitive_info (boolean) – whether or not to provide sensitive information such as passwords, conditioned on the user being allowed to read details of the connection on which this dataset is defined

Returns:

a dict with the location info, with as notable fields:

  • locationInfoType: type of location. Possible values are ‘FS’, ‘HDFS’, ‘UPLOADED’, ‘SQL’

  • info : a dict whose structure depends on the type of connection

    • connectionName: connection name, if any

    • connectionParams : parameters of the connection on which the dataset is defined, as a dict, if any. The actual fields depend on the connection type. For an S3 dataset, this will for example contain the bucket and credentials.

    • path : the URI of the dataset, if any

Return type:

dict
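Because the content of info depends on the connection type, code that consumes this dict is safer when it treats every field as optional. The following is a minimal sketch under that assumption; `describe_location` is a hypothetical helper that only relies on the fields listed above.

```python
def describe_location(location_info):
    """Return a one-line summary of a dict shaped like the get_location_info() result."""
    location_type = location_info.get("locationInfoType")
    # "info" depends on the connection type, so every field is treated as optional.
    info = location_info.get("info") or {}
    connection = info.get("connectionName", "<no connection>")
    path = info.get("path", "<no path>")
    return "%s dataset on connection %s at %s" % (location_type, connection, path)
```

With a dataset handle, this would be called as `describe_location(dataset.get_location_info())`.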

get_files_info(partitions=[])#

Get information on the files of the dataset, with details per partition.

Parameters:

partitions (list[str], optional) – list of partition identifiers, defaults to all partitions

Returns:

file information for the whole dataset and per partition:

  • globalPaths: list of files of the dataset.

    • path: file path

    • lastModified: timestamp of last file update, in milliseconds

    • size: size of the file, in bytes

  • pathsByPartition: files grouped per partition, as a dict of partition identifier to list of files (same structure as globalPaths)

Return type:

dict
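As an illustration of the structure described above, the sketch below aggregates the returned dict. Both helpers are hypothetical and assume only the documented fields (globalPaths, pathsByPartition, size, lastModified).

```python
def size_by_partition(files_info):
    """Sum file sizes (in bytes) per partition identifier."""
    totals = {}
    for partition_id, files in files_info.get("pathsByPartition", {}).items():
        totals[partition_id] = sum(f["size"] for f in files)
    return totals


def latest_modification(files_info):
    """Return the most recent lastModified timestamp (in ms), or None if there is no file."""
    timestamps = [f["lastModified"] for f in files_info.get("globalPaths", [])]
    return max(timestamps) if timestamps else None
```

With a dataset handle, this would be called as `size_by_partition(dataset.get_files_info())`.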

set_write_partition(spec)#

Set which partition of the dataset gets written to when you create a DatasetWriter.

Caution

Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.

Parameters:

spec (string) – partition identifier

add_read_partitions(spec)#

Add a partition or range of partitions to read.

Caution

You cannot manually add read partitions when running inside a Python recipe. They are automatically computed according to the partition dependencies defined on the recipe’s Input/Output tab.

Parameters:

spec (string) – partition spec, or partition identifier

read_schema(raise_if_empty=True)#

Get the schema of this dataset, as an array of column definitions.

Parameters:

raise_if_empty (bool, optional) – raise an exception if there is no column, defaults to True

Returns:

list of column definitions

Return type:

dataiku.core.dataset.Schema
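A minimal sketch of working with the returned schema, assuming that it can be iterated like a list and that each column definition is dict-like with name and type keys. `columns_of_type` is a hypothetical helper, not part of the Dataiku API.

```python
def columns_of_type(schema, wanted_type):
    """Return the names of the columns whose storage type equals wanted_type."""
    # Assumption: each column definition exposes "name" and "type" keys.
    return [column["name"] for column in schema if column["type"] == wanted_type]
```

With a dataset handle, this would be called as `columns_of_type(dataset.read_schema(), "string")`.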

list_partitions(raise_if_empty=True)#

List the partitions of this dataset, as an array of partition identifiers.

Usage example

import dataiku
from dataiku.scenario import Scenario

# build a list of partitions for use in a build/train step in a scenario
dataset = dataiku.Dataset("some_input_dataset")
partitions = dataset.list_partitions()
variable_value = ','.join(partitions)

# set as a variable, to use in steps after this one
Scenario().set_scenario_variables(some_variable_name=variable_value)
Parameters:

raise_if_empty (bool, optional) – raise an exception if there is no partition, defaults to True

Returns:

list of partition identifiers

Return type:

list[string]

set_preparation_steps(steps, requested_output_schema, context_project_key=None)#

Set preparation steps.

Caution

for internal use

Parameters:
  • steps (list) – list of steps

  • requested_output_schema (dict) – output schema with a key columns containing a list of columns definition (name, type, …)

  • context_project_key (string, optional) – context project key, defaults to None

get_fast_path_dataframe(auto_fallback=False, columns=None, pandas_read_kwargs=None, print_deep_memory_usage=True)#

Reads the dataset as a Pandas dataframe, using fast-path access (without going through DSS), if possible.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

The fast path method provides better performance than the usual get_dataframe() method, but is only compatible with some dataset types and formats.

Fast path requires the “details readable” permission to be granted on the connection.

Dataframes obtained using this method may differ from those using get_dataframe(), notably around schemas and data. get_dataframe() provides a unified API with the same schema and data for all connections. On the other hand, this method uses dataset-specific access patterns that may yield different results.

At the moment, this fast path is available for:

  • S3 datasets using Parquet. This requires the additional s3fs package, as well as fastparquet or pyarrow

  • Snowflake datasets. This requires the additional snowflake-connector-python[pandas] package

Parameters:
  • columns (list) – List of columns to read, or None for all columns

  • auto_fallback (boolean) – If fast path is impossible and auto_fallback is True, then a regular get_dataframe() call will be used. If auto_fallback is False, this method will fail

  • print_deep_memory_usage (bool) – After reading the dataframe, Dataiku prints the memory usage of the dataframe. When this is enabled, this will provide the accurate memory usage, including for string columns. This can have a small performance impact. Defaults to True

  • pandas_read_kwargs (dict) – For the case where the read is mediated by a call to pd.read_parquet, arguments to pass to the read_parquet function

get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, ascending=True, infer_with_pandas=True, parse_dates=True, bool_as_str=False, int_as_float=False, use_nullable_integers=False, categoricals=None, float_precision=None, na_values=None, keep_default_na=True, print_deep_memory_usage=True, skip_additional_data_checks=False, date_parser=None, override_dtypes=None, pandas_read_kwargs=None)#

Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

import dataiku

# read some dataset and print its shape
dataset = dataiku.Dataset("the_dataset_name")
df = dataset.get_dataframe()
print("Number of rows: %s" % df.shape[0])
print("Number of columns: %s" % df.shape[1])
Parameters:
  • columns (list) – when not None, returns only columns from the given list. defaults to None

  • limit (integer) – limits the number of rows returned, defaults to None

  • sampling – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string) – column used for “random-column” and “sort-column” sampling, defaults to None

  • ratio (float) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • ascending (boolean) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True

  • infer_with_pandas (bool) – uses the types detected by pandas rather than the dataset schema as detected in DSS, defaults to True

  • parse_dates (bool) – Only used when infer_with_pandas is False. Parses date columns in the DSS schema. Defaults to True

  • bool_as_str (bool) – Only used when infer_with_pandas is False. Leaves boolean values as string. Defaults to False

  • int_as_float (bool) – Only used when infer_with_pandas is False. Leaves int values as floats. Defaults to False

  • use_nullable_integers (bool) – Only used when infer_with_pandas is False. Use pandas nullable integer types, which allows missing values in integer columns

  • categoricals – Only used when infer_with_pandas is False. What columns to read as categoricals. This is particularly efficient for columns with low cardinality. Can be either “all_strings” to read all string columns as categorical, or a list of column names to read as categoricals

  • float_precision (string) – set Pandas converter, can be None, ‘high’, ‘legacy’ or ‘round_trip’, defaults to None. see Pandas.read_table documentation for more information

  • na_values (string/list/dict) –

    additional strings to recognize as NA/NaN, defaults to None. see Pandas.read_table documentation for more information

  • keep_default_na (bool) –

    whether or not to include the default NaN values when parsing the data, defaults to True. see Pandas.read_table documentation for more information

  • date_parser (function) –

    function to use for converting a sequence of string columns to an array of datetime instances, defaults to None. see Pandas.read_table documentation for more information

  • skip_additional_data_checks (bool) – Skip some data type checks. Enabling this can lead to strongly increased performance (up to x3). It is usually safe to enable this. Defaults to False

  • print_deep_memory_usage (bool) – After reading the dataframe, Dataiku prints the memory usage of the dataframe. When this is enabled, this will provide the accurate memory usage, including for string columns. This can have a small performance impact. Defaults to True

  • override_dtypes (dict) – If not None, overrides dtypes computed from schema. Defaults to None

  • pandas_read_kwargs (dict) – If not None, additional kwargs passed to pd.read_table. Defaults to None

Returns:

a Pandas dataframe object

Return type:

pandas.core.frame.DataFrame
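Since the whole result is held in memory, a common pattern is to restrict both the columns and the number of rows. The sketch below is one possible wrapper, not the library's recommended usage: `preview` and its `use_dss_schema` flag are hypothetical names, and the function only forwards the documented columns, limit and infer_with_pandas parameters.

```python
def preview(dataset, columns=None, n_rows=100, use_dss_schema=False):
    """Load at most n_rows rows of the selected columns into memory."""
    # sampling defaults to 'head', so limit keeps the first n_rows rows.
    return dataset.get_dataframe(
        columns=columns,
        limit=n_rows,
        # infer_with_pandas=False asks for the types of the DSS schema instead.
        infer_with_pandas=not use_dss_schema,
    )
```

With a dataset handle, `preview(dataset, columns=["id", "country"], n_rows=1000)` would return a dataframe of at most 1000 rows.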

to_html(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, apply_conditional_formatting=True, header=True, classes='', border=0, null_string='', indent_string=None, filter_expression=None)#

Render the dataset as an html table.

HTML tables are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this, or pass a value to the limit parameter.

# read some dataset and render its first 50 rows as an HTML table
dataset = dataiku.Dataset("the_dataset_name")
html = dataset.to_html(limit=50)
Parameters:
  • columns (list[str]) – when not None, returns only columns from the given list. Defaults to None

  • sampling – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string) – column used for “random-column” and “sort-column” sampling, defaults to None

  • limit (integer) – limits the number of rows returned, defaults to None

  • ratio (float) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • apply_conditional_formatting (bool) – true to apply conditional formatting as it has been defined in DSS Explore view

  • header (bool) – Whether to print column labels, default True.

  • classes (str or list[str]) – Name of the CSS class attached to TABLE tag in the generated HTML (or multiple classes as a list).

  • border (int) – A border attribute of the specified size is included in the opening <table> tag. Defaults to 0

  • null_string (str) – string to represent null values. Defaults to an empty string.

  • indent_string (str) – characters to use to indent the formatted HTML. If None or empty string, no indentation and no carriage return line feed. Defaults to None

  • filter_expression (str) – expression used to filter data using formula language, defaults to None. Not supported on datasets with preparation steps.

Returns:

an HTML representation of the dataset

Return type:

str

static get_dataframe_schema_st(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False, int_as_float=False, use_nullable_integers=False, categoricals=None)#

Extract information for Pandas from a schema.

See get_dataframe() for explanation of the other parameters

Parameters:

schema (list[dict]) – a schema definition as returned by read_schema()

Returns:

a list of 3 items:

  • a list of column names

  • a dict of column Numpy data types by name

  • a list of the indexes of the date columns, or False

Return type:

tuple[list,dict,list]

iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None, na_values=None, keep_default_na=True, date_parser=None, ascending=True, pandas_read_kwargs=None)#

Read the dataset to Pandas dataframes by chunks of fixed size with given data types.

import dataiku

dataset = dataiku.Dataset("my_dataset")
[names, dtypes, parse_date_columns] = dataiku.Dataset.get_dataframe_schema_st(dataset.read_schema())
chunk = 0
chunksize = 1000
headsize = 5
for df in dataset.iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize = chunksize):
    print("> chunk #", chunk, "- first", headsize, "rows of", df.shape[0])
    chunk += 1
    print(df.head(headsize))
Parameters:
  • names (list[string]) – list of column names

  • dtypes (dict) – dict of data types by column name

  • parse_date_columns (list) – a list of the indexes of the date columns, or False

  • chunksize (int, optional) – chunk size, defaults to 10000

  • limit (integer) – limits the number of rows returned, defaults to None

  • sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None

  • ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • float_precision (string, optional) –

    set Pandas converter, can be None, ‘high’, ‘legacy’ or ‘round_trip’, defaults to None. see Pandas.read_table documentation for more information

  • na_values (string/list/dict, optional) –

    additional strings to recognize as NA/NaN, defaults to None. see Pandas.read_table documentation for more information

  • keep_default_na (bool, optional) –

    whether or not to include the default NaN values when parsing the data, defaults to True. see Pandas.read_table documentation for more information

  • date_parser (function, optional) –

    function to use for converting a sequence of string columns to an array of datetime instances, defaults to None. see Pandas.read_table documentation for more information

  • ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True

  • pandas_read_kwargs (dict) – If not None, additional kwargs passed to pd.read_table. Defaults to None

Yield:

pandas.core.frame.DataFrame

Return type:

generator

iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, int_as_float=False, use_nullable_integers=False, categoricals=None, float_precision=None, na_values=None, keep_default_na=True, ascending=True, pandas_read_kwargs=None)#

Read the dataset to Pandas dataframes by chunks of fixed size.

Tip

Useful if the dataset doesn’t fit in RAM

import dataiku

dataset = dataiku.Dataset("my_dataset")
headsize = 5
for df in dataset.iter_dataframes(chunksize = 5000):
    print("> chunk of", df.shape[0], "rows")
    print(df.head(headsize))
Parameters:
  • chunksize (int, optional) – chunk size, defaults to 10000

  • infer_with_pandas (bool, optional) – use the types detected by pandas rather than the dataset schema as detected in DSS, defaults to True

  • limit (int, optional) – limits the number of rows returned (the sampling max row count), defaults to None

  • sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None

  • parse_dates (bool, optional) – whether date columns in the DSS dataset schema are parsed, defaults to True

  • ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • columns (list[str], optional) – specify the desired columns, defaults to None (all columns)

  • bool_as_str (bool, optional) – Only used when infer_with_pandas is False. Leaves boolean values as strings, defaults to False

  • int_as_float (bool, optional) – Only used when infer_with_pandas is False. Leaves int values as floats. Defaults to False

  • use_nullable_integers (bool, optional) – Only used when infer_with_pandas is False. Use pandas nullable integer types, which allows missing values in integer columns. Defaults to False

  • categoricals (string/list, optional) – Only used when infer_with_pandas is False. What columns to read as categoricals. This is particularly efficient for columns with low cardinality. Can be either “all_strings” to read all string columns as categorical, or a list of column names to read as categoricals

  • float_precision (string, optional) –

    set Pandas converter, can be None, ‘high’, ‘legacy’ or ‘round_trip’, defaults to None. see Pandas.read_table documentation for more information

  • na_values (string/list/dict, optional) –

    additional strings to recognize as NA/NaN, defaults to None. see Pandas.read_table documentation for more information

  • keep_default_na (bool, optional) –

    whether or not to include the default NaN values when parsing the data, defaults to True. see Pandas.read_table documentation for more information

  • ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True

  • pandas_read_kwargs (dict) – If not None, additional kwargs passed to pd.read_table. Defaults to None

Yield:

pandas.core.frame.DataFrame

Return type:

generator
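Chunked reading is mostly useful for computations that can be accumulated chunk by chunk. The sketch below counts the values of one column this way. `count_values` is a hypothetical helper, and missing values are not handled specially.

```python
from collections import Counter


def count_values(dataset, column, chunksize=10000):
    """Count the occurrences of each value of a column, one chunk at a time."""
    counts = Counter()
    for df in dataset.iter_dataframes(chunksize=chunksize, columns=[column]):
        # Only the current chunk is held in memory.
        for value in df[column]:
            counts[value] += 1
    return counts
```

With a dataset handle, `count_values(dataset, "country").most_common(5)` would list the five most frequent values.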

write_with_schema(df, drop_and_create=False, **kwargs)#

Write a pandas dataframe to this dataset (or its target partition, if applicable).

This variant replaces the schema of the output dataset with the schema of the dataframe.

Caution

Strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Note

The dataset must be writable, i.e. declared as an output, unless you instantiated the dataset with ignore_flow=True.

import dataiku
from dataiku import recipe

# simply copy the first recipe input dataset
# to the first recipe output dataset, with the schema

ds_input = recipe.get_inputs()[0]
df_input = ds_input.get_dataframe()
ds_output = recipe.get_outputs()[0]
ds_output.write_with_schema(df_input, True)
Parameters:
  • df (pandas.core.frame.DataFrame) – a pandas dataframe

  • drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False

  • dropAndCreate (bool, optional) – deprecated, use drop_and_create

write_dataframe(df, infer_schema=False, drop_and_create=False, **kwargs)#

Write a pandas dataframe to this dataset (or its target partition, if applicable).

This variant only edits the schema if infer_schema is True, otherwise you need to only write dataframes that have a compatible schema. Also see write_with_schema().

Caution

Strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Note

The dataset must be writable, i.e. declared as an output, unless you instantiated the dataset with ignore_flow=True.

Parameters:
  • df (pandas.core.frame.DataFrame) – a pandas dataframe

  • infer_schema (bool, optional) – whether to infer the schema from the dataframe, defaults to False

  • drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False

  • dropAndCreate (bool, optional) – deprecated, use drop_and_create

iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None, ascending=True)#

Get a generator of rows (as a dict-like object) in the data (or its selected partitions, if applicable).

Values are cast according to their types. Strings are parsed into “unicode” values.

Parameters:
  • limit (int, optional) – maximum number of rows to be emitted, defaults to None

  • sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None

  • ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • log_every (int, optional) – print out the number of rows read on stdout, defaults to -1 (no log)

  • timeout (int, optional) – set a timeout in seconds, defaults to 30

  • columns (list[str], optional) – specify the desired columns, defaults to None (all columns)

  • ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True

Yield:

dataiku.core.dataset.DatasetCursor

Return type:

generator
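A minimal sketch of row-by-row reading, relying on the fact that each row is dict-like and can therefore be indexed by column name. `first_values` is a hypothetical helper.

```python
def first_values(dataset, column, n_rows=10):
    """Collect the values of one column from the first n_rows rows."""
    values = []
    # Restricting columns avoids transferring data that is not used.
    for row in dataset.iter_rows(limit=n_rows, columns=[column]):
        values.append(row[column])
    return values
```

With a dataset handle, `first_values(dataset, "id", n_rows=5)` would return a list of at most five values.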

raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None, read_session_id=None, filter_expression=None)#

Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported DSS output format.

Caution

You MUST close the file handle. Failure to do so will result in resource leaks.

After closing, you can also call verify_read() to check for any errors that occurred while reading the dataset data.

import uuid
import dataiku
from dataiku.core.dataset import create_sampling_argument

dataset = dataiku.Dataset("customers_partitioned")
read_session_id = str(uuid.uuid4())
sampling = create_sampling_argument(sampling='head', limit=5)
resp = dataset.raw_formatted_data(sampling=sampling, format="json", read_session_id=read_session_id)
print(resp.data)
resp.close()
dataset.verify_read(read_session_id)  # raises an exception if the read was not fully completed
print("read completed successfully")
Parameters:
  • sampling (dict, optional) – a dict of sampling specs, see dataiku.core.dataset.create_sampling_argument(), defaults to None

  • columns (list[str], optional) – list of desired columns, defaults to None (all columns)

  • format (str, optional) – output format, defaults to “tsv-excel-noheader”. Supported formats are : “json”, “tsv-excel-header” (tab-separated with header) and “tsv-excel-noheader” (tab-separated without header)

  • format_params (dict, optional) – dict of output format parameters, defaults to None

  • read_session_id (str, optional) – identifier of the read session, used to check at the end if the read was successful, defaults to None

  • filter_expression (str, optional) – expression used to filter data using formula language, defaults to None

Returns:

an HTTP response

Return type:

urllib3.response.HTTPResponse
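One way to make sure the handle is always closed is to wrap it with contextlib.closing, so that close() runs even if reading raises. This is a sketch, not the only valid pattern: `read_raw` is a hypothetical helper, and it assumes the returned object exposes read() and close() as a file-like object.

```python
import uuid
from contextlib import closing


def read_raw(dataset, fmt="tsv-excel-header", columns=None):
    """Read the whole dataset as raw bytes, then check that the read completed."""
    read_session_id = str(uuid.uuid4())
    response = dataset.raw_formatted_data(
        columns=columns, format=fmt, read_session_id=read_session_id
    )
    # closing() calls response.close() when the block exits, even on error.
    with closing(response) as stream:
        payload = stream.read()
    # Raises if an error occurred on the DSS side during the read.
    dataset.verify_read(read_session_id)
    return payload
```

The payload is fully loaded in memory here, so this variant is only suitable for small datasets or when sampling or columns restrict the output.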

verify_read(read_session_id)#

Verify that no error occurred when using raw_formatted_data() to read a dataset.

Use the same read_session_id that you passed to the call to raw_formatted_data().

Parameters:

read_session_id (str) – identifier of the read session

Raises:

Exception – if an error occurred during the read

iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None, ascending=True)#

Get the rows of the dataset as tuples.

The order and type of the values match the dataset’s schema.

Values are cast according to their types. Strings are parsed into “unicode” values.

Parameters:
  • limit (int, optional) – maximum number of rows to be emitted, defaults to None (all rows)

  • sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None

  • ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • log_every (int, optional) – print out the number of rows read on stdout, defaults to -1 (no log)

  • timeout (int, optional) – time (in seconds) of inactivity after which we want to close the generator if nothing has been read. Without it notebooks typically tend to leak “DKU” processes, defaults to 30

  • columns (list[str], optional) – list of desired columns, defaults to None (all columns)

  • ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True

Yield:

a tuple of column values

Return type:

generator

get_writer()#

Get a stream writer for this dataset (or its target partition, if applicable).

Caution

The writer must be closed as soon as you don’t need it.

Returns:

a stream writer

Return type:

dataiku.core.dataset_write.DatasetWriter
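A minimal sketch of closing the writer reliably, assuming that DatasetWriter can be used as a context manager and offers write_row_dict(), as the continuous writer does in its own usage example. `write_rows` is a hypothetical helper, and the dataset schema is assumed to have been set beforehand.

```python
def write_rows(dataset, rows):
    """Write an iterable of dicts to the dataset and return the number of rows written."""
    written = 0
    # The with-block closes the writer even if a row fails to write.
    with dataset.get_writer() as writer:
        for row in rows:
            writer.write_row_dict(row)
            written += 1
    return written
```

With a dataset handle, `write_rows(dataset, [{"data": "first"}, {"data": "second"}])` would write two rows.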

get_continuous_writer(source_id, split_id=0)#

Get a stream writer for this dataset (or its target partition, if applicable).

dataset = dataiku.Dataset("wikipedia_dataset")
dataset.write_schema([{"name":"data", "type":"string"}, ...])
with dataset.get_continuous_writer(...) as writer:
    for msg in message_iterator:
        writer.write_row_dict({"data":msg.data, ...})
        writer.checkpoint("this_recipe", "some state")
Parameters:
  • source_id (str) – identifier of the source of the stream

  • split_id (int, optional) – split id in the output (for concurrent usage), defaults to 0

Returns:

a stream writer

Return type:

dataiku.core.continuous_write.DatasetContinuousWriter

write_schema(columns, drop_and_create=False, **kwargs)#

Write the dataset schema into the dataset JSON definition file.

Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset.

Caution

Obviously, this must be used with caution.

Parameters:
  • columns (list) – see read_schema()

  • drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False

  • dropAndCreate (bool, optional) – deprecated, use drop_and_create
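When the schema is only known by the code, the columns list can be built programmatically. The sketch below assumes that each column definition is a dict with name and type keys. `build_columns` and `declare_schema` are hypothetical helpers, and the type names used with them (such as "string" or "bigint") are only illustrative.

```python
def build_columns(types_by_name):
    """Turn a {column name: storage type} mapping into a list of column definitions."""
    return [{"name": name, "type": dss_type} for name, dss_type in types_by_name.items()]


def declare_schema(dataset, types_by_name):
    """Build the column definitions, write them as the dataset schema and return them."""
    columns = build_columns(types_by_name)
    dataset.write_schema(columns)
    return columns
```

With a dataset handle, `declare_schema(dataset, {"id": "bigint", "name": "string"})` would declare two columns, in that order.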

write_schema_from_dataframe(df, drop_and_create=False, **kwargs)#

Set the schema of this dataset to the schema of a Pandas dataframe.

import dataiku

input_ds = dataiku.Dataset("input_dataset")
my_input_dataframe = input_ds.get_dataframe()
output_ds = dataiku.Dataset("output_dataset")

# Set the schema of "output_ds" to match the columns of "my_input_dataframe"
output_ds.write_schema_from_dataframe(my_input_dataframe)
Parameters:
  • df (pandas.core.frame.DataFrame) – a Pandas dataframe

  • drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False

  • dropAndCreate (bool, optional) – deprecated, use drop_and_create

read_metadata()#

Get the metadata attached to this dataset.

The metadata contains the label, description, checklists, tags and custom metadata of the dataset.

Returns:

the metadata as a dict, with fields:

  • label : label of the object (not defined for recipes)

  • description : description of the object (not defined for recipes)

  • checklists : checklists of the object, as a dict with a checklists field, which is a list of checklists, each a dict of fields:

    • id : identifier of the checklist

    • title : label of the checklist

    • createdBy : user who created the checklist

    • createdOn : timestamp of creation, in milliseconds

    • items : list of the items in the checklist, each a dict of

      • done : True if the item has been done

      • text : label of the item

      • createdBy : who created the item

      • createdOn : when the item was created, as a timestamp in milliseconds

      • stateChangedBy : who ticked the item as done (or not done)

      • stateChangedOn : when the item was last changed to done (or not done), as a timestamp in milliseconds

  • tags : list of tags, each a string

  • custom : custom metadata, as a dict with a kv field, which is a dict with any contents the user wishes

  • customFields : dict of custom field info (not defined for recipes)

Return type:

dict

write_metadata(meta)#

Write the metadata to the dataset.

Note

You should only write metadata that you obtained via read_metadata() and then modified.

Parameters:

meta (dict) – metadata specifications as dict, see read_metadata()
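The read-modify-write cycle can be sketched as follows, using the documented tags field. `add_tag` is a hypothetical helper, not part of the Dataiku API.

```python
def add_tag(dataset, tag):
    """Add a tag to the dataset metadata; return True if the metadata was changed."""
    # Start from the current metadata so that the other fields are preserved.
    meta = dataset.read_metadata()
    tags = meta.setdefault("tags", [])
    if tag in tags:
        return False
    tags.append(tag)
    dataset.write_metadata(meta)
    return True
```

With a dataset handle, `add_tag(dataset, "validated")` would write the metadata only when the tag is not already present.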

get_config()#

Get the dataset config.

Returns:

all dataset settings, many of which depend on the dataset type. The main settings keys are:

  • type: type of the dataset such as “PostgreSQL”, “Filesystem”, etc…

  • name: name of the dataset

  • projectKey: project hosting the dataset

  • schema: dataset schema as dict with ‘columns’ definition

  • partitioning: partitions settings as dict

  • managed: True if the dataset is managed

  • readWriteOptions: dict of read or write options

  • versionTag: version info as dict with ‘versionNumber’, ‘lastModifiedBy’, and ‘lastModifiedOn’

  • creationTag: creation info as dict with ‘versionNumber’, ‘lastModifiedBy’, and ‘lastModifiedOn’

  • tags: list of tags

  • metrics: dict with a list of probes, see Metrics and checks

Return type:

dict
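As an illustration, the sketch below extracts a few of the keys listed above. `summarize_config` is a hypothetical helper. It treats every key as optional and assumes that each column definition carries a name key.

```python
def summarize_config(config):
    """Extract the type, the managed flag and the column names from a dataset config."""
    columns = config.get("schema", {}).get("columns", [])
    return {
        "type": config.get("type"),
        "managed": config.get("managed", False),
        "column_names": [column["name"] for column in columns],
    }
```

With a dataset handle, this would be called as `summarize_config(dataset.get_config())`.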

get_last_metric_values(partition='')#

Get the set of last values of the metrics on this dataset.

Parameters:

partition (string) – (optional), the partition for which to fetch the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL

Return type:

dataiku.core.metrics.ComputedMetrics

get_metric_history(metric_lookup, partition='')#

Get the set of all values a given metric took on this dataset.

Parameters:
  • metric_lookup (string) – metric name or unique identifier

  • partition (string) – (optional), the partition for which to fetch the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL

Return type:

dict

save_external_metric_values(values_dict, partition='')#

Save metrics on this dataset.

The metrics are saved with the type “external”.

Parameters:
  • values_dict (dict) – the values to save, as a dict. The keys of the dict are used as metric names

  • partition (string) – (optional), the partition for which to save the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL

Return type:

dict
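
As a rough sketch of how values_dict might be assembled, the helper below saves two numbers computed outside DSS. The function name and the metric names row_count and null_ratio are invented for the example.

```python
def save_row_stats(dataset, row_count, null_count, partition=""):
    """Save two externally computed numbers as "external" metrics.

    The metric names ("row_count", "null_ratio") are illustrative only;
    the keys of the dict are used as metric names.
    """
    ratio = (null_count / row_count) if row_count else 0.0
    values = {"row_count": row_count, "null_ratio": round(ratio, 4)}
    return dataset.save_external_metric_values(values, partition=partition)

# Usage inside DSS (assumes a dataset named "my_dataset" exists):
#   import dataiku
#   save_row_stats(dataiku.Dataset("my_dataset"), row_count=200, null_count=5)
```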

get_last_check_values(partition='')#

Get the set of last values of the checks on this dataset.

Parameters:

partition (string) – (optional), the partition for which to fetch the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL

Return type:

dataiku.core.metrics.ComputedChecks

save_external_check_values(values_dict, partition='')#

Save checks on this dataset.

The checks are saved with the type “external”

Parameters:
  • values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names

  • partition (string) – (optional), the partition for which to save the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL

Return type:

dict

dataiku.core.dataset.create_sampling_argument(sampling='head', sampling_column=None, limit=None, ratio=None, ascending=True)#

Generate sampling parameters. Please see https://doc.dataiku.com/dss/latest/explore/sampling.html#sampling-methods for more information.

Parameters:
  • sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.

  • sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None

  • limit (int, optional) – set the sampling max rows count, defaults to None

  • ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count

  • ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True

Returns:

sampling parameters

Return type:

dict

class dataiku.core.dataset_write.DatasetWriter(dataset)#

Handle to write to a dataset.

Important

Do not instantiate directly, use dataiku.Dataset.get_writer() instead.

Attention

An instance of DatasetWriter MUST be closed after use. Failure to close it will lead to incomplete or no data being written to the output dataset.

active_writers = {}#
static atexit_handler()#
write_tuple(row)#

Write a single row from a tuple or list of column values.

Columns must be given in the order of the dataset schema. Strings MUST be given as Unicode objects. Giving str objects will fail.

Note

The schema of the dataset MUST be set before using this.

Parameters:

row (list) – a list of values, one per column in the output schema. Columns cannot be omitted.

write_row_array(row)#

Write a single row from an array.

Caution

Deprecated, use write_tuple()

write_row_dict(row_dict)#

Write a single row from a dict of column name -> column value.

Some columns can be omitted, in which case empty values are inserted instead. Strings MUST be given as Unicode objects. Giving str objects will fail.

Note

The schema of the dataset MUST be set before using this.

Parameters:

row_dict (dict) – a dict of column name to column value

write_dataframe(df)#

Append a Pandas dataframe to the dataset being written.

This method can be called multiple times (especially when you have been using dataiku.Dataset.iter_dataframes() to read from an input dataset).

Strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Parameters:

df (DataFrame) – a Pandas dataframe

close()#

Close this dataset writer.
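
Because a writer must always be closed, one possible way to guarantee it is a try/finally block, sketched below. write_rows is a hypothetical helper, and the sketch assumes the schema of the output dataset has already been set.

```python
def write_rows(dataset, rows):
    """Write an iterable of dicts (column name -> value), then close the writer.

    `dataset` is expected to expose get_writer(), as dataiku.Dataset does.
    write_rows itself is a hypothetical helper.
    """
    writer = dataset.get_writer()
    written = 0
    try:
        for row in rows:
            writer.write_row_dict(row)   # omitted columns get empty values
            written += 1
    finally:
        writer.close()                   # runs even if a row fails to write
    return written

# Usage inside DSS (assumes an output dataset named "my_output" exists):
#   import dataiku
#   write_rows(dataiku.Dataset("my_output"), [{"id": 1, "name": "foo"}])
```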

class dataiku.core.dataset.Schema(data)#

List of the definitions of the columns in a dataset.

Each column definition is a dict with at least a name field and a type field. Available columns types include:

type       note          sample value
string                   b'foobar'
bigint     64 bits       9223372036854775807
int        32 bits       2147483647
smallint   16 bits       32767
tinyint    8 bits        127
double     64 bits       3.1415
float      32 bits       3.14
boolean    32 bits       true
date       string        "2020-12-31T00:00:00.101Z"
array      json string   '["foo","bar"]'
map        json string   '{"foo":"bar"}'
object     json string   '{"foo":{"bar":[1,2,3]}}'
geopoint   string        "POINT(12 24)"
geometry   string        "POLYGON((1.1 1.1, 2.1 0, 0.1 0))"

Each column definition has fields:

  • name: name of the column as string

  • type: type of the column as string

  • maxLength: maximum length of values (when applicable, typically for string)

  • comment: user comment on the column

  • timestampNoTzAsDate: for columns of type “date” in non-managed datasets, whether the actual type in the underlying SQL database or file bears timezone information

  • originalType and originalSQLType: for columns in non-managed datasets, the name of the column type in the underlying SQL database or file

  • arrayContent: for array-typed columns, a column definition that applies to all elements in the array

  • mapKeys and mapValues: for map-typed columns, a column definition that applies to all keys (resp. values) in the map

  • objectFields: for object-typed columns, a list of column definitions for the sub-fields in the object
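
A column definition is a plain dict, so it can be built with a small helper such as the one sketched here. make_column and VALID_TYPES are hypothetical names; the type names come from the list of available column types above, and only the name, type, comment and maxLength fields are handled.

```python
# Column types listed in this section.
VALID_TYPES = {
    "string", "bigint", "int", "smallint", "tinyint", "double", "float",
    "boolean", "date", "array", "map", "object", "geopoint", "geometry",
}

def make_column(name, col_type, comment=None, max_length=None):
    """Build a column definition dict with a name field and a type field."""
    if col_type not in VALID_TYPES:
        raise ValueError("unknown column type: %r" % col_type)
    column = {"name": name, "type": col_type}
    if comment is not None:
        column["comment"] = comment
    if max_length is not None:
        column["maxLength"] = max_length   # typically used for string columns
    return column

columns = [
    make_column("id", "int", comment="primary key"),
    make_column("label", "string", max_length=50),
]
```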

class dataiku.core.dataset.DatasetCursor(val, col_names, col_idx)#

A dataset cursor iterating on the rows.

Caution

Do not instantiate this class manually; use dataiku.Dataset.iter_rows() instead.

column_id(name)#

Get a column index from its name.

Parameters:

name (str) – column name

Returns:

the column index

Return type:

int

keys()#

Get the set of column names.

Returns:

list of column names

Return type:

list[str]

items()#

Get the full row.

Returns:

a list of (column, value) tuples

Return type:

list[tuple]

values()#

Get values in the row.

Returns:

list of column values

Return type:

list

get(col_name, default_value=None)#

Get a value by its column name.

Parameters:
  • col_name (str) – a column name

  • default_value (str, optional) – value to return if the column is not present, defaults to None

Returns:

the value of the column

Return type:

depends on the column’s type
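
Since each row exposes get() with a default value, rows can be processed without checking first whether a column is present. The sketch below counts the values of one column; count_values is a hypothetical helper and the column name in the usage comment is an assumption.

```python
from collections import Counter

def count_values(rows, column, missing="(missing)"):
    """Count the values of `column` over an iterable of rows.

    Each row only needs a get(col_name, default_value) method, which
    DatasetCursor provides. Rows without the column count as `missing`.
    """
    counts = Counter()
    for row in rows:
        counts[row.get(column, missing)] += 1
    return counts

# Usage inside DSS (assumes a dataset "my_dataset" with a "country" column):
#   import dataiku
#   count_values(dataiku.Dataset("my_dataset").iter_rows(), "country")
```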

The dataikuapi.dss.dataset package#

Main DSSDataset class#

class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)#

A dataset on the DSS instance. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.get_dataset()

property id#

Get the dataset identifier.

Return type:

string

property name#

Get the dataset name.

Return type:

string

delete(drop_data=False)#

Delete the dataset.

Parameters:

drop_data (bool) – Should the data of the dataset be dropped, defaults to False

rename(new_name)#

Rename the dataset with the new specified name

Parameters:

new_name (str) – the new name of the dataset

get_settings()#

Get the settings of this dataset as a DSSDatasetSettings, or one of its subclasses.

Known subclasses of DSSDatasetSettings include FSLikeDatasetSettings and SQLDatasetSettings.

You must use save() on the returned object to make your changes effective on the dataset.

# Example: activating discrete partitioning on a SQL dataset
dataset = project.get_dataset("my_database_table")
settings = dataset.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()
Return type:

DSSDatasetSettings

get_definition()#

Get the raw settings of the dataset as a dict.

Caution

Deprecated. Use get_settings()

Return type:

dict

set_definition(definition)#

Set the definition of the dataset

Caution

Deprecated. Use get_settings() and DSSDatasetSettings.save()

Parameters:

definition (dict) – the definition, as a dict. You should only set a definition object that has been retrieved using the get_definition() call.

exists()#

Test if the dataset exists.

Returns:

whether this dataset exists

Return type:

bool

get_schema()#

Get the dataset schema.

Returns:

a dict object of the schema, with the list of columns.

Return type:

dict

set_schema(schema)#

Set the dataset schema.

Parameters:

schema (dict) – the desired schema for the dataset, as a dict. All columns have to provide their name and type.
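
get_schema() and set_schema() can be combined to add a column only when it is missing. The sketch below assumes the schema dict holds its column list under a "columns" key; ensure_column is a hypothetical helper.

```python
def ensure_column(dataset, name, col_type="string"):
    """Add a column to the dataset schema unless one with that name exists.

    Returns True if the schema was changed. Assumes the schema dict keeps
    its column definitions in a list under the "columns" key.
    """
    schema = dataset.get_schema()
    columns = schema.setdefault("columns", [])
    if any(col["name"] == name for col in columns):
        return False                      # nothing to do
    columns.append({"name": name, "type": col_type})
    dataset.set_schema(schema)            # every column carries name and type
    return True

# Usage (assumes `project` is a DSSProject handle):
#   ensure_column(project.get_dataset("my_dataset"), "country", "string")
```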

get_metadata()#

Get the metadata attached to this dataset. The metadata contains the label, description, checklists, tags and custom metadata of the dataset.

Returns:

a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/latest/rest/

Return type:

dict

set_metadata(metadata)#

Set the metadata on this dataset.

Parameters:

metadata (dict) – the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the get_metadata() call.

iter_rows(partitions=None)#

Get the dataset data as a row-by-row iterator.

Parameters:

partitions (Union[string, list[string]]) – (optional) partition identifier, or list of partitions to include, if applicable.

Returns:

an iterator over the rows, each row being a list of values. The order of values in the list is the same as the order of columns in the schema returned by get_schema()

Return type:

generator[list]
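
Because each row is a bare list, it is often convenient to pair the values with the column names from get_schema(). A minimal sketch, in which iter_rows_as_dicts is a hypothetical helper and the schema is assumed to keep its columns under a "columns" key:

```python
def iter_rows_as_dicts(dataset, partitions=None):
    """Yield each row as a dict of column name -> value.

    Relies on the values of a row being in the same order as the columns
    of the schema returned by get_schema().
    """
    names = [col["name"] for col in dataset.get_schema()["columns"]]
    for row in dataset.iter_rows(partitions=partitions):
        yield dict(zip(names, row))

# Usage (assumes `project` is a DSSProject handle):
#   for record in iter_rows_as_dicts(project.get_dataset("my_dataset")):
#       print(record)
```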

list_partitions()#

Get the list of all partitions of this dataset.

Returns:

the list of partitions, as a list of strings.

Return type:

list[string]

clear(partitions=None)#

Clear data in this dataset.

Parameters:

partitions (Union[string, list[string]]) – (optional) partition identifier, or list of partitions to clear. When not provided, the entire dataset is cleared.

Returns:

a dict containing the method call status.

Return type:

dict

copy_to(target, sync_schema=True, write_mode='OVERWRITE')#

Copy the data of this dataset to another dataset.

Parameters:
  • target (dataikuapi.dss.dataset.DSSDataset) – an object representing the target of this copy.

  • sync_schema (bool) – (optional) update the target dataset schema to make it match the source dataset schema.

  • write_mode (string) – (optional) OVERWRITE (default) or APPEND. If OVERWRITE, the output dataset is cleared prior to writing the data.

Returns:

a DSSFuture representing the operation.

Return type:

dataikuapi.dss.future.DSSFuture
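
copy_to() returns as soon as the copy has been started, so a caller that needs the copied data has to wait on the returned future. The sketch below does so with DSSFuture.wait_for_result(), a method that this section does not describe; copy_and_wait is a hypothetical helper.

```python
def copy_and_wait(source, target, append=False):
    """Copy `source` into `target` and block until the copy has finished.

    Both arguments are expected to be DSSDataset handles.
    """
    write_mode = "APPEND" if append else "OVERWRITE"
    future = source.copy_to(target, sync_schema=True, write_mode=write_mode)
    return future.wait_for_result()   # method of DSSFuture, not described here

# Usage (assumes `project` is a DSSProject handle and both datasets exist):
#   copy_and_wait(project.get_dataset("src"), project.get_dataset("dst"))
```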

search_data_elastic(query_string, start=0, size=128, sort_columns=None, partitions=None)#

Caution

Only for datasets on Elasticsearch connections

Query the service with a search string to directly fetch data

Parameters:
  • query_string (str) – Elasticsearch compatible query string

  • start (int) – row to start fetching the data

  • size (int) – number of results to return

  • sort_columns (list) – list of {“column”, “order”} dict, which is the order to fetch data. “order” is “asc” for ascending, “desc” for descending

  • partitions (list) – if the dataset is partitioned, a list of partition ids to search

Returns:

a dict containing “columns”, “rows”, “warnings”, “found” (when start == 0)

Return type:

dict

build(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)#

Start a new job to build this dataset and wait for it to complete. Raises if the job failed.

job = dataset.build()
print("Job %s done" % job.id)
Parameters:
  • job_type – the job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD

  • partitions – if the dataset is partitioned, a list of partition ids to build

  • wait (bool) – whether to wait for the job completion before returning the job handle, defaults to True

  • no_fail – if True, does not raise if the job failed.

Returns:

the dataikuapi.dss.job.DSSJob job handle corresponding to the built job

Return type:

dataikuapi.dss.job.DSSJob

synchronize_hive_metastore()#

Synchronize this dataset with the Hive metastore

update_from_hive()#

Resynchronize this dataset from its Hive definition

compute_metrics(partition='', metric_ids=None, probes=None)#

Compute metrics on a partition of this dataset.

If neither metric ids nor a custom set of probes are specified, the metrics set up on the dataset are used.

Parameters:
  • partition (string) – (optional) partition identifier, use ALL to compute metrics on all data.

  • metric_ids (list[string]) – (optional) ids of the metrics to build

Returns:

a metric computation report, as a dict

Return type:

dict

run_checks(partition='', checks=None)#

Run checks on a partition of this dataset.

If the checks are not specified, the checks set up on the dataset are used.

Parameters:
  • partition (str) – (optional) partition identifier, use ALL to run checks on all data.

  • checks (list[string]) – (optional) ids of the checks to run.

Returns:

a checks computation report, as a dict.

Return type:

dict
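
Checks are typically evaluated against metric values, so a common sequence is to compute the metrics first and run the checks afterwards. A sketch under that assumption, with refresh_metrics_and_checks as a hypothetical helper:

```python
def refresh_metrics_and_checks(dataset, partition=""):
    """Compute the metrics set up on the dataset, then run its checks.

    Returns both reports in one dict. Pass partition="ALL" to work on all
    the data of a partitioned dataset.
    """
    metrics_report = dataset.compute_metrics(partition=partition)
    checks_report = dataset.run_checks(partition=partition)
    return {"metrics": metrics_report, "checks": checks_report}

# Usage (assumes `project` is a DSSProject handle):
#   reports = refresh_metrics_and_checks(project.get_dataset("my_dataset"))
```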

uploaded_add_file(fp, filename)#

Add a file to an “uploaded files” dataset

Parameters:
  • fp (file) – A file-like object that represents the file to upload

  • filename (str) – The filename for the file to upload
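
As an illustration, a local file can be sent to an "uploaded files" dataset by opening it in binary mode and reusing its base name as the filename. upload_local_file is a hypothetical helper and the path in the usage comment is an assumption.

```python
import os

def upload_local_file(dataset, path):
    """Upload the file at `path` to an "uploaded files" dataset.

    The base name of the path is used as the filename in the dataset.
    """
    filename = os.path.basename(path)
    with open(path, "rb") as fp:          # file-like object, closed afterwards
        dataset.uploaded_add_file(fp, filename)
    return filename

# Usage (assumes `project` is a DSSProject handle and the file exists):
#   upload_local_file(project.get_dataset("my_uploads"), "/tmp/data.csv")
```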

uploaded_list_files()#

List the files in an “uploaded files” dataset.

Returns:

uploaded files metadata as a list of dicts, with one dict per file.

Return type:

list[dict]

create_prediction_ml_task(target_variable, ml_backend_type='PY_MEMORY', guess_policy='DEFAULT', prediction_type=None, wait_guess_complete=True)#

Creates a new prediction task in a new visual analysis lab for a dataset.

Parameters:
  • target_variable (str) – the variable to predict

  • ml_backend_type (str) – ML backend to use, one of PY_MEMORY, MLLIB or H2O (defaults to PY_MEMORY)

  • guess_policy (str) – Policy to use for setting the default parameters. Valid values are: DEFAULT, SIMPLE_FORMULA, DECISION_TREE, EXPLANATORY and PERFORMANCE (defaults to DEFAULT)

  • prediction_type (str) – The type of prediction problem this is. If not provided the prediction type will be guessed. Valid values are: BINARY_CLASSIFICATION, REGRESSION, MULTICLASS (defaults to None)

  • wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms (defaults to True). You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Returns:

An ML task handle of type ‘PREDICTION’

Return type:

dataikuapi.dss.ml.DSSMLTask

create_clustering_ml_task(input_dataset, ml_backend_type='PY_MEMORY', guess_policy='KMEANS', wait_guess_complete=True)#

Creates a new clustering task in a new visual analysis lab for a dataset.

The returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms.

You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Parameters:
  • input_dataset (string) – The dataset to use for training/testing the model

  • ml_backend_type (str) – ML backend to use, one of PY_MEMORY, MLLIB or H2O (defaults to PY_MEMORY)

  • guess_policy (str) – Policy to use for setting the default parameters. Valid values are: KMEANS and ANOMALY_DETECTION (defaults to KMEANS)

  • wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms (defaults to True). You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Returns:

An ML task handle of type ‘CLUSTERING’

Return type:

dataikuapi.dss.ml.DSSMLTask

create_timeseries_forecasting_ml_task(target_variable, time_variable, timeseries_identifiers=None, guess_policy='TIMESERIES_DEFAULT', wait_guess_complete=True)#

Creates a new time series forecasting task in a new visual analysis lab for a dataset.

Parameters:
  • target_variable (string) – The variable to forecast

  • time_variable (string) – Column to be used as time variable. Should be a Date (parsed) column.

  • timeseries_identifiers (list) – List of columns to be used as time series identifiers (when the dataset has multiple series)

  • guess_policy (string) – Policy to use for setting the default parameters. Valid values are: TIMESERIES_DEFAULT, TIMESERIES_STATISTICAL, and TIMESERIES_DEEP_LEARNING

  • wait_guess_complete (boolean) – If False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Returns:

An ML task handle of type ‘PREDICTION’

Return type:

dataikuapi.dss.ml.DSSMLTask

create_causal_prediction_ml_task(outcome_variable, treatment_variable, prediction_type=None, wait_guess_complete=True)#

Creates a new causal prediction task in a new visual analysis lab for a dataset.

Parameters:
  • outcome_variable (string) – The outcome variable to predict.

  • treatment_variable (string) – The treatment variable.

  • prediction_type (string or None) – Valid values are: “CAUSAL_BINARY_CLASSIFICATION”, “CAUSAL_REGRESSION” or None (in this case prediction_type will be set by the Guesser)

  • wait_guess_complete (boolean) – If False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Returns:

An ML task handle of type ‘PREDICTION’

Return type:

dataikuapi.dss.ml.DSSMLTask

create_analysis()#

Create a new visual analysis lab for the dataset.

Returns:

A visual analysis handle

Return type:

dataikuapi.dss.analysis.DSSAnalysis

list_analyses(as_type='listitems')#

List the visual analyses on this dataset

Parameters:

as_type (str) – How to return the list. Supported values are “listitems” and “objects”, defaults to “listitems”

Returns:

The list of the analyses. If “as_type” is “listitems”, each one is returned as a dict; if “as_type” is “objects”, each one is returned as a dataikuapi.dss.analysis.DSSAnalysis

Return type:

list

delete_analyses(drop_data=False)#

Delete all analyses that have this dataset as input dataset. Also deletes ML tasks that are part of the analysis

Parameters:

drop_data (bool) – whether to drop data for all ML tasks in the analysis, defaults to False

list_statistics_worksheets(as_objects=True)#

List the statistics worksheets associated to this dataset.

Parameters:

as_objects (bool) – if true, returns the statistics worksheets as dataikuapi.dss.statistics.DSSStatisticsWorksheet, else as a list of dicts

Return type:

list of dataikuapi.dss.statistics.DSSStatisticsWorksheet

create_statistics_worksheet(name='My worksheet')#

Create a new worksheet in the dataset, and return a handle to interact with it.

Parameters:

name (string) – name of the worksheet

Returns:

a statistics worksheet handle

Return type:

dataikuapi.dss.statistics.DSSStatisticsWorksheet

get_statistics_worksheet(worksheet_id)#

Get a handle to interact with a statistics worksheet

Parameters:

worksheet_id (string) – the ID of the desired worksheet

Returns:

a statistics worksheet handle

Return type:

dataikuapi.dss.statistics.DSSStatisticsWorksheet

get_last_metric_values(partition='')#

Get the last values of the metrics on this dataset

Parameters:

partition (string) – (optional) partition identifier, use ALL to retrieve metric values on all data.

Returns:

a list of metric objects and their value

Return type:

dataikuapi.dss.metrics.ComputedMetrics

get_metric_history(metric, partition='')#

Get the history of the values of the metric on this dataset

Parameters:
  • metric (string) – id of the metric to get

  • partition (string) – (optional) partition identifier, use ALL to retrieve metric history on all data.

Returns:

a dict containing the values of the metric, cast to the appropriate type (double, boolean,…)

Return type:

dict

get_info()#

Retrieve all the information about a dataset

Returns:

a DSSDatasetInfo containing all the information about a dataset.

Return type:

DSSDatasetInfo

get_zone()#

Get the flow zone of this dataset

Return type:

dataikuapi.dss.flow.DSSFlowZone

move_to_zone(zone)#

Move this object to a flow zone

Parameters:

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to move the object

share_to_zone(zone)#

Share this object to a flow zone

Parameters:

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to share the object

unshare_from_zone(zone)#

Unshare this object from a flow zone

Parameters:

zone (object) – a dataikuapi.dss.flow.DSSFlowZone from where to unshare the object

get_usages()#

Get the recipes or analyses referencing this dataset

Returns:

a list of usages

Return type:

list[dict]

get_object_discussions()#

Get a handle to manage discussions on the dataset

Returns:

the handle to manage discussions

Return type:

dataikuapi.dss.discussion.DSSObjectDiscussions

test_and_detect(infer_storage_types=False)#

Used internally by autodetect_settings(). It is not usually required to call this method.

Parameters:

infer_storage_types (bool) – whether to infer storage types

autodetect_settings(infer_storage_types=False)#

Detect appropriate settings for this dataset using Dataiku detection engine

Parameters:

infer_storage_types (bool) – whether to infer storage types

Returns:

new suggested settings, which you can apply by calling DSSDatasetSettings.save()

Return type:

DSSDatasetSettings or a subclass

get_as_core_dataset()#

Get the dataiku.Dataset object corresponding to this dataset

Return type:

dataiku.Dataset

new_code_recipe(type, code=None, recipe_name=None)#

Start the creation of a new code recipe taking this dataset as input.

Parameters:
  • type (str) – type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …).

  • code (str) – the code of the recipe.

  • recipe_name (str) – (optional) base name for the new recipe.

Returns:

a handle to the new recipe’s creator object.

Return type:

Union[dataikuapi.dss.recipe.CodeRecipeCreator, dataikuapi.dss.recipe.PythonRecipeCreator]

new_recipe(type, recipe_name=None)#

Start the creation of a new recipe taking this dataset as input. For more details, please see dataikuapi.dss.project.DSSProject.new_recipe()

Parameters:
  • type (str) – type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …).

  • recipe_name (str) – (optional) base name for the new recipe.

get_data_quality_rules()#

Get a handle to interact with the data quality rules of the dataset.

Returns:

A handle to the data quality rules of the dataset.

Return type:

dataikuapi.dss.data_quality.DSSDataQualityRuleSet

get_column_lineage(column, max_dataset_count=None)#

Get the full lineage (auto-computed and manual) information of a column in this dataset. Column relations with datasets from both local and foreign projects will be included in the result.

Parameters:
  • column (str) – name of the column to retrieve the lineage on.

  • max_dataset_count (integer) – (optional) the maximum number of datasets to query for. If none, then the max hard limit is used.

Returns:

the full column lineage (auto-computed and manual) as a list of relations.

Return type:

list of dict

Listing datasets#

class dataikuapi.dss.dataset.DSSDatasetListItem(client, data)#

An item in a list of datasets.

Caution

Do not instantiate this class, use dataikuapi.dss.project.DSSProject.list_datasets()

to_dataset()#

Gets a handle on the corresponding dataset.

Returns:

a handle on a dataset

Return type:

DSSDataset

property name#

Get the name of the dataset.

Return type:

string

property id#

Get the identifier of the dataset.

Return type:

string

property type#

Get the type of the dataset.

Return type:

string

property schema#

Get the dataset schema as a dict.

Returns:

a dict object of the schema, with the list of columns. See DSSDataset.get_schema()

Return type:

dict

property connection#

Get the name of the connection on which this dataset is attached, or None if there is no connection for this dataset.

Return type:

string

get_column(column)#

Get a given column in the dataset schema by its name.

Parameters:

column (str) – name of the column to find

Returns:

the column settings or None if column does not exist

Return type:

dict
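
The properties of list items make it possible to filter datasets without fetching each one individually. A small sketch, in which names_on_connection is a hypothetical helper and the connection name in the usage comment is an assumption:

```python
def names_on_connection(items, connection):
    """Return the sorted names of the datasets attached to `connection`.

    `items` is an iterable of objects exposing the `name` and `connection`
    properties, as DSSDatasetListItem does.
    """
    return sorted(item.name for item in items if item.connection == connection)

# Usage (assumes `project` is a DSSProject handle):
#   names_on_connection(project.list_datasets(), "filesystem_managed")
```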

Settings of datasets#

class dataikuapi.dss.dataset.DSSDatasetSettings(dataset, settings)#

Base settings class for a DSS dataset.

Caution

Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

get_raw()#

Get the raw dataset settings as a dict.

Return type:

dict

get_raw_params()#

Get the type-specific params, as a raw dict.

Return type:

dict

property type#

Returns the settings type as a string.

Return type:

string

property schema_columns#

Get the schema columns settings.

Returns:

a list of dicts with column settings.

Return type:

list[dict]

remove_partitioning()#

Reset partitioning settings to those of a non-partitioned dataset.

add_discrete_partitioning_dimension(dim_name)#

Add a discrete partitioning dimension to settings.

Parameters:

dim_name (string) – name of the partition to add.

add_time_partitioning_dimension(dim_name, period='DAY')#

Add a time partitioning dimension to settings.

Parameters:
  • dim_name (string) – name of the partition to add.

  • period (string) – (optional) time granularity of the created partition. Can be YEAR, MONTH, DAY, HOUR.

add_raw_schema_column(column)#

Add a column to the schema settings.

Parameters:

column (dict) – column settings to add.

property is_feature_group#

Indicates whether the Dataset is defined as a Feature Group, available in the Feature Store.

Return type:

bool

set_feature_group(status)#

(Un)sets the dataset as a Feature Group, available in the Feature Store. Changes of this property will be applied when calling save() and require the “Manage Feature Store” permission.

Parameters:

status (bool) – whether the dataset should be defined as a feature group

save()#

Save settings.
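
The partitioning methods only modify the settings object in memory; nothing changes on the dataset until save() is called. The sketch below chains them, with a hypothetical helper name and assumed dimension names "day" and "country".

```python
def partition_by_day_and_country(dataset, day_dim="day", country_dim="country"):
    """Replace the partitioning of a dataset with a day and a country dimension.

    `dataset` is expected to be a DSSDataset handle.
    """
    settings = dataset.get_settings()
    settings.remove_partitioning()        # start from a non-partitioned state
    settings.add_time_partitioning_dimension(day_dim, period="DAY")
    settings.add_discrete_partitioning_dimension(country_dim)
    settings.save()                       # makes the changes effective
    return settings

# Usage (assumes `project` is a DSSProject handle):
#   partition_by_day_and_country(project.get_dataset("my_dataset"))
```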

class dataikuapi.dss.dataset.SQLDatasetSettings(dataset, settings)#

Settings for a SQL dataset. This class inherits from DSSDatasetSettings.

Caution

Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

set_table(connection, schema, table, catalog=None)#

Sets this SQL dataset in ‘table’ mode, targeting a particular table of a connection. Leave catalog set to None to target the default database associated with the connection.

class dataikuapi.dss.dataset.FSLikeDatasetSettings(dataset, settings)#

Settings for a files-based dataset. This class inherits from DSSDatasetSettings.

Caution

Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

set_connection_and_path(connection, path)#

Set connection and path parameters.

Parameters:
  • connection (string) – connection to use.

  • path (string) – path to use.

get_raw_format_params()#

Get the raw format parameters as a dict.

Return type:

dict

set_format(format_type, format_params=None)#

Set format parameters.

Parameters:
  • format_type (string) – format type to use.

  • format_params (dict) – dict of parameters to assign to the formatParams settings section.

set_csv_format(separator=',', style='excel', skip_rows_before=0, header_row=True, skip_rows_after=0)#

Set format parameters for a csv-based dataset.

Parameters:
  • separator (string) – (optional) separator to use, default is ‘,’.

  • style (string) – (optional) style to use, default is ‘excel’.

  • skip_rows_before (int) – (optional) number of rows to skip before header, default is 0.

  • header_row (bool) – (optional) whether or not the header row is parsed, default is true.

  • skip_rows_after (int) – (optional) number of rows to skip after header, default is 0.
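
For instance, a files-based dataset could be switched to tab-separated parsing as sketched below. configure_tsv is a hypothetical helper, and it assumes get_settings() returns an FSLikeDatasetSettings for the dataset.

```python
def configure_tsv(dataset, skip_rows_before=0):
    """Configure a files-based dataset to be parsed as tab-separated values.

    `dataset` is expected to be a DSSDataset handle on a files-based dataset.
    """
    settings = dataset.get_settings()
    settings.set_csv_format(
        separator="\t",                   # tab instead of the default ','
        header_row=True,
        skip_rows_before=skip_rows_before,
    )
    settings.save()                       # makes the changes effective
    return settings

# Usage (assumes `project` is a DSSProject handle):
#   configure_tsv(project.get_dataset("my_files_dataset"))
```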

set_partitioning_file_pattern(pattern)#

Set the dataset partitioning file pattern.

Parameters:

pattern (str) – pattern to set.

Dataset Information#

class dataikuapi.dss.dataset.DSSDatasetInfo(dataset, info)#

Info class for a DSS dataset (Read-Only).

Caution

Do not instantiate this class directly, use DSSDataset.get_info()

get_raw()#

Get the raw dataset full information as a dict

Returns:

the raw dataset full information

Return type:

dict

property last_build_start_time#

The last build start time of the dataset as a datetime.datetime or None if there is no last build information.

Returns:

the last build start time

Return type:

datetime.datetime or None

property last_build_end_time#

The last build end time of the dataset as a datetime.datetime or None if there is no last build information.

Returns:

the last build end time

Return type:

datetime.datetime or None

property is_last_build_successful#

Get whether the last build of the dataset is successful.

Returns:

True if the last build is successful

Return type:

bool
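
Since both build times are datetime.datetime values or None, the duration of the last build can be derived by subtraction. A minimal sketch, with last_build_duration as a hypothetical helper:

```python
def last_build_duration(info):
    """Return the duration of the last build as a datetime.timedelta.

    `info` is expected to expose last_build_start_time and
    last_build_end_time, as DSSDatasetInfo does. Returns None when either
    time is missing.
    """
    start = info.last_build_start_time
    end = info.last_build_end_time
    if start is None or end is None:
        return None
    return end - start

# Usage (assumes `project` is a DSSProject handle):
#   last_build_duration(project.get_dataset("my_dataset").get_info())
```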

Creation of managed datasets#

class dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper(project, dataset_name)#

Provides a helper to create a managed dataset.

import dataiku

client = dataiku.api_client()
project_key = dataiku.default_project_key()
project = client.get_project(project_key)

# create the dataset
builder = project.new_managed_dataset("py_generated")
builder.with_store_into("filesystem_folders")
dataset = builder.create(overwrite=True)

# set up format & schema settings
ds_settings = dataset.get_settings()
ds_settings.set_csv_format()
ds_settings.add_raw_schema_column({'name':'id', 'type':'int'})
ds_settings.add_raw_schema_column({'name':'name', 'type':'string'})
ds_settings.save()

# write some data
data = ["foo", "bar"]
with dataset.get_as_core_dataset().get_writer() as writer:
    for idx, val in enumerate(data):
        # write_tuple() replaces the deprecated write_row_array()
        writer.write_tuple((idx, val))

Caution

Do not instantiate directly, use dataikuapi.dss.project.DSSProject.new_managed_dataset()

get_creation_settings()#

Get the dataset creation settings as a dict.

Return type:

dict

with_store_into(connection, type_option_id=None, format_option_id=None)#

Sets the connection into which to store the new managed dataset

Parameters:
  • connection (str) – Name of the connection to store into

  • type_option_id (str) – If the connection accepts several types of datasets, the type

  • format_option_id (str) – Optional identifier of a file format option

Returns:

self

with_copy_partitioning_from(dataset_ref, object_type='DATASET')#

Sets the new managed dataset to use the same partitioning as an existing dataset

Parameters:
  • dataset_ref (str) – Name of the dataset to copy partitioning from

  • object_type (str) – Type of the object to copy partitioning from, values can be DATASET or FOLDER

Returns:

self

create(overwrite=False)#

Executes the creation of the managed dataset according to the selected options

Parameters:

overwrite (bool, optional) – If the dataset being created already exists, delete it first (removing data), defaults to False

Returns:

the newly created dataset

Return type:

DSSDataset

already_exists()#

Check if dataset already exists.

Returns:

whether this managed dataset already exists

Return type:

bool
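
already_exists() can be used to avoid overwriting an existing dataset. The sketch below returns the existing dataset when there is one and creates it otherwise; get_or_create_managed_dataset is a hypothetical helper and the names in the usage comment are assumptions.

```python
def get_or_create_managed_dataset(project, name, connection):
    """Return the dataset called `name`, creating it as managed if needed.

    `project` is expected to be a DSSProject handle.
    """
    builder = project.new_managed_dataset(name)
    if builder.already_exists():
        return project.get_dataset(name)  # keep the existing data untouched
    builder.with_store_into(connection)
    return builder.create()               # overwrite stays at its default, False

# Usage (assumes `project` is a DSSProject handle and the connection exists):
#   get_or_create_managed_dataset(project, "py_generated", "filesystem_folders")
```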