Datasets#
Please see Datasets for an introduction to interacting with datasets using the Dataiku Python API.
The dataiku.Dataset class#
- class dataiku.Dataset(name, project_key=None, ignore_flow=False)#
Provides a handle to obtain readers and writers on a dataiku Dataset. From this Dataset class, you can:
Read a dataset as a Pandas dataframe
Read a dataset as a chunked Pandas dataframe
Read a dataset row-by-row
Write a pandas dataframe to a dataset
Write a series of chunked Pandas dataframes to a dataset
Write to a dataset row-by-row
Edit the schema of a dataset
- Parameters:
name (str) – The name of the dataset.
project_key (str) – The key of the project in which the dataset is located (current project key if none is specified)
ignore_flow (boolean) – only relevant in recipes, not in notebooks or in code for metrics or scenario steps. In a recipe, when left to False, DSS also checks that the dataset is one of the inputs or outputs of the recipe and raises an error if it is not. Defaults to False
- Returns:
a handle to interact with the dataset
- Return type:
dataiku.Dataset
- static list(project_key=None)#
List the names of datasets of a given project.
Usage example:
    import dataiku

    # current project datasets
    current_project_datasets = dataiku.Dataset.list()

    # given project datasets
    my_project_datasets = dataiku.Dataset.list("my_project")
- Parameters:
project_key (str) – the optional key of the project to retrieve the datasets from, defaults to current project
- Returns:
a list of dataset names
- Return type:
list[str]
- property full_name#
Get the fully-qualified identifier of the dataset on the DSS instance.
- Returns:
a fully qualified identifier for the dataset in the form “project_key.dataset_name”
- Return type:
str
- get_location_info(sensitive_info=False)#
Retrieve the location information of the dataset.
Usage example
    import re
    from io import StringIO

    import boto3
    import dataiku

    # save a dataframe to csv with fixed name to S3
    # (df is assumed to be an existing pandas dataframe)
    dataset = dataiku.Dataset("my_target_dataset")
    location_info = dataset.get_location_info(True)
    s3_folder = location_info["info"]["path"]  # get URI of the dataset

    # extract the bucket from the URI
    s3_bucket = re.match("^s3://([^/]+)/.*$", s3_folder).group(1)
    # extract path inside bucket
    s3_path_in_bucket = re.match("^s3://[^/]+/(.*)$", s3_folder).group(1)

    # save to S3 using boto
    csv_buffer = StringIO()
    df.to_csv(csv_buffer)
    s3_resource = boto3.resource('s3')
    s3_resource.Object(s3_bucket, s3_path_in_bucket + '/myfile.csv').put(Body=csv_buffer.getvalue())
- Parameters:
sensitive_info (boolean) – whether to provide sensitive information such as passwords. This is only honored if the user is allowed to read the details of the connection on which this dataset is defined
- Returns:
a dict with the location info. Notable fields:
locationInfoType: type of location. Possible values are ‘FS’, ‘HDFS’, ‘UPLOADED’, ‘SQL’
info: a dict whose structure depends on the type of connection. Notable fields of info:
connectionName: connection name, if any
connectionParams: parameters of the connection on which the dataset is defined, as a dict, if any. The actual fields depend on the connection type. For an S3 dataset, this will for example contain the bucket and credentials.
path: the URI of the dataset, if any
- Return type:
dict
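The usage example above extracts the bucket and the in-bucket path with two regular expressions. As a minimal sketch, the same split can be done with the standard library. `split_s3_uri` is a hypothetical helper name, not part of the Dataiku API, and the sketch assumes that the path field holds a URI of the form s3://bucket/path, as in that example.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper, not part of the Dataiku API.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    # drop the leading slash so the key can be joined with a file name
    return parsed.netloc, parsed.path.lstrip("/")


# illustrative URI, standing in for location_info["info"]["path"]
bucket, key = split_s3_uri("s3://my-bucket/dataiku/MYPROJECT/my_target_dataset")
print(bucket, key)
```

Raising on anything that is not an S3 URI makes the helper fail early when the dataset lives on another connection type.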
- get_files_info(partitions=[])#
Get information on the files of the dataset, with details per partition.
- Parameters:
partitions (list[str], optional) – list of partition identifiers, defaults to all partitions
- Returns:
information on the files, both global and per partition, as a dict with fields:
globalPaths: list of the files of the dataset, each a dict with fields:
path: file path
lastModified: timestamp of last file update, in milliseconds
size: size of the file, in bytes
pathsByPartition: files grouped per partition, as a dict of partition identifier to list of files (same structure as globalPaths)
- Return type:
dict
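As an illustration of the returned structure, the sketch below sums the file sizes of each partition. `size_by_partition` is a hypothetical helper, and the sample dict is hand-written to mimic the documented fields; its values are made up.

```python
def size_by_partition(files_info):
    """Sum file sizes (in bytes) per partition from a get_files_info() result."""
    return {
        partition: sum(f["size"] for f in files)
        for partition, files in files_info.get("pathsByPartition", {}).items()
    }


# hand-written sample following the documented structure (values are made up)
sample = {
    "globalPaths": [
        {"path": "/2024-01-01/part-0.csv", "lastModified": 1704067200000, "size": 120},
        {"path": "/2024-01-02/part-0.csv", "lastModified": 1704153600000, "size": 80},
        {"path": "/2024-01-02/part-1.csv", "lastModified": 1704153600000, "size": 40},
    ],
    "pathsByPartition": {
        "2024-01-01": [
            {"path": "/2024-01-01/part-0.csv", "lastModified": 1704067200000, "size": 120},
        ],
        "2024-01-02": [
            {"path": "/2024-01-02/part-0.csv", "lastModified": 1704153600000, "size": 80},
            {"path": "/2024-01-02/part-1.csv", "lastModified": 1704153600000, "size": 40},
        ],
    },
}
sizes = size_by_partition(sample)
print(sizes)
```

With a real handle, the same function would be called as size_by_partition(dataset.get_files_info()).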
- set_write_partition(spec)#
Set which partition of the dataset gets written to when you create a DatasetWriter.
Caution
Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.
- Parameters:
spec (string) – partition identifier
- add_read_partitions(spec)#
Add a partition or range of partitions to read.
Caution
You cannot manually add read partitions when running inside a Python recipe. They are automatically computed according to the partition dependencies defined on the recipe’s Input/Output tab.
- Parameters:
spec (string) – partition spec, or partition identifier
- read_schema(raise_if_empty=True)#
Get the schema of this dataset, as a list of column definitions.
- Parameters:
raise_if_empty (bool, optional) – raise an exception if there is no column, defaults to True
- Returns:
list of column definitions
- Return type:
- list_partitions(raise_if_empty=True)#
List the partitions of this dataset, as an array of partition identifiers.
Usage example
    import dataiku
    from dataiku.scenario import Scenario

    # build a list of partitions for use in a build/train step in a scenario
    dataset = dataiku.Dataset("some_input_dataset")
    partitions = dataset.list_partitions()
    variable_value = ','.join(partitions)

    # set as a variable, to use in steps after this one
    Scenario().set_scenario_variables(some_variable_name=variable_value)
- Parameters:
raise_if_empty (bool, optional) – raise an exception if there is no partition, defaults to True
- Returns:
list of partition identifiers
- Return type:
list[string]
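A common variation on the example above is to keep only the most recent partitions. The sketch below is one way to do it, assuming that partition identifiers sort chronologically as strings (for instance day identifiers such as 2024-01-31). `latest_partitions_spec` is a hypothetical helper that accepts any object exposing list_partitions().

```python
def latest_partitions_spec(dataset, count):
    """Return the `count` last partition identifiers, comma-separated.

    Assumes identifiers sort chronologically as strings (e.g. "2024-01-31").
    """
    if count <= 0:
        return ""
    # raise_if_empty=False returns an empty list instead of raising
    partitions = sorted(dataset.list_partitions(raise_if_empty=False))
    return ",".join(partitions[-count:])


# with a real handle (requires a DSS environment):
# spec = latest_partitions_spec(dataiku.Dataset("some_input_dataset"), 7)
```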
- set_preparation_steps(steps, requested_output_schema, context_project_key=None)#
Set preparation steps.
Caution
for internal use
- Parameters:
steps (list) – list of steps
requested_output_schema (dict) – output schema with a key columns containing a list of column definitions (name, type, …)
context_project_key (string, optional) – context project key, defaults to None
- get_fast_path_dataframe(auto_fallback=False, columns=None, pandas_read_kwargs=None, print_deep_memory_usage=True)#
Reads the dataset as a Pandas dataframe, using fast-path access (without going through DSS), if possible.
Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.
The fast path method provides better performance than the usual get_dataframe() method, but is only compatible with some dataset types and formats.
Fast path requires the “permission details readable” to be granted on the connection.
Dataframes obtained using this method may differ from those using get_dataframe(), notably around schemas and data. get_dataframe() provides a unified API with the same schema and data for all connections. On the other hand, this method uses dataset-specific access patterns that may yield different results.
At the moment, this fast path is available for:
S3 datasets using Parquet. This requires the additional s3fs package, as well as fastparquet or pyarrow
Snowflake datasets. This requires the additional snowflake-connector-python[pandas] package
- Parameters:
columns (list) – List of columns to read, or None for all columns
auto_fallback (boolean) – If fast path is impossible and auto_fallback is True, then a regular get_dataframe() call will be used. If auto_fallback is False, this method will fail
print_deep_memory_usage (bool) – After reading the dataframe, Dataiku prints the memory usage of the dataframe. When this is enabled, this will provide the accurate memory usage, including for string columns. This can have a small performance impact. Defaults to True
pandas_read_kwargs (dict) – For the case where the read is mediated by a call to pd.read_parquet, arguments to pass to the read_parquet function
- get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, ascending=True, infer_with_pandas=True, parse_dates=True, bool_as_str=False, int_as_float=False, use_nullable_integers=False, categoricals=None, float_precision=None, na_values=None, keep_default_na=True, print_deep_memory_usage=True, skip_additional_data_checks=False, date_parser=None, override_dtypes=None, pandas_read_kwargs=None)#
Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.
Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.
    import dataiku

    # read some dataset and print its shape
    dataset = dataiku.Dataset("the_dataset_name")
    df = dataset.get_dataframe()
    print("Number of rows: %s" % df.shape[0])
    print("Number of columns: %s" % df.shape[1])
- Parameters:
columns (list) – when not None, returns only columns from the given list. defaults to None
limit (integer) – limits the number of rows returned, defaults to None
sampling – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string) – column used for “random-column” and “sort-column” sampling, defaults to None
ratio (float) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
ascending (boolean) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True
infer_with_pandas (bool) – uses the types detected by pandas rather than the dataset schema as detected in DSS, defaults to True
parse_dates (bool) – Only used when infer_with_pandas is False. Parses date column in DSS schema. Defaults to True
bool_as_str (bool) – Only used when infer_with_pandas is False. Leaves boolean values as string. Defaults to False
int_as_float (bool) – Only used when infer_with_pandas is False. Leaves int values as floats. Defaults to False
use_nullable_integers (bool) – Only used when infer_with_pandas is False. Use pandas nullable integer types, which allows missing values in integer columns
categoricals – Only used when infer_with_pandas is False. What columns to read as categoricals. This is particularly efficient for columns with low cardinality. Can be either “all_strings” to read all string columns as categorical, or a list of column names to read as categoricals
float_precision (string) – set Pandas converter, can be None, ‘high’, ‘legacy’ or ‘round_trip’, defaults to None. see Pandas.read_table documentation for more information
na_values (string/list/dict) – additional strings to recognize as NA/NaN, defaults to None. see Pandas.read_table documentation for more information
keep_default_na (bool) – whether or not to include the default NaN values when parsing the data, defaults to True. see Pandas.read_table documentation for more information
date_parser (function) – function to use for converting a sequence of string columns to an array of datetime instances, defaults to None. see Pandas.read_table documentation for more information
skip_additional_data_checks (bool) – Skip some data type checks. Enabling this can lead to strongly increased performance (up to x3). It is usually safe to enable this. Defaults to False
print_deep_memory_usage (bool) – After reading the dataframe, Dataiku prints the memory usage of the dataframe. When this is enabled, this will provide the accurate memory usage, including for string columns. This can have a small performance impact. Defaults to True
override_dtypes (dict) – If not None, overrides dtypes computed from schema. Defaults to None
pandas_read_kwargs (dict) – If not None, additional kwargs passed to pd.read_table. Defaults to None
- Returns:
a Pandas dataframe object
- Return type:
pandas.core.frame.DataFrame
- to_html(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, apply_conditional_formatting=True, header=True, classes='', border=0, null_string='', indent_string=None, filter_expression=None)#
Render the dataset as an html table.
HTML tables are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this, or pass a value to the limit parameter.
    import dataiku

    # read some dataset and render its first 50 rows as HTML
    dataset = dataiku.Dataset("the_dataset_name")
    html = dataset.to_html(limit=50)
- Parameters:
columns (list[str]) – when not None, returns only columns from the given list. Defaults to None
sampling – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string) – column used for “random-column” and “sort-column” sampling, defaults to None
limit (integer) – limits the number of rows returned, defaults to None
ratio (float) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
apply_conditional_formatting (bool) – true to apply conditional formatting as it has been defined in DSS Explore view
header (bool) – Whether to print column labels, default True.
classes (str or list[str]) – Name of the CSS class attached to TABLE tag in the generated HTML (or multiple classes as a list).
border (int) – A border attribute of the specified size is included in the opening <table> tag. Defaults to 0
null_string (str) – string to represent null values. Defaults to an empty string.
indent_string (str) – characters to use to indent the formatted HTML. If None or empty string, no indentation and no carriage return line feed. Defaults to None
filter_expression (str) – expression used to filter data using formula language, defaults to None. Not supported on datasets with preparation steps.
- Returns:
an HTML representation of the dataset
- Return type:
str
- static get_dataframe_schema_st(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False, int_as_float=False, use_nullable_integers=False, categoricals=None)#
Extract information for Pandas from a schema.
See get_dataframe() for explanation of the other parameters
- Parameters:
schema (list[dict]) – a schema definition as returned by read_schema()
- Returns:
a list of 3 items:
a list of column names
a dict of column NumPy data types, keyed by column name
a list of the indexes of the date columns, or False
- Return type:
tuple[list,dict,list]
- iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None, na_values=None, keep_default_na=True, date_parser=None, ascending=True, pandas_read_kwargs=None)#
Read the dataset to Pandas dataframes by chunks of fixed size with given data types.
    import dataiku

    dataset = dataiku.Dataset("my_dataset")
    [names, dtypes, parse_date_columns] = dataiku.Dataset.get_dataframe_schema_st(dataset.read_schema())
    chunk = 0
    chunksize = 1000
    headsize = 5
    for df in dataset.iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=chunksize):
        print("> chunk #", chunk, "- first", headsize, "rows of", df.shape[0])
        chunk += 1
        print(df.head(headsize))
- Parameters:
names (list[string]) – list of column names
dtypes (dict) – dict of data types, keyed by column name
parse_date_columns (list) – a list of the indexes of the date columns, or False
chunksize (int, optional) – chunk size, defaults to 10000
limit (integer) – limits the number of rows returned, defaults to None
sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None
ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
float_precision (string, optional) – set Pandas converter, can be None, ‘high’, ‘legacy’ or ‘round_trip’, defaults to None. see Pandas.read_table documentation for more information
na_values (string/list/dict, optional) – additional strings to recognize as NA/NaN, defaults to None. see Pandas.read_table documentation for more information
keep_default_na (bool, optional) – whether or not to include the default NaN values when parsing the data, defaults to True. see Pandas.read_table documentation for more information
date_parser (function, optional) – function to use for converting a sequence of string columns to an array of datetime instances, defaults to None. see Pandas.read_table documentation for more information
ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True
pandas_read_kwargs (dict) – If not None, additional kwargs passed to pd.read_table. Defaults to None
- Yield:
pandas.core.frame.DataFrame
- Return type:
generator
- iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, int_as_float=False, use_nullable_integers=False, categoricals=None, float_precision=None, na_values=None, keep_default_na=True, ascending=True, pandas_read_kwargs=None)#
Read the dataset to Pandas dataframes by chunks of fixed size.
Tip
Useful if the dataset doesn’t fit in RAM
    import dataiku

    dataset = dataiku.Dataset("my_dataset")
    headsize = 5
    for df in dataset.iter_dataframes(chunksize=5000):
        print("> chunk of", df.shape[0], "rows")
        print(df.head(headsize))
- Parameters:
chunksize (int, optional) – chunk size, defaults to 10000
infer_with_pandas (bool, optional) – use the types detected by pandas rather than the dataset schema as detected in DSS, defaults to True
limit (int, optional) – limits the number of rows returned, defaults to None
sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None
parse_dates (bool, optional) – whether date columns in the DSS dataset schema are parsed, defaults to True
ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
columns (list[str], optional) – specify the desired columns, defaults to None (all columns)
bool_as_str (bool, optional) – Only used when infer_with_pandas is False. Leaves boolean values as strings, defaults to False
int_as_float (bool, optional) – Only used when infer_with_pandas is False. Leaves int values as floats. Defaults to False
use_nullable_integers (bool, optional) – Only used when infer_with_pandas is False. Use pandas nullable integer types, which allows missing values in integer columns. Defaults to False
categoricals (string/list, optional) – Only used when infer_with_pandas is False. What columns to read as categoricals. This is particularly efficient for columns with low cardinality. Can be either “all_strings” to read all string columns as categorical, or a list of column names to read as categoricals
float_precision (string, optional) – set Pandas converter, can be None, ‘high’, ‘legacy’ or ‘round_trip’, defaults to None. see Pandas.read_table documentation for more information
na_values (string/list/dict, optional) – additional strings to recognize as NA/NaN, defaults to None. see Pandas.read_table documentation for more information
keep_default_na (bool, optional) – whether or not to include the default NaN values when parsing the data, defaults to True. see Pandas.read_table documentation for more information
ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True
pandas_read_kwargs (dict) – If not None, additional kwargs passed to pd.read_table. Defaults to None
- Yield:
pandas.core.frame.DataFrame
- Return type:
generator
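Chunked reading is typically combined with an aggregation that only needs one chunk at a time. A minimal sketch, assuming a handle that exposes iter_dataframes() as documented above; `count_rows_by_chunk` is a hypothetical helper name.

```python
def count_rows_by_chunk(dataset, chunksize=5000):
    """Count the rows of a dataset without loading it entirely in memory."""
    total = 0
    for df in dataset.iter_dataframes(chunksize=chunksize):
        # only the current chunk is held in memory
        total += df.shape[0]
    return total


# with a real handle (requires a DSS environment):
# print(count_rows_by_chunk(dataiku.Dataset("my_dataset")))
```

The same loop shape applies to any reduction that can be updated chunk by chunk, such as sums or value counts.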
- write_with_schema(df, drop_and_create=False, **kwargs)#
Write a pandas dataframe to this dataset (or its target partition, if applicable).
This variant replaces the schema of the output dataset with the schema of the dataframe.
Caution
strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
Note
the dataset must be writable, ie declared as an output, except if you instantiated the dataset with ignore_flow=True
    import dataiku
    from dataiku import recipe

    # simply copy the first recipe input dataset
    # to the first recipe output dataset, with the schema
    ds_input = recipe.get_inputs()[0]
    df_input = ds_input.get_dataframe()
    ds_output = recipe.get_outputs()[0]
    ds_output.write_with_schema(df_input, True)
- Parameters:
df (pandas.core.frame.DataFrame) – a pandas dataframe
drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False
dropAndCreate (bool, optional) – deprecated, use drop_and_create
- write_dataframe(df, infer_schema=False, drop_and_create=False, **kwargs)#
Write a pandas dataframe to this dataset (or its target partition, if applicable).
This variant only edits the schema if infer_schema is True, otherwise you need to only write dataframes that have a compatible schema. Also see write_with_schema().
Caution
strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
Note
the dataset must be writable, ie declared as an output, except if you instantiated the dataset with ignore_flow=True
- Parameters:
df (pandas.core.frame.DataFrame) – a pandas dataframe
infer_schema (bool, optional) – whether to infer the schema from the dataframe, defaults to False
drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False
dropAndCreate (bool, optional) – deprecated, use drop_and_create
- iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None, ascending=True)#
Get a generator of rows (as a dict-like object) in the data (or its selected partitions, if applicable).
Values are cast according to their types. Strings are parsed into “unicode” values.
- Parameters:
limit (int, optional) – limits the number of rows returned, defaults to None
sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None
ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
log_every (int, optional) – print out the number of rows read on stdout, defaults to -1 (no log)
timeout (int, optional) – set a timeout in seconds, defaults to 30
columns (list[str], optional) – specify the desired columns, defaults to None (all columns)
ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True
- Yield:
- Return type:
generator
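Because rows are dict-like, a row-by-row pass can feed a running aggregate without building a dataframe. The sketch below counts the values of one column. `count_values` is a hypothetical helper, and it assumes that each row can be indexed by column name as described above.

```python
from collections import Counter


def count_values(dataset, column, limit=None):
    """Count occurrences of each value of `column`, reading row by row."""
    counts = Counter()
    # restrict the read to the single column that is needed
    for row in dataset.iter_rows(columns=[column], limit=limit):
        counts[row[column]] += 1
    return counts


# with a real handle (requires a DSS environment):
# print(count_values(dataiku.Dataset("my_dataset"), "country").most_common(5))
```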
- raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None, read_session_id=None, filter_expression=None)#
Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported DSS output format.
Caution
You MUST close the file handle. Failure to do so will result in resource leaks.
After closing, you can also call verify_read() to check for any errors that occurred while reading the dataset data.
    import uuid

    import dataiku
    from dataiku.core.dataset import create_sampling_argument

    dataset = dataiku.Dataset("customers_partitioned")
    read_session_id = str(uuid.uuid4())
    sampling = create_sampling_argument(sampling='head', limit=5)
    resp = dataset.raw_formatted_data(sampling=sampling, format="json", read_session_id=read_session_id)
    print(resp.data)
    resp.close()
    dataset.verify_read(read_session_id)  # throws an exception if the read hasn't been fully completed
    print("read completed successfully")
- Parameters:
sampling (dict, optional) – a dict of sampling specs, see dataiku.core.dataset.create_sampling_argument(), defaults to None
columns (list[str], optional) – list of desired columns, defaults to None (all columns)
format (str, optional) – output format, defaults to “tsv-excel-noheader”. Supported formats are: “json”, “tsv-excel-header” (tab-separated with header) and “tsv-excel-noheader” (tab-separated without header)
format_params (dict, optional) – dict of output format parameters, defaults to None
read_session_id (str, optional) – identifier of the read session, used to check at the end if the read was successful, defaults to None
filter_expression (str, optional) – expression used to filter data using formula language, defaults to None
- Returns:
an HTTP response
- Return type:
urllib3.response.HTTPResponse
- verify_read(read_session_id)#
Verify that no error occurred when using raw_formatted_data() to read a dataset.
Use the same read_session_id that you passed to the call to raw_formatted_data().
- Parameters:
read_session_id (str) – identifier of the read session
- Raises:
Exception – if an error occurred during the read
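Since the handle returned by raw_formatted_data() must always be closed, one option is to wrap the read, close and verify steps in a single function. This is a sketch rather than a recommended pattern from Dataiku. `read_raw` is a hypothetical helper, and it relies on the response object offering read() and close(), as urllib3.response.HTTPResponse does.

```python
import uuid


def read_raw(dataset, fmt="json"):
    """Read the raw formatted bytes of a dataset, then verify the read."""
    read_session_id = str(uuid.uuid4())
    resp = dataset.raw_formatted_data(format=fmt, read_session_id=read_session_id)
    try:
        payload = resp.read()
    finally:
        # always release the handle, even if reading fails
        resp.close()
    # raises if the read was not fully completed
    dataset.verify_read(read_session_id)
    return payload
```

Calling verify_read() only after the handle is closed follows the order described for raw_formatted_data().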
- iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None, ascending=True)#
Get the rows of the dataset as tuples.
The order and type of the values match those of the dataset’s columns.
Values are cast according to their types. Strings are parsed into “unicode” values.
- Parameters:
limit (int, optional) – limits the number of rows returned, defaults to None
sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None
ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
log_every (int, optional) – print out the number of rows read on stdout, defaults to -1 (no log)
timeout (int, optional) – time (in seconds) of inactivity after which we want to close the generator if nothing has been read. Without it notebooks typically tend to leak “DKU” processes, defaults to 30
columns (list[str], optional) – list of desired columns, defaults to None (all columns)
ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True
- Yield:
a tuple of column values
- Return type:
generator
- get_writer()#
Get a stream writer for this dataset (or its target partition, if applicable).
Caution
The writer must be closed as soon as you don’t need it.
- Returns:
a stream writer
- Return type:
dataiku.core.dataset_write.DatasetWriter
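Because the writer must be closed, a try/finally block is a simple way to guarantee it. A minimal sketch, assuming the dataset schema is already set and that the writer exposes write_row_dict() and close(); `write_rows` is a hypothetical helper name.

```python
def write_rows(dataset, rows):
    """Write an iterable of dicts to the dataset and always close the writer."""
    writer = dataset.get_writer()
    written = 0
    try:
        for row in rows:
            writer.write_row_dict(row)
            written += 1
    finally:
        # an unclosed writer can leave the output dataset incomplete
        writer.close()
    return written


# with a real handle (requires a DSS environment):
# write_rows(dataiku.Dataset("output_dataset"), [{"data": "hello"}])
```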
- get_continuous_writer(source_id, split_id=0)#
Get a stream writer for this dataset (or its target partition, if applicable).
    dataset = dataiku.Dataset("wikipedia_dataset")
    dataset.write_schema([{"name":"data", "type":"string"}, ...])
    with dataset.get_continuous_writer(...) as writer:
        for msg in message_iterator:
            writer.write_row_dict({"data":msg.data, ...})
            writer.checkpoint("this_recipe", "some state")
- Parameters:
source_id (str) – identifier of the source of the stream
split_id (int, optional) – split id in the output (for concurrent usage), defaults to 0
- Returns:
a stream writer
- Return type:
dataiku.core.continuous_write.DatasetContinuousWriter
- write_schema(columns, drop_and_create=False, **kwargs)#
Write the dataset schema into the dataset JSON definition file.
Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset.
Caution
This overwrites the schema stored in the dataset definition, so use it with caution.
- Parameters:
columns (list) – see read_schema()
drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False
dropAndCreate (bool, optional) – deprecated, use drop_and_create
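The columns argument is a list of column definitions, each with at least a name and a type. As a sketch, such a list can be built from (name, type) pairs. `build_schema` is a hypothetical helper, and the column names and types below are illustrative only.

```python
def build_schema(columns):
    """Turn (name, type) pairs into a list of column definitions."""
    return [{"name": name, "type": col_type} for name, col_type in columns]


# illustrative columns; adapt names and types to the data being written
schema = build_schema([("id", "bigint"), ("label", "string")])
print(schema)

# with a real handle (requires a DSS environment):
# dataiku.Dataset("output_dataset").write_schema(schema)
```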
- write_schema_from_dataframe(df, drop_and_create=False, **kwargs)#
Set the schema of this dataset to the schema of a Pandas dataframe.
    import dataiku

    input_ds = dataiku.Dataset("input_dataset")
    my_input_dataframe = input_ds.get_dataframe()
    output_ds = dataiku.Dataset("output_dataset")

    # Set the schema of "output_ds" to match the columns of "my_input_dataframe"
    output_ds.write_schema_from_dataframe(my_input_dataframe)
- Parameters:
df (pandas.core.frame.DataFrame) – a Pandas dataframe
drop_and_create (bool, optional) – whether to drop and recreate the dataset, defaults to False
dropAndCreate (bool, optional) – deprecated, use drop_and_create
- read_metadata()#
Get the metadata attached to this dataset.
The metadata contains the label, description, checklists, tags and custom metadata of the dataset.
- Returns:
the metadata as a dict, with fields:
label : label of the object (not defined for recipes)
description : description of the object (not defined for recipes)
checklists : checklists of the object, as a dict with a checklists field, which is a list of checklists, each a dict of fields:
id : identifier of the checklist
title : label of the checklist
createdBy : user who created the checklist
createdOn : timestamp of creation, in milliseconds
items : list of the items in the checklist, each a dict of
done : True if the item has been done
text : label of the item
createdBy : who created the item
createdOn : when the item was created, as a timestamp in milliseconds
stateChangedBy : who ticked the item as done (or not done)
stateChangedOn : when the item was last changed to done (or not done), as a timestamp in milliseconds
tags : list of tags, each a string
custom : custom metadata, as a dict with a kv field, which is a dict with any contents the user wishes
customFields : dict of custom field info (not defined for recipes)
- Return type:
dict
- write_metadata(meta)#
Write the metadata to the dataset.
Note
You should pass metadata that you obtained via read_metadata() and then modified.
- Parameters:
meta (dict) – metadata specifications as dict, see read_metadata()
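The read, modify, write cycle described in the note can be sketched as follows. `add_tag` is a hypothetical helper. It assumes the metadata dict carries a tags list as documented for read_metadata(), and it skips the write when the tag is already present.

```python
def add_tag(dataset, tag):
    """Add `tag` to the dataset tags if missing; return True when a write happened."""
    meta = dataset.read_metadata()
    tags = meta.setdefault("tags", [])
    if tag in tags:
        return False
    tags.append(tag)
    # write back the full metadata dict that was read, with the new tag
    dataset.write_metadata(meta)
    return True


# with a real handle (requires a DSS environment):
# add_tag(dataiku.Dataset("my_dataset"), "reviewed")
```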
- get_config()#
Get the dataset config.
- Returns:
all dataset settings, many of which depend on the dataset type. The main settings keys are:
type: type of the dataset such as “PostgreSQL”, “Filesystem”, etc…
name: name of the dataset
projectKey: project hosting the dataset
schema: dataset schema as dict with ‘columns’ definition
partitioning: partitions settings as dict
managed: True if the dataset is managed
readWriteOptions: dict of read or write options
versionTag: version info as dict with ‘versionNumber’, ‘lastModifiedBy’, and ‘lastModifiedOn’
creationTag: creation info as dict with ‘versionNumber’, ‘lastModifiedBy’, and ‘lastModifiedOn’
tags: list of tags
metrics: dict with a list of probes, see Metrics and checks
- Return type:
dict
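As an example of navigating this dict, the sketch below lists the columns declared in the schema key. `schema_summary` is a hypothetical helper, and the sample config is hand-written with made-up values that follow the keys listed above.

```python
def schema_summary(config):
    """Return (column name, column type) pairs from a get_config() result."""
    columns = config.get("schema", {}).get("columns", [])
    return [(col["name"], col.get("type")) for col in columns]


# hand-written sample following the documented keys (values are made up)
sample_config = {
    "type": "Filesystem",
    "name": "customers",
    "projectKey": "MYPROJECT",
    "managed": True,
    "schema": {
        "columns": [
            {"name": "id", "type": "bigint"},
            {"name": "email", "type": "string"},
        ]
    },
}
summary = schema_summary(sample_config)
print(summary)
```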
- get_last_metric_values(partition='')#
Get the set of last values of the metrics on this dataset.
- Parameters:
partition (string) – (optional), the partition for which to fetch the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL
- Return type:
- get_metric_history(metric_lookup, partition='')#
Get the set of all values a given metric took on this dataset.
- Parameters:
metric_lookup (string) – metric name or unique identifier
partition (string) – (optional), the partition for which to fetch the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL
- Return type:
dict
- save_external_metric_values(values_dict, partition='')#
Save metrics on this dataset.
The metrics are saved with the type “external”.
- Parameters:
values_dict (dict) – the values to save, as a dict. The keys of the dict are used as metric names
partition (string) – (optional), the partition for which to save the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL
- Return type:
dict
- get_last_check_values(partition='')#
Get the set of last values of the checks on this dataset.
- Parameters:
partition (string) – (optional), the partition for which to fetch the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL
- Return type:
- save_external_check_values(values_dict, partition='')#
Save checks on this dataset.
The checks are saved with the type “external”
- Parameters:
values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names
partition (string) – (optional), the partition for which to save the values. On partitioned datasets, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL
- Return type:
dict
- dataset.create_sampling_argument(sampling_column=None, limit=None, ratio=None, ascending=True)#
Generate sampling parameters. Please see https://doc.dataiku.com/dss/latest/explore/sampling.html#sampling-methods for more information.
- Parameters:
sampling (str, optional) – sampling method, see dataiku.core.dataset.create_sampling_argument(). Defaults to ‘head’.
sampling_column (string, optional) – select the column used for “random-column” and “sort-column” sampling, defaults to None
limit (int, optional) – set the sampling max rows count, defaults to None
ratio (float, optional) – define the max row count as a ratio (between 0 and 1) of the dataset’s total row count
ascending (boolean, optional) – sort in ascending order the selected column of the “sort-column” sampling, defaults to True
- Returns:
sampling parameters
- Return type:
dict
- class dataiku.core.dataset_write.DatasetWriter(dataset)#
Handle to write to a dataset.
Important
Do not instantiate directly, use dataiku.Dataset.get_writer() instead.
Attention
An instance of DatasetWriter MUST be closed after usage. Failure to close it will lead to incomplete or no data being written to the output dataset.
- active_writers = {}#
- static atexit_handler()#
- write_tuple(row)#
Write a single row from a tuple or list of column values.
Columns must be given in the order of the dataset schema. Strings MUST be given as Unicode object. Giving str objects will fail.
Note
The schema of the dataset MUST be set before using this.
- Parameters:
row (list) – a list of values, one per column in the output schema. Columns cannot be omitted.
- write_row_array(row)#
Write a single row from an array.
Caution
Deprecated, use write_tuple()
- write_row_dict(row_dict)#
Write a single row from a dict of column name -> column value.
Some columns can be omitted, empty values will be inserted instead. Strings MUST be given as Unicode object. Giving str objects will fail.
Note
The schema of the dataset MUST be set before using this.
- Parameters:
row_dict (dict) – a dict of column name to column value
- write_dataframe(df)#
Append a Pandas dataframe to the dataset being written.
This method can be called multiple times (especially when you have been using dataiku.Dataset.iter_dataframes() to read from an input dataset).
Strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
- Parameters:
df (DataFrame) – a Pandas dataframe
- close()#
Close this dataset writer.
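Because an unclosed writer can leave the output dataset incomplete, one possible pattern is to close it in a finally block. The sketch below assumes `dataset` is a dataiku.Dataset whose schema is already set; the helper name write_rows is hypothetical.

```python
def write_rows(dataset, rows):
    """Write an iterable of row tuples to `dataset`, always closing the writer.

    Returns the number of rows handed to the writer.
    """
    writer = dataset.get_writer()
    written = 0
    try:
        for row in rows:
            # One value per schema column, in schema order.
            writer.write_tuple(row)
            written += 1
    finally:
        # Closing is mandatory, even when a write raised.
        writer.close()
    return written
```

The try/finally form only relies on close(), which is documented above, and makes the mandatory close explicit.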
- class dataiku.core.dataset.Schema(data)#
List of the definitions of the columns in a dataset.
Each column definition is a dict with at least a name field and a type field. Available column types include:

| type | note | sample value |
| --- | --- | --- |
| string | | b'foobar' |
| bigint | 64 bits | 9223372036854775807 |
| int | 32 bits | 2147483647 |
| smallint | 16 bits | 32767 |
| tinyint | 8 bits | 127 |
| double | 64 bits | 3.1415 |
| float | 32 bits | 3.14 |
| boolean | 32 bits | true |
| date | string | "2020-12-31T00:00:00.101Z" |
| array | json string | '["foo","bar"]' |
| map | json string | '{"foo":"bar"}' |
| object | json string | '{"foo":{"bar":[1,2,3]}}' |
| geopoint | string | "POINT(12 24)" |
| geometry | string | "POLYGON((1.1 1.1, 2.1 0, 0.1 0))" |

Each column definition has fields:
name: name of the column as string
type: type of the column as string
maxLength: maximum length of values (when applicable, typically for string)
comment: user comment on the column
timestampNoTzAsDate: for columns of type “date” in non-managed datasets, whether the actual type in the underlying SQL database or file bears timezone information
originalType and originalSQLType: for columns in non-managed datasets, the name of the column type in the underlying SQL database or file
arrayContent: for array-typed columns, a column definition that applies to all elements in the array
mapKeys and mapValues: for map-typed columns, a column definition that applies to all keys (resp. values) in the map
objectFields: for object-typed columns, a list of column definitions for the sub-fields in the object
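To make the shape of a column definition concrete, the following sketch builds a couple of definitions as plain dicts and checks that each carries the two required fields. The helper names, column names and field values are illustrative only.

```python
def make_column(name, col_type, **extra):
    """Build a column definition dict; `extra` holds optional fields
    such as maxLength or comment."""
    column = {"name": name, "type": col_type}
    column.update(extra)
    return column


def check_columns(columns):
    """Return the column names, raising if a definition lacks 'name' or 'type'."""
    for column in columns:
        missing = [key for key in ("name", "type") if key not in column]
        if missing:
            raise ValueError("column definition %r lacks %s" % (column, missing))
    return [column["name"] for column in columns]


# Hypothetical columns, for illustration only.
example_columns = [
    make_column("id", "bigint"),
    make_column("label", "string", maxLength=200, comment="display name"),
]
```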
- class dataiku.core.dataset.DatasetCursor(val, col_names, col_idx)#
A dataset cursor iterating on the rows.
Caution
Do not instantiate it manually, see dataiku.Dataset.iter_rows()
- column_id(name)#
Get a column index from its name.
- Parameters:
name (str) – column name
- Returns:
the column index
- Return type:
int
- keys()#
Get the set of column names.
- Returns:
list of column names
- Return type:
list[str]
- items()#
Get the full row.
- Returns:
a list of (column, value) tuples
- Return type:
list[tuple]
- values()#
Get values in the row.
- Returns:
list of column values
- Return type:
list
- get(col_name, default_value=None)#
Get a value by its column name.
- Parameters:
col_name (str) – a column name
default_value (str, optional) – value to return if the column is not present, defaults to None
- Returns:
the value of the column
- Return type:
depends on the column’s type
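The cursor exposes a dict-like read interface (keys(), items(), values(), get()). A small sketch of how rows might be consumed through it follows; the helper names are hypothetical, and `rows` is assumed to be what dataiku.Dataset.iter_rows() yields.

```python
def collect_column(rows, col_name, default=None):
    """Collect one column from an iterable of cursor-like rows.

    Rows without the column contribute `default`.
    """
    return [row.get(col_name, default) for row in rows]


def row_to_dict(row):
    """Copy a cursor-like row into a plain dict of column name -> value."""
    return dict(row.items())


# Inside DSS this might be used as (hypothetical dataset and column names):
#   import dataiku
#   rows = dataiku.Dataset("my_dataset").iter_rows()
#   countries = collect_column(rows, "country", default="unknown")
```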
The dataikuapi.dss.dataset package#
Main DSSDataset class#
- class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)#
A dataset on the DSS instance. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.get_dataset()
- property id#
Get the dataset identifier.
- Return type:
string
- property name#
Get the dataset name.
- Return type:
string
- delete(drop_data=False)#
Delete the dataset.
- Parameters:
drop_data (bool) – Should the data of the dataset be dropped, defaults to False
- rename(new_name)#
Rename the dataset with the new specified name
- Parameters:
new_name (str) – the new name of the dataset
- get_settings()#
Get the settings of this dataset as a DSSDatasetSettings, or one of its subclasses.
Known subclasses of DSSDatasetSettings include FSLikeDatasetSettings and SQLDatasetSettings.
You must use save() on the returned object to make your changes effective on the dataset.

```python
# Example: activating discrete partitioning on a SQL dataset
dataset = project.get_dataset("my_database_table")
settings = dataset.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()
```

- Return type:
- get_definition()#
Get the raw settings of the dataset as a dict.
Caution
Deprecated. Use get_settings()
- Return type:
dict
- set_definition(definition)#
Set the definition of the dataset
Caution
Deprecated. Use get_settings() and DSSDatasetSettings.save()
- Parameters:
definition (dict) – the definition, as a dict. You should only set a definition object that has been retrieved using the get_definition() call.
- exists()#
Test if the dataset exists.
- Returns:
whether this dataset exists
- Return type:
bool
- get_schema()#
Get the dataset schema.
- Returns:
a dict object of the schema, with the list of columns.
- Return type:
dict
- set_schema(schema)#
Set the dataset schema.
- Parameters:
schema (dict) – the desired schema for the dataset, as a dict. All columns have to provide their name and type.
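A common read-modify-write use of get_schema() and set_schema() is appending a column. The sketch below assumes the schema dict stores its column list under a "columns" key; that key and the helper name are assumptions of this example rather than something stated above.

```python
def add_column_if_missing(dataset, name, col_type):
    """Append a column to the schema of `dataset` unless it already exists.

    `dataset` is expected to be a DSSDataset handle.
    Returns True when the schema was changed.
    """
    schema = dataset.get_schema()
    # Assumption: the column definitions live under the "columns" key.
    columns = schema.setdefault("columns", [])
    if any(column["name"] == name for column in columns):
        return False
    # Every column has to provide its name and type.
    columns.append({"name": name, "type": col_type})
    dataset.set_schema(schema)
    return True
```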
- get_metadata()#
Get the metadata attached to this dataset. The metadata contains the label, description, checklists, tags and custom metadata of the dataset.
- Returns:
a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/latest/rest/
- Return type:
dict
- set_metadata(metadata)#
Set the metadata on this dataset.
- Parameters:
metadata (dict) – the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the
get_metadata()
call.
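Since set_metadata() should only receive an object obtained from get_metadata(), edits are typically made in place on the retrieved dict. A minimal sketch that adds a tag follows; the "tags" key is assumed to hold a list of strings and the helper name is hypothetical.

```python
def add_tag(dataset, tag):
    """Add `tag` to the metadata of `dataset` if it is not already present.

    `dataset` is expected to be a DSSDataset handle.
    Returns the resulting list of tags.
    """
    metadata = dataset.get_metadata()
    # Assumption: tags are stored as a list of strings under "tags".
    tags = metadata.setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)
        # Send back the same object that was retrieved, with the edit applied.
        dataset.set_metadata(metadata)
    return tags
```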
- iter_rows(partitions=None)#
Get the dataset data as a row-by-row iterator.
- Parameters:
partitions (Union[string, list[string]]) – (optional) partition identifier, or list of partitions to include, if applicable.
- Returns:
an iterator over the rows, each row being a list of values. The order of values in the list is the same as the order of columns in the schema returned by get_schema()
- Return type:
generator[list]
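Because each row is a bare list ordered like the schema, pairing it with the column names is often convenient. The following sketch does that; it assumes the schema dict lists its columns under a "columns" key, and the helper name is hypothetical.

```python
def iter_rows_as_dicts(dataset, partitions=None):
    """Yield each row of `dataset` as a dict of column name -> value.

    `dataset` is expected to be a DSSDataset handle.
    """
    # Assumption: get_schema() returns {"columns": [{"name": ...}, ...]}.
    names = [column["name"] for column in dataset.get_schema()["columns"]]
    for row in dataset.iter_rows(partitions=partitions):
        # Row values follow the schema order, so zip pairs them correctly.
        yield dict(zip(names, row))
```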
- list_partitions()#
Get the list of all partitions of this dataset.
- Returns:
the list of partitions, as a list of strings.
- Return type:
list[string]
- clear(partitions=None)#
Clear data in this dataset.
- Parameters:
partitions (Union[string, list[string]]) – (optional) partition identifier, or list of partitions to clear. When not provided, the entire dataset is cleared.
- Returns:
a dict containing the method call status.
- Return type:
dict
- copy_to(target, sync_schema=True, write_mode='OVERWRITE')#
Copy the data of this dataset to another dataset.
- Parameters:
target (dataikuapi.dss.dataset.DSSDataset) – an object representing the target of this copy.
sync_schema (bool) – (optional) update the target dataset schema to make it match the source dataset schema.
write_mode (string) – (optional) OVERWRITE (default) or APPEND. If OVERWRITE, the output dataset is cleared prior to writing the data.
- Returns:
a DSSFuture representing the operation.
- Return type:
- search_data_elastic(query_string, start=0, size=128, sort_columns=None, partitions=None)#
Caution
Only for datasets on Elasticsearch connections
Query the service with a search string to directly fetch data
- Parameters:
query_string (str) – Elasticsearch compatible query string
start (int) – row to start fetching the data
size (int) – number of results to return
sort_columns (list) – list of {“column”, “order”} dict, which is the order to fetch data. “order” is “asc” for ascending, “desc” for descending
partitions (list) – if the dataset is partitioned, a list of partition ids to search
- Returns:
a dict containing “columns”, “rows”, “warnings”, “found” (when start == 0)
- Return type:
dict
- build(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)#
Start a new job to build this dataset and wait for it to complete. Raises if the job failed.
```python
job = dataset.build()
print("Job %s done" % job.id)
```
- Parameters:
job_type – the job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
partitions – if the dataset is partitioned, a list of partition ids to build
wait (bool) – whether to wait for the job completion before returning the job handle, defaults to True
no_fail – if True, does not raise if the job failed.
- Returns:
the dataikuapi.dss.job.DSSJob job handle corresponding to the built job
- Return type:
- synchronize_hive_metastore()#
Synchronize this dataset with the Hive metastore
- update_from_hive()#
Resynchronize this dataset from its Hive definition
- compute_metrics(partition='', metric_ids=None, probes=None)#
Compute metrics on a partition of this dataset.
If neither metric ids nor custom probes set are specified, the metrics setup on the dataset are used.
- Parameters:
partition (string) – (optional) partition identifier, use ALL to compute metrics on all data.
metric_ids (list[string]) – (optional) ids of the metrics to build
- Returns:
a metric computation report, as a dict
- Return type:
dict
- run_checks(partition='', checks=None)#
Run checks on a partition of this dataset.
If the checks are not specified, the checks setup on the dataset are used.
Caution
Deprecated. Use dataikuapi.dss.data_quality.DSSDataQualityRuleSet.compute_rules() instead
- Parameters:
partition (str) – (optional) partition identifier, use ALL to run checks on all data.
checks (list[string]) – (optional) ids of the checks to run.
- Returns:
a checks computation report, as a dict.
- Return type:
dict
- uploaded_add_file(fp, filename)#
Add a file to an “uploaded files” dataset
- Parameters:
fp (file) – A file-like object that represents the file to upload
filename (str) – The filename for the file to upload
- uploaded_list_files()#
List the files in an “uploaded files” dataset.
- Returns:
uploaded files metadata as a list of dicts, with one dict per file.
- Return type:
list[dict]
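uploaded_add_file() takes any file-like object, so content held in memory can be uploaded without touching the disk. A sketch under that assumption, with a hypothetical helper name:

```python
import io


def upload_text(dataset, filename, text, encoding="utf-8"):
    """Upload in-memory text as a file of an "uploaded files" dataset.

    `dataset` is expected to be a DSSDataset handle on such a dataset.
    Returns the file listing after the upload.
    """
    # Wrap the encoded text in a binary file-like object.
    buffer = io.BytesIO(text.encode(encoding))
    dataset.uploaded_add_file(buffer, filename)
    return dataset.uploaded_list_files()
```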
- create_prediction_ml_task(target_variable, ml_backend_type='PY_MEMORY', guess_policy='DEFAULT', prediction_type=None, wait_guess_complete=True)#
Creates a new prediction task in a new visual analysis lab for a dataset.
- Parameters:
target_variable (str) – the variable to predict
ml_backend_type (str) – ML backend to use, one of PY_MEMORY, MLLIB or H2O (defaults to PY_MEMORY)
guess_policy (str) – Policy to use for setting the default parameters. Valid values are: DEFAULT, SIMPLE_FORMULA, DECISION_TREE, EXPLANATORY and PERFORMANCE (defaults to DEFAULT)
prediction_type (str) – The type of prediction problem this is. If not provided the prediction type will be guessed. Valid values are: BINARY_CLASSIFICATION, REGRESSION, MULTICLASS (defaults to None)
wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms (defaults to True). You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
- Returns:
A ML task handle of type ‘PREDICTION’
- Return type:
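When wait_guess_complete=False is passed, the caller is responsible for waiting before using the task. A minimal sketch of that sequence follows; the helper name is hypothetical, and it relies only on the wait_guess_complete method mentioned in the parameter description.

```python
def start_prediction_task(dataset, target_variable, prediction_type=None):
    """Create a prediction ML task without blocking, then wait for guessing.

    `dataset` is expected to be a DSSDataset handle.
    """
    mltask = dataset.create_prediction_ml_task(
        target_variable,
        prediction_type=prediction_type,
        wait_guess_complete=False,
    )
    # Other setup work could happen here while DSS analyzes the dataset.
    # Guessing must be complete before calling train or get_settings.
    mltask.wait_guess_complete()
    return mltask
```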
- create_clustering_ml_task(input_dataset, ml_backend_type='PY_MEMORY', guess_policy='KMEANS', wait_guess_complete=True)#
Creates a new clustering task in a new visual analysis lab for a dataset.
The returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms.
You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
- Parameters:
input_dataset (string) – The dataset to use for training/testing the model
ml_backend_type (str) – ML backend to use, one of PY_MEMORY, MLLIB or H2O (defaults to PY_MEMORY)
guess_policy (str) – Policy to use for setting the default parameters. Valid values are: KMEANS and ANOMALY_DETECTION (defaults to KMEANS)
wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms (defaults to True). You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
- Returns:
A ML task handle of type ‘CLUSTERING’
- Return type:
- create_timeseries_forecasting_ml_task(target_variable, time_variable, timeseries_identifiers=None, guess_policy='TIMESERIES_DEFAULT', wait_guess_complete=True)#
Creates a new time series forecasting task in a new visual analysis lab for a dataset.
- Parameters:
target_variable (string) – The variable to forecast
time_variable (string) – Column to be used as time variable. Should be a Date (parsed) column.
timeseries_identifiers (list) – List of columns to be used as time series identifiers (when the dataset has multiple series)
guess_policy (string) – Policy to use for setting the default parameters. Valid values are: TIMESERIES_DEFAULT, TIMESERIES_STATISTICAL, and TIMESERIES_DEEP_LEARNING
wait_guess_complete (boolean) – If False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
- Returns:
A ML task handle of type ‘PREDICTION’
- Return type:
- create_causal_prediction_ml_task(outcome_variable, treatment_variable, prediction_type=None, wait_guess_complete=True)#
Creates a new causal prediction task in a new visual analysis lab for a dataset.
- Parameters:
outcome_variable (string) – The outcome variable to predict.
treatment_variable (string) – The treatment variable.
prediction_type (string or None) – Valid values are: “CAUSAL_BINARY_CLASSIFICATION”, “CAUSAL_REGRESSION” or None (in this case prediction_type will be set by the Guesser)
wait_guess_complete (boolean) – If False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
- Returns:
A ML task handle of type ‘PREDICTION’
- Return type:
- create_analysis()#
Create a new visual analysis lab for the dataset.
- Returns:
A visual analysis handle
- Return type:
dataikuapi.dss.analysis.DSSAnalysis
- list_analyses(as_type='listitems')#
List the visual analyses on this dataset
- Parameters:
as_type (str) – How to return the list. Supported values are “listitems” and “objects”, defaults to “listitems”
- Returns:
The list of the analyses. If “as_type” is “listitems”, each one as a dict, If “as_type” is “objects”, each one as a
dataikuapi.dss.analysis.DSSAnalysis
- Return type:
list
- delete_analyses(drop_data=False)#
Delete all analyses that have this dataset as input dataset. Also deletes ML tasks that are part of the analysis
- Parameters:
drop_data (bool) – whether to drop data for all ML tasks in the analysis, defaults to False
- list_statistics_worksheets(as_objects=True)#
List the statistics worksheets associated to this dataset.
- Parameters:
as_objects (bool) – if true, returns the statistics worksheets as dataikuapi.dss.statistics.DSSStatisticsWorksheet, else as a list of dicts
- Return type:
- create_statistics_worksheet(name='My worksheet')#
Create a new worksheet in the dataset, and return a handle to interact with it.
- Parameters:
name (string) – name of the worksheet
- Returns:
a statistic worksheet handle
- Return type:
- get_statistics_worksheet(worksheet_id)#
Get a handle to interact with a statistics worksheet
- Parameters:
worksheet_id (string) – the ID of the desired worksheet
- Returns:
a statistic worksheet handle
- Return type:
- get_last_metric_values(partition='')#
Get the last values of the metrics on this dataset
- Parameters:
partition (string) – (optional) partition identifier, use ALL to retrieve metric values on all data.
- Returns:
a list of metric objects and their value
- Return type:
- get_metric_history(metric, partition='')#
Get the history of the values of the metric on this dataset
- Parameters:
metric (string) – id of the metric to get
partition (string) – (optional) partition identifier, use ALL to retrieve metric history on all data.
- Returns:
a dict containing the values of the metric, cast to the appropriate type (double, boolean,…)
- Return type:
dict
- get_info()#
Retrieve all the information about a dataset
- Returns:
a DSSDatasetInfo containing all the information about a dataset.
- Return type:
- get_zone()#
Get the flow zone of this dataset
- Return type:
- move_to_zone(zone)#
Move this object to a flow zone
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object
- share_to_zone(zone)#
Share this object to a flow zone
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to share the object
- unshare_from_zone(zone)#
Unshare this object from a flow zone
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
from where to unshare the object
- get_usages()#
Get the recipes or analyses referencing this dataset
- Returns:
a list of usages
- Return type:
list[dict]
- get_object_discussions()#
Get a handle to manage discussions on the dataset
- Returns:
the handle to manage discussions
- Return type:
- test_and_detect(infer_storage_types=False)#
Used internally by autodetect_settings(). It is not usually required to call this method.
- Parameters:
infer_storage_types (bool) – whether to infer storage types
- autodetect_settings(infer_storage_types=False)#
Detect appropriate settings for this dataset using Dataiku detection engine
- Parameters:
infer_storage_types (bool) – whether to infer storage types
- Returns:
new suggested settings, which you can apply with DSSDatasetSettings.save()
- Return type:
DSSDatasetSettings or a subclass
- get_as_core_dataset()#
Get the dataiku.Dataset object corresponding to this dataset
- Return type:
- new_code_recipe(type, code=None, recipe_name=None)#
Start the creation of a new code recipe taking this dataset as input.
- Parameters:
type (str) – type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …).
code (str) – the code of the recipe.
recipe_name (str) – (optional) base name for the new recipe.
- Returns:
a handle to the new recipe’s creator object.
- Return type:
Union[dataikuapi.dss.recipe.CodeRecipeCreator, dataikuapi.dss.recipe.PythonRecipeCreator]
- new_recipe(type, recipe_name=None)#
Start the creation of a new recipe taking this dataset as input. For more details, please see
dataikuapi.dss.project.DSSProject.new_recipe()
- Parameters:
type (str) – type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …).
recipe_name (str) – (optional) base name for the new recipe.
- get_data_quality_rules()#
Get a handle to interact with the data quality rules of the dataset.
- Returns:
A handle to the data quality rules of the dataset.
- Return type:
- get_column_lineage(column, max_dataset_count=None)#
Get the full lineage (auto-computed and manual) information of a column in this dataset. Column relations with datasets from both local and foreign projects will be included in the result.
- Parameters:
column (str) – name of the column to retrieve the lineage on.
max_dataset_count (integer) – (optional) the maximum number of datasets to query for. If none, then the max hard limit is used.
- Returns:
the full column lineage (auto-computed and manual) as a list of relations.
- Return type:
list of dict
Listing datasets#
- class dataikuapi.dss.dataset.DSSDatasetListItem(client, data)#
An item in a list of datasets.
Caution
Do not instantiate this class, use
dataikuapi.dss.project.DSSProject.list_datasets()
- to_dataset()#
Gets a handle on the corresponding dataset.
- Returns:
a handle on a dataset
- Return type:
- property name#
Get the name of the dataset.
- Return type:
string
- property id#
Get the identifier of the dataset.
- Return type:
string
- property type#
Get the type of the dataset.
- Return type:
string
- property schema#
Get the dataset schema as a dict.
- Returns:
a dict object of the schema, with the list of columns. See
DSSDataset.get_schema()
- Return type:
dict
- property connection#
Get the name of the connection on which this dataset is attached, or None if there is no connection for this dataset.
- Return type:
string
- get_column(column)#
Get a given column in the dataset schema by its name.
- Parameters:
column (str) – name of the column to find
- Returns:
the column settings or None if column does not exist
- Return type:
dict
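List items expose name and connection as properties, which is enough to summarize a project without fetching each dataset. A sketch, assuming `items` comes from dataikuapi.dss.project.DSSProject.list_datasets(); the helper name is hypothetical.

```python
from collections import defaultdict


def group_names_by_connection(items):
    """Group dataset names by the connection they are attached to.

    Datasets without a connection are grouped under the key None.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item.connection].append(item.name)
    return dict(groups)
```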
Settings of datasets#
- class dataikuapi.dss.dataset.DSSDatasetSettings(dataset, settings)#
Base settings class for a DSS dataset.
Caution
Do not instantiate this class directly, use DSSDataset.get_settings()
Use save() to save your changes
- get_raw()#
Get the raw dataset settings as a dict.
- Return type:
dict
- get_raw_params()#
Get the type-specific params, as a raw dict.
- Return type:
dict
- property type#
Returns the settings type as a string.
- Return type:
string
- property schema_columns#
Get the schema columns settings.
- Returns:
a list of dicts with column settings.
- Return type:
list[dict]
- remove_partitioning()#
Reset partitioning settings to those of a non-partitioned dataset.
- add_discrete_partitioning_dimension(dim_name)#
Add a discrete partitioning dimension to settings.
- Parameters:
dim_name (string) – name of the partition to add.
- add_time_partitioning_dimension(dim_name, period='DAY')#
Add a time partitioning dimension to settings.
- Parameters:
dim_name (string) – name of the partition to add.
period (string) – (optional) time granularity of the created partition. Can be YEAR, MONTH, DAY, HOUR.
- add_raw_schema_column(column)#
Add a column to the schema settings.
- Parameters:
column (dict) – column settings to add.
- property is_feature_group#
Indicates whether the Dataset is defined as a Feature Group, available in the Feature Store.
- Return type:
bool
- set_feature_group(status)#
(Un)sets the dataset as a Feature Group, available in the Feature Store. Changes of this property will be applied when calling save() and require the “Manage Feature Store” permission.
- Parameters:
status (bool) – whether the dataset should be defined as a feature group
- save()#
Save settings.
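As a sketch of how the partitioning methods and save() might be combined, the helper below replaces any existing partitioning with a single time dimension. The helper name is hypothetical, and whether resetting the existing partitioning first is desirable depends on the dataset.

```python
ALLOWED_PERIODS = ("YEAR", "MONTH", "DAY", "HOUR")


def partition_by_time(dataset, dim_name, period="DAY"):
    """Replace the partitioning of `dataset` with one time dimension.

    `dataset` is expected to be a DSSDataset handle.
    """
    if period not in ALLOWED_PERIODS:
        raise ValueError("period must be one of %s" % (ALLOWED_PERIODS,))
    settings = dataset.get_settings()
    # Start from a non-partitioned state, then add the single dimension.
    settings.remove_partitioning()
    settings.add_time_partitioning_dimension(dim_name, period=period)
    # Nothing is applied on the dataset until save() is called.
    settings.save()
    return settings
```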
- class dataikuapi.dss.dataset.SQLDatasetSettings(dataset, settings)#
Settings for a SQL dataset. This class inherits from DSSDatasetSettings.
Caution
Do not instantiate this class directly, use DSSDataset.get_settings()
Use save() to save your changes
- set_table(connection, schema, table, catalog=None)#
Sets this SQL dataset in ‘table’ mode, targeting a particular table of a connection. Leave catalog to None to target the default database associated with the connection.
- class dataikuapi.dss.dataset.FSLikeDatasetSettings(dataset, settings)#
Settings for a files-based dataset. This class inherits from DSSDatasetSettings.
Caution
Do not instantiate this class directly, use DSSDataset.get_settings()
Use save() to save your changes
- set_connection_and_path(connection, path)#
Set connection and path parameters.
- Parameters:
connection (string) – connection to use.
path (string) – path to use.
- get_raw_format_params()#
Get the raw format parameters as a dict.
- Return type:
dict
- set_format(format_type, format_params=None)#
Set format parameters.
- Parameters:
format_type (string) – format type to use.
format_params (dict) – dict of parameters to assign to the formatParams settings section.
- set_csv_format(separator=',', style='excel', skip_rows_before=0, header_row=True, skip_rows_after=0)#
Set format parameters for a csv-based dataset.
- Parameters:
separator (string) – (optional) separator to use, default is ‘,’.
style (string) – (optional) style to use, default is ‘excel’.
skip_rows_before (int) – (optional) number of rows to skip before header, default is 0.
header_row (bool) – (optional) whether or not the header row is parsed, default is true.
skip_rows_after (int) – (optional) number of rows to skip after header, default is 0.
- set_partitioning_file_pattern(pattern)#
Set the dataset partitioning file pattern.
- Parameters:
pattern (str) – pattern to set.
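For instance, a files-based dataset could be switched to tab-separated parsing as sketched below. The helper name is hypothetical, and the hasattr check is just one way to guard against settings objects that are not files-based.

```python
def configure_tsv(dataset, skip_rows_before=0):
    """Configure a files-based dataset to be read as tab-separated values.

    `dataset` is expected to be a DSSDataset handle whose get_settings()
    returns an FSLikeDatasetSettings.
    """
    settings = dataset.get_settings()
    if not hasattr(settings, "set_csv_format"):
        # SQL and other non files-based settings have no CSV format to set.
        raise TypeError("dataset settings do not support file formats")
    settings.set_csv_format(
        separator="\t",
        header_row=True,
        skip_rows_before=skip_rows_before,
    )
    settings.save()
    return settings
```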
Dataset Information#
- class dataikuapi.dss.dataset.DSSDatasetInfo(dataset, info)#
Info class for a DSS dataset (Read-Only).
Caution
Do not instantiate this class directly, use
DSSDataset.get_info()
- get_raw()#
Get the raw dataset full information as a dict
- Returns:
the raw dataset full information
- Return type:
dict
- property last_build_start_time#
The last build start time of the dataset as a datetime.datetime, or None if there is no last build information.
- Returns:
the last build start time
- Return type:
datetime.datetime or None
- property last_build_end_time#
The last build end time of the dataset as a datetime.datetime, or None if there is no last build information.
- Returns:
the last build end time
- Return type:
datetime.datetime or None
- property is_last_build_successful#
Get whether the last build of the dataset is successful.
- Returns:
True if the last build is successful
- Return type:
bool
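Since both build times may be None, code that combines them has to handle the missing case. A small sketch computing the duration of the last build, with a hypothetical helper name and `info` assumed to come from DSSDataset.get_info():

```python
def last_build_duration(info):
    """Return the duration of the last build as a datetime.timedelta,
    or None when there is no last build information."""
    start = info.last_build_start_time
    end = info.last_build_end_time
    if start is None or end is None:
        return None
    return end - start
```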
Creation of managed datasets#
- class dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper(project, dataset_name)#
Provides a helper to create a managed dataset.

```python
import dataiku

client = dataiku.api_client()
project_key = dataiku.default_project_key()
project = client.get_project(project_key)

# create the dataset
builder = project.new_managed_dataset("py_generated")
builder.with_store_into("filesystem_folders")
dataset = builder.create(overwrite=True)

# set up format & schema settings
ds_settings = dataset.get_settings()
ds_settings.set_csv_format()
ds_settings.add_raw_schema_column({'name': 'id', 'type': 'int'})
ds_settings.add_raw_schema_column({'name': 'name', 'type': 'string'})
ds_settings.save()

# put some data; the writer is closed when the with block exits
data = ["foo", "bar"]
with dataset.get_as_core_dataset().get_writer() as writer:
    for idx, val in enumerate(data):
        writer.write_tuple((idx, val))
```

Caution
Do not instantiate directly, use
dataikuapi.dss.project.DSSProject.new_managed_dataset()
- get_creation_settings()#
Get the dataset creation settings as a dict.
- Return type:
dict
- with_store_into(connection, type_option_id=None, format_option_id=None)#
Sets the connection into which to store the new managed dataset
- Parameters:
connection (str) – Name of the connection to store into
type_option_id (str) – If the connection accepts several types of datasets, the type
format_option_id (str) – Optional identifier of a file format option
- Returns:
self
- with_copy_partitioning_from(dataset_ref, object_type='DATASET')#
Sets the new managed dataset to use the same partitioning as an existing dataset
- Parameters:
dataset_ref (str) – Name of the dataset to copy partitioning from
object_type (str) – Type of the object to copy partitioning from, values can be DATASET or FOLDER
- Returns:
self
- create(overwrite=False)#
Executes the creation of the managed dataset according to the selected options
- Parameters:
overwrite (bool, optional) – If the dataset being created already exists, delete it first (removing data), defaults to False
- Returns:
the newly created dataset
- Return type:
- already_exists()#
Check if dataset already exists.
- Returns:
whether this managed dataset already exists
- Return type:
bool
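As an alternative to create(overwrite=True), which removes existing data, already_exists() allows a non-destructive "get or create" flow. A sketch of that flow follows; the helper name is hypothetical and `project` is assumed to be a dataikuapi.dss.project.DSSProject.

```python
def get_or_create_managed_dataset(project, dataset_name, connection):
    """Return a handle on `dataset_name`, creating it on `connection`
    only when it does not exist yet."""
    builder = project.new_managed_dataset(dataset_name)
    if builder.already_exists():
        # Keep the existing dataset and its data untouched.
        return project.get_dataset(dataset_name)
    builder.with_store_into(connection)
    return builder.create()
```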