Managed folders#

Note

There are two main classes related to managed folder handling in Dataiku’s Python APIs:

  • dataiku.Folder, in the dataiku package. It was initially designed for usage within DSS.

  • dataikuapi.dss.managedfolder.DSSManagedFolder, in the dataikuapi package. It was initially designed for usage outside of DSS.

Both classes have fairly similar capabilities, but we recommend using dataiku.Folder within DSS.

For usage information and examples, see Managed folders

dataiku package#

class dataiku.Folder(lookup, project_key=None, ignore_flow=False)#

Handle to interact with a folder.

Note

This class is also available as dataiku.Folder

get_info(sensitive_info=False)#

Get information about the location and settings of this managed folder

Usage sample:

# construct the URL to an S3 object
folder = dataiku.Folder("my_folder_name")
folder_info = folder.get_info()
access_info = folder_info["accessInfo"]
folder_base_url = 's3://%s%s' % (access_info['bucket'], access_info['root'])
target_url = '%s/some/path/to/a/file' % folder_base_url
Parameters:

sensitive_info (boolean) – if True, the credentials of the connection of the managed folder are returned, if they’re accessible to the user. (default: False)

Returns:

information about the folder. Fields are:

  • id : identifier of the folder

  • projectKey : project of the folder

  • name : name of the folder

  • type : type of the folder (S3 / HDFS / GCS / …)

  • directoryBasedPartitioning : whether the partitioning schema of the folder (if any) maps partitions to sub-folders

  • path : path of the folder on the filesystem, for folders on the local filesystem

  • accessInfo : extra information about the filesystem underlying the folder. The exact fields depend on the folder type. Typically contains the connection root and the parts needed to build a full URI, like bucket and storage account name. If the sensitive_info parameter is True, then credentials of the connection will be added (if accessible to the user)

Return type:

dict

get_partition_info(partition)#

Get information about a partition of this managed folder

Parameters:

partition (string) – partition identifier

Returns:

information about the partition. Fields are:

  • id : identifier of the folder

  • projectKey : project of the folder

  • name : name of the folder

  • folder : if the partitioning scheme maps partitions to a subfolder, the path of the subfolder within the managed folder

  • paths : paths of the files in the partition, relative to the managed folder

Return type:

dict

get_path()#

Get the filesystem path of this managed folder.

Important

This method can only be called for managed folders that are stored on the local filesystem of the DSS server. For non-filesystem managed folders (HDFS, S3, …), you need to use the various read/download and write/upload APIs.

Usage example:

# read a model off a local managed folder
import os
import pickle

folder = dataiku.Folder("folder_where_models_are_stored")
with open(os.path.join(folder.get_path(), "path/to/model.pkl"), 'rb') as fd:
    model = pickle.load(fd)
Returns:

a path on the local filesystem that the python process can read from and write to

Return type:

string

is_partitioning_directory_based()#

Whether the partitioning of the folder maps partitions to sub-directories.

Return type:

boolean

list_paths_in_partition(partition='')#

Gets the paths of the files for the given partition.

Parameters:

partition (string) – identifier of the partition. Use ‘’ to get the paths of all the files in the folder, regardless of the partition

Returns:

a list of paths within the folder

Return type:

list[string]

list_partitions()#

Get the partitions in the folder.

Returns:

a list of partition identifiers

Return type:

list[string]
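Combined with list_paths_in_partition(), this allows a quick per-partition inventory of the folder. A minimal sketch; the helper names below are illustrative and not part of the API:

```python
def summarize(partition_paths):
    # Pure helper: map each partition identifier to its file count
    return {p: len(paths) for p, paths in partition_paths.items()}

def partition_inventory(folder):
    # folder is a dataiku.Folder handle; returns {partition_id: file_count}
    return summarize({p: folder.list_paths_in_partition(p)
                      for p in folder.list_partitions()})
```

Inside DSS, partition_inventory(dataiku.Folder("my_folder_name")) would return something like {"2024-01": 12, "2024-02": 9} for a folder partitioned by month (names and values hypothetical).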

get_partition_folder(partition)#

Get the filesystem path of the directory corresponding to the partition (if the partitioning is directory-based).

Parameters:

partition (string) – identifier of the partition

Returns:

sub-path inside the folder that corresponds to the partition. None if the partitioning scheme doesn’t map partitions to sub-folders

Return type:

string

get_id()#

Get the identifier of the folder.

Return type:

string

get_name()#

Get the name of the folder.

Return type:

string

file_path(filename)#

Get the filesystem path for a given file within the folder.

Important

This method can only be called for managed folders that are stored on the local filesystem of the DSS server. For non-filesystem managed folders (HDFS, S3, …), you need to use the various read/download and write/upload APIs.

Parameters:

filename (string) – path of the file within the folder

Returns:

the full path of the file on the local filesystem

Return type:

string

read_json(filename)#

Read a JSON file within the folder and return its parsed content.

Usage example:

folder = dataiku.Folder("my_folder_id")
# write a JSON-serializable object
folder.write_json("/some/path/in/folder", my_object)

# read back the object
my_object_again = folder.read_json("/some/path/in/folder")            
Parameters:

filename (string) – path of the file within the folder

Returns:

the content of the file

Return type:

list or dict, depending on the content of the file

write_json(filename, obj)#

Write a JSON-serializable object as JSON to a file within the folder.

Parameters:
  • filename (string) – Path of the target file within the folder

  • obj (object) – JSON-serializable object to write (generally dict or list)

clear()#

Remove all files from the folder.

clear_partition(partition)#

Remove all files from a specific partition of the folder.

Parameters:

partition (string) – identifier of the partition to clear

clear_path(path)#

Remove a file or directory from the managed folder.

Caution

Deprecated. Use delete_path() instead

Parameters:

path (string) – path inside the folder to the file or directory to delete

delete_path(path)#

Remove a file or directory from the managed folder.

Parameters:

path (string) – path inside the folder to the file or directory to delete

get_path_details(path='/')#

Get details about a specific path (file or directory) in the folder.

Parameters:

path (string) – path inside the folder to the file or directory

Returns:

information about the file or folder at path, as a dict. Fields are:

  • exists : whether there is a file or folder at path

  • directory : True if path denotes a directory in the managed folder, False if it’s a file

  • fullPath : path inside the folder

  • size : if a file, the size in bytes of the file

  • lastModified : last modification time of the file or directory at path, in milliseconds since epoch

  • mimeType : for files, the detected MIME type

  • children : for directories, a list of the contents of the directory, each element having the present structure (not recursive)

Return type:

dict
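Because the children entries are not themselves recursive, walking a whole directory tree requires calling get_path_details() again for each sub-directory. A sketch of such a walk; the iter_files helper is illustrative, not part of the API:

```python
def iter_files(details, fetch):
    # Recursively yield the fullPath of every file below `details`.
    # `fetch` maps a path to its own get_path_details() result, since
    # the `children` entries only describe one level of the tree.
    if not details.get("exists"):
        return
    if not details.get("directory"):
        yield details["fullPath"]
        return
    for child in details.get("children", []):
        if child.get("directory"):
            for path in iter_files(fetch(child["fullPath"]), fetch):
                yield path
        else:
            yield child["fullPath"]
```

Inside DSS, list(iter_files(folder.get_path_details("/"), folder.get_path_details)) would list every file path in the managed folder.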

get_download_stream(path)#

Get a file-like object that allows you to read a single file from this folder.

Usage example:

with folder.get_download_stream("myfile") as stream:
    data = stream.readline()
    print("First line of myfile is: {}".format(data))

Note

The file-like object returned by this method is not seekable.

Parameters:

path (string) – path inside the managed folder

Returns:

the data of the file at path inside the managed folder

Return type:

file-like

upload_stream(path, f)#

Upload the content of a file-like object to a specific path in the managed folder.

If the file already exists, it will be replaced.

# This copies a local file to the managed folder
with open("local_file_to_upload", "rb") as f:
    folder.upload_stream("name_of_file_in_folder", f)
Parameters:
  • path (string) – Target path of the file to write in the managed folder

  • f (stream) – file-like object open for reading

upload_file(path, file_path)#

Upload a local file to a specific path in the managed folder.

If the file already exists, it will be replaced.

Parameters:
  • path (string) – Target path of the file to write in the managed folder

  • file_path (string) – Absolute path to a local file

upload_data(path, data)#

Upload binary data to a specific path in the managed folder.

If the file already exists, it will be replaced.

Parameters:
  • path (string) – Target path of the file to write in the managed folder

  • data (bytes) – the data to upload, as bytes (or str)
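upload_data() is convenient when the content has already been built in memory, for example compressed bytes. A sketch; the helper names and the target path are illustrative, not part of the API:

```python
import gzip
import json

def to_gzipped_json(obj):
    # Pure helper: serialize an object to gzip-compressed JSON bytes
    return gzip.compress(json.dumps(obj).encode("utf-8"))

def upload_gzipped_json(folder, path, obj):
    # folder is a dataiku.Folder handle; writes obj as a .json.gz file at path
    folder.upload_data(path, to_gzipped_json(obj))
```

Inside DSS, upload_gzipped_json(dataiku.Folder("my_folder_name"), "/exports/rows.json.gz", {"rows": [1, 2, 3]}) would create the compressed file in the folder.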

get_writer(path)#

Get a writer object to write incrementally to a specific path in the managed folder.

If the file already exists, it will be replaced.

Parameters:

path (string) – Target path of the file to write in the managed folder

Return type:

dataiku.core.managed_folder.ManagedFolderWriter
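The writer is typically used as a context manager to stream a file piece by piece, which avoids holding the full content in memory. A sketch, assuming the writer accepts bytes; the helper names are illustrative:

```python
def format_csv_row(values):
    # Pure helper: render one CSV row as UTF-8 bytes
    return (",".join(str(v) for v in values) + "\n").encode("utf-8")

def write_rows_incrementally(folder, path, rows):
    # folder is a dataiku.Folder handle; streams rows one by one rather
    # than building the whole file in memory first
    with folder.get_writer(path) as writer:
        for row in rows:
            writer.write(format_csv_row(row))
```

Inside DSS, write_rows_incrementally(dataiku.Folder("my_folder_name"), "/exports/scores.csv", [("id", "score"), (1, 0.5)]) would produce a two-line CSV file.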

get_last_metric_values(partition='')#

Get the set of last values of the metrics on this folder.

Parameters:

partition (string) – (optional) a partition identifier to get metrics for. If not set, returns the metrics of a non-partitioned folder, and the metrics of the whole managed folder for a partitioned managed folder

Return type:

dataiku.core.metrics.ComputedMetrics

get_metric_history(metric_lookup, partition='')#

Get the set of all values a given metric took on this folder.

Parameters:
  • metric_lookup (string) – metric name or unique identifier

  • partition (string) – (optional) a partition identifier to get metrics for. If not set, returns the metrics of a non-partitioned folder, and the metrics of the whole managed folder for a partitioned managed folder

Returns:

an object containing the values of the metric_lookup metric, cast to the appropriate type (double, boolean, …). Top-level fields are:

  • metricId : identifier of the metric

  • metric : dict of the metric’s definition

  • valueType : type of the metric values in the values array

  • lastValue : most recent value, as a dict of:

    • time : timestamp of the value computation

    • value : value of the metric at time

  • values : list of values, each one a dict of the same structure as lastValue

Return type:

dict
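The documented fields can be consumed directly, for instance to extract the (time, value) series of a metric. A sketch relying only on the structure described above; the helper and the example values are illustrative:

```python
def metric_series(history):
    # Pure helper: turn a get_metric_history() result into a list of
    # (time, value) tuples, oldest first
    return sorted((v["time"], v["value"]) for v in history.get("values", []))

# Shape of the documented structure (illustrative values)
example = {
    "metricId": "basic:COUNT_FILES",
    "valueType": "BIGINT",
    "lastValue": {"time": 1700000000000, "value": 4},
    "values": [
        {"time": 1690000000000, "value": 3},
        {"time": 1700000000000, "value": 4},
    ],
}
print(metric_series(example))  # oldest measurement first
```

Inside DSS, metric_series(folder.get_metric_history("basic:COUNT_FILES")) would return the full history of the file count.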

save_external_metric_values(values_dict, partition='')#

Save metrics on this folder. The metrics are saved with the type “external”.

Parameters:
  • values_dict (dict) – the values to save, as a dict. The keys of the dict are used as metric names

  • partition (string) – (optional) the partition for which to save the values. On partitioned folders, the partition value to use for accessing metrics on the whole folder (i.e. all partitions) is ALL

get_last_check_values(partition='')#

Get the set of last values of the checks on this folder, as a dataiku.core.metrics.ComputedChecks object

Parameters:

partition (string) – (optional) the partition for which to fetch the values. On partitioned folders, the partition value to use for accessing checks on the whole folder (i.e. all partitions) is ALL

Return type:

dataiku.core.metrics.ComputedChecks

save_external_check_values(values_dict, partition='')#

Save checks on this folder. The checks are saved with the type “external”.

Parameters:
  • values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names

  • partition (string) – (optional) the partition for which to save the values. On partitioned folders, the partition value to use for accessing checks on the whole folder (i.e. all partitions) is ALL

dataikuapi package#

This class is preferably used outside of DSS.

class dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, odb_id)#

A handle to interact with a managed folder on the DSS instance.

Important

Do not create this class directly, instead use dataikuapi.dss.project.DSSProject.get_managed_folder()

property id#

Returns the internal identifier of the managed folder, which is an 8-character random string, not to be confused with the managed folder’s name.

Return type:

string

delete()#

Delete the managed folder from the flow, along with the objects using it (recipes or labeling tasks)

Attention

This call doesn’t delete the managed folder’s contents

get_definition()#

Get the definition of this managed folder. The definition contains name, description, checklists, tags, connection and path parameters, metrics and checks setup.

Caution

Deprecated. Please use get_settings()

Returns:

the managed folder definition.

Return type:

dict

set_definition(definition)#

Set the definition of this managed folder.

Caution

Deprecated. Please use get_settings() then save()

Note

the fields id and projectKey can’t be modified

Usage example:

folder_definition = folder.get_definition()
folder_definition['tags'] = ['tag1','tag2']
folder.set_definition(folder_definition)
Parameters:

definition (dict) – the new state of the definition for the folder. You should only set a definition object that has been retrieved using the get_definition() call

Returns:

a message upon successful completion of the definition update. Only contains one msg field

Return type:

dict

get_settings()#

Returns the settings of this managed folder as a DSSManagedFolderSettings.

You must use save() on the returned object to make your changes effective on the managed folder.

# Example: activating discrete partitioning
folder = project.get_managed_folder("my_folder_id")
settings = folder.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()
Returns:

the settings of the managed folder

Return type:

DSSManagedFolderSettings

list_contents()#

Get the list of files in the managed folder

Usage example:

from datetime import datetime

for content in folder.list_contents()['items']:
    last_modified_seconds = content["lastModified"] / 1000
    last_modified_str = datetime.fromtimestamp(last_modified_seconds).strftime("%Y-%m-%d %H:%M:%S")
    print("size=%s mtime=%s %s" % (content["size"], last_modified_str, content["path"]))
Returns:

the list of files, in the items field. Each item has fields:

  • path : path of the file inside the folder

  • size : size of the file in bytes

  • lastModified : last modification time, in milliseconds since epoch

Return type:

dict

get_file(path)#

Get a file from the managed folder

Usage example:

import pandas

with folder.get_file("/kaggle_titanic_train.csv") as fd:
    df = pandas.read_csv(fd.raw)
Parameters:

path (string) – the path of the file to read within the folder

Returns:

the HTTP request to stream the data from

Return type:

requests.models.Response

delete_file(path)#

Delete a file from the managed folder

Parameters:

path (string) – the path of the file to delete within the folder

Note

No error is raised if the file doesn’t exist

put_file(path, f)#

Upload the file to the managed folder. If the file already exists in the folder, it is overwritten.

Usage example:

with open("./some_local.csv", "rb") as fd:
    uploaded = folder.put_file("/target.csv", fd).json()
    print("Uploaded %s bytes" % uploaded["size"])
Parameters:
  • path (string) – the path of the file to write within the folder

  • f (file) – a file-like

Note

If a string is passed as the f parameter, the string itself is taken as the file content to upload

Returns:

information on the file uploaded to the folder, as a dict of:

  • path : path of the file inside the folder

  • size : size of the file in bytes

  • lastModified : last modification time, in milliseconds since epoch

Return type:

dict

upload_folder(path, folder)#

Upload the content of a folder to a managed folder.

Note

upload_folder(“/some/target”, “./a/source/”) will result in “target” containing the contents of “source”, but not the “source” folder being a child of “target”

Parameters:
  • path (str) – the destination path of the folder in the managed folder

  • folder (str) – local path (absolute or relative) of the source folder to upload

compute_metrics(metric_ids=None, probes=None)#

Compute metrics on this managed folder.

Usage example:

from dataikuapi.dss.future import DSSFuture

future_resp = folder.compute_metrics()
future = DSSFuture(client, future_resp.get("jobId", None), future_resp)
metrics = future.wait_for_result()
print("Computed in %s ms" % (metrics["endTime"] - metrics["startTime"]))
for computed in metrics["computed"]:
    print("Metric %s = %s" % (computed["metricId"], computed["value"]))
Parameters:
  • metric_ids (list[string]) – (optional) identifiers of metrics to compute, among the metrics defined on the folder

  • probes (dict) – (optional) definition of metrics probes to use, in place of the ones defined on the folder. The current set of probes on the folder is the probes field in the dict returned by get_definition()

Returns:

a future as dict representing the task of computing the probes

Return type:

dict

get_last_metric_values()#

Get the last values of the metrics on this managed folder.

Returns:

a handle on the values of the metrics

Return type:

dataikuapi.dss.metrics.ComputedMetrics

get_metric_history(metric)#

Get the history of the values of a metric on this managed folder.

Usage example:

from datetime import datetime

history = folder.get_metric_history("basic:COUNT_FILES")
for value in history["values"]:
    time_str = datetime.fromtimestamp(value["time"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
    print("%s : %s" % (time_str, value["value"]))
Parameters:

metric (string) – identifier of the metric to get values of

Returns:

an object containing the values of the metric, cast to the appropriate type (double, boolean,…). The identifier of the metric is in a metricId field.

Return type:

dict

get_zone()#

Get the flow zone of this managed folder.

Returns:

a flow zone

Return type:

dataikuapi.dss.flow.DSSFlowZone

move_to_zone(zone)#

Move this object to a flow zone.

Parameters:

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to move the object, or its identifier

share_to_zone(zone)#

Share this object to a flow zone.

Parameters:

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to share the object, or its identifier

unshare_from_zone(zone)#

Unshare this object from a flow zone.

Parameters:

zone (object) – a dataikuapi.dss.flow.DSSFlowZone from where to unshare the object, or its identifier

get_usages()#

Get the recipes referencing this folder.

Usage example:

for usage in folder.get_usages():
    if usage["type"] == 'RECIPE_INPUT':
        print("Used as input of %s" % usage["objectId"])
Returns:

a list of usages, each one a dict of:

  • type : the type of usage, either “RECIPE_INPUT” or “RECIPE_OUTPUT”

  • objectId : name of the recipe

  • objectProjectKey : project of the recipe

Return type:

list[dict]

get_object_discussions()#

Get a handle to manage discussions on the managed folder.

Returns:

the handle to manage discussions

Return type:

dataikuapi.dss.discussion.DSSObjectDiscussions

copy_to(target, write_mode='OVERWRITE')#

Copy the data of this folder to another folder.

Parameters:
  • target (object) – a dataikuapi.dss.managedfolder.DSSManagedFolder representing the target location of this copy

  • write_mode (string) – (optional) the write mode, defaults to OVERWRITE

Returns:

a DSSFuture representing the operation

Return type:

dataikuapi.dss.future.DSSFuture

create_dataset_from_files(dataset_name)#

Create a new dataset of type ‘FilesInFolder’, taking its files from this managed folder, and return a handle to interact with it.

The created dataset does not have its format and schema initialized; it is recommended to use autodetect_settings() on the returned object

Parameters:

dataset_name (str) – the name of the dataset to create. Must not already exist

Returns:

A dataset handle

Return type:

dataikuapi.dss.dataset.DSSDataset

class dataikuapi.dss.managedfolder.DSSManagedFolderSettings(folder, settings)#

Base settings class for a DSS managed folder. Do not instantiate this class directly, use DSSManagedFolder.get_settings()

Use save() to save your changes

get_raw()#

Get the managed folder settings.

Returns:

the settings, as a dict. The definition of the actual location of the files in the managed folder is a params sub-dict.

Return type:

dict

get_raw_params()#

Get the type-specific (S3/ filesystem/ HDFS/ …) params as a dict.

Returns:

the type-specific params. Each type defines a set of fields; commonly found fields are:

  • connection : name of the connection used by the managed folder

  • path : root of the managed folder within the connection

  • bucket or container : the bucket/container name on cloud storages

Return type:

dict

property type#

Get the type of filesystem that the managed folder uses.

Return type:

string

save()#

Save the changes to the settings on the managed folder.

Usage example:

folder = project.get_managed_folder("my_folder_id")
settings = folder.get_settings()
settings.set_connection_and_path("some_S3_connection", None)
settings.get_raw_params()["bucket"] = "some_S3_bucket"
settings.save()

remove_partitioning()#

Make the managed folder non-partitioned.

add_discrete_partitioning_dimension(dim_name)#

Add a discrete partitioning dimension.

Parameters:

dim_name (string) – name of the partitioning dimension

add_time_partitioning_dimension(dim_name, period='DAY')#

Add a time partitioning dimension.

Parameters:
  • dim_name (string) – name of the partitioning dimension

  • period (string) – granularity of the partitioning dimension (YEAR, MONTH, DAY (default), HOUR)

set_partitioning_file_pattern(pattern)#

Set the partitioning pattern of the folder. The pattern indicates which paths inside the folder belong to which partition. Partition dimensions are written with:

  • %{dim_name} for discrete dimensions

  • %Y (=year), %M (=month), %D (=day) and %H (=hour) for time dimensions

Besides the %… variables for injecting the partition dimensions, the pattern is a regular expression.

Usage example:

# partition a managed folder by month
folder = project.get_managed_folder("my_folder_id")
settings = folder.get_settings()
settings.add_time_partitioning_dimension("my_date", "MONTH")
settings.set_partitioning_file_pattern("/year=%Y/month=%M/.*")
settings.save()
Parameters:

pattern (string) – the partitioning pattern

set_connection_and_path(connection, path)#

Change the managed folder connection and/or path.

Note

When changing the connection or path, the folder’s files aren’t moved or copied to the new location

Attention

When changing the connection to a connection of a different type, for example going from an S3 connection to an Azure Blob Storage connection, only the managed folder type is changed. Type-specific fields are not converted. In the example of an S3 to Azure conversion, the S3 bucket isn’t converted to a storage account container.

Parameters:
  • connection (string) – the name of a file-based connection. If None, the connection of the managed folder is left unchanged

  • path (string) – a path relative to the connection root. If None, the path of the managed folder is left unchanged