Managed folders#
Note
There are two main classes related to managed folder handling in Dataiku’s Python APIs:
dataiku.Folder
in the dataiku package. It was initially designed for usage within DSS, in recipes and Jupyter notebooks.
dataikuapi.dss.managedfolder.DSSManagedFolder
in the dataikuapi package. It was initially designed for usage outside of DSS.
Both classes have fairly similar capabilities, but we recommend using dataiku.Folder within DSS.
For usage information and examples, see Managed folders.
dataiku package#
- class dataiku.Folder(lookup, project_key=None, ignore_flow=False)#
Handle to interact with a folder.
Note
This class is also available as
dataiku.Folder
- get_info(sensitive_info=False)#
Get information about the location and settings of this managed folder
Usage sample:
# construct the URL to an S3 object
folder = dataiku.Folder("my_folder_name")
folder_info = folder.get_info()
access_info = folder_info["accessInfo"]
folder_base_url = 's3://%s%s' % (access_info['bucket'], access_info['root'])
target_url = '%s/some/path/to/a/file' % folder_base_url
- Parameters:
sensitive_info (boolean) – if True, the credentials of the connection of the managed folder are returned, if they’re accessible to the user. (default: False)
- Returns:
information about the folder. Fields are:
id : identifier of the folder
projectKey : project of the folder
name : name of the folder
type : type of the folder (S3 / HDFS / GCS / …)
directoryBasedPartitioning : whether the partitioning schema of the folder (if any) maps partitions to sub-folders
path : path of the folder on the filesystem, for folders on the local filesystem
accessInfo : extra information about the filesystem underlying the folder. The exact fields depend on the folder type. Typically contains the connection root and the parts needed to build a full URI, like bucket and storage account name. If the sensitive_info parameter is True, then credentials of the connection will be added (if accessible to the user)
- Return type:
dict
- get_partition_info(partition)#
Get information about a partition of this managed folder
- Parameters:
partition (string) – partition identifier
- Returns:
information about the partition. Fields are:
id : identifier of the folder
projectKey : project of the folder
name : name of the folder
folder : if the partitioning scheme maps partitions to a subfolder, the path of the subfolder within the managed folder
paths : paths of the files in the partition, relative to the managed folder
- Return type:
dict
- get_path()#
Get the filesystem path of this managed folder.
Important
This method can only be called for managed folders that are stored on the local filesystem of the DSS server. For non-filesystem managed folders (HDFS, S3, …), you need to use the various read/download and write/upload APIs.
Usage example:
# read a model off a local managed folder
folder = dataiku.Folder("folder_where_models_are_stored")
with open(os.path.join(folder.get_path(), "path/to/model.pkl"), 'rb') as fd:
    model = pickle.load(fd)
- Returns:
a path on the local filesystem that the python process can read from and write to
- Return type:
string
- is_partitioning_directory_based()#
Whether the partitioning of the folder maps partitions to sub-directories.
- Return type:
boolean
- list_paths_in_partition(partition='')#
Gets the paths of the files for the given partition.
- Parameters:
partition (string) – identifier of the partition. Use ‘’ to get the paths of all the files in the folder, regardless of the partition
- Returns:
a list of paths within the folder
- Return type:
list[string]
- list_partitions()#
Get the partitions in the folder.
- Returns:
a list of partition identifiers
- Return type:
list[string]
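The two listing methods above can be combined to enumerate a folder's contents partition by partition. A minimal sketch follows; the `group_by_extension` helper is not part of the Dataiku API, and the folder name and DSS calls (shown commented) assume a running DSS instance.

```python
import os

def group_by_extension(paths):
    """Group a list of folder-relative paths by file extension."""
    groups = {}
    for p in paths:
        ext = os.path.splitext(p)[1] or "<none>"
        groups.setdefault(ext, []).append(p)
    return groups

# Against a running DSS instance (folder name is hypothetical):
# folder = dataiku.Folder("my_folder_name")
# for partition in folder.list_partitions():
#     paths = folder.list_paths_in_partition(partition)
#     print(partition, group_by_extension(paths))
```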
- get_partition_folder(partition)#
Get the filesystem path of the directory corresponding to the partition (if the partitioning is directory-based).
- Parameters:
partition (string) – identifier of the partition
- Returns:
sub-path inside the folder that corresponds to the partition. None if the partitioning scheme doesn’t map partitions to sub-folders
- Return type:
string
- get_id()#
Get the identifier of the folder.
- Return type:
string
- get_name()#
Get the name of the folder.
- Return type:
string
- file_path(filename)#
Get the filesystem path for a given file within the folder.
Important
This method can only be called for managed folders that are stored on the local filesystem of the DSS server. For non-filesystem managed folders (HDFS, S3, …), you need to use the various read/download and write/upload APIs.
- Parameters:
filename (string) – path of the file within the folder
- Returns:
the full path of the file on the local filesystem
- Return type:
string
- read_json(filename)#
Read a JSON file within the folder and return its parsed content.
Usage example:
folder = dataiku.Folder("my_folder_id")
# write a JSON-serializable object
folder.write_json("/some/path/in/folder", my_object)
# read back the object
my_object_again = folder.read_json("/some/path/in/folder")
- Parameters:
filename (string) – path of the file within the folder
- Returns:
the content of the file
- Return type:
list or dict, depending on the content of the file
- write_json(filename, obj)#
Write a JSON-serializable object as JSON to a file within the folder.
- Parameters:
filename (string) – Path of the target file within the folder
obj (object) – JSON-serializable object to write (generally dict or list)
- clear()#
Remove all files from the folder.
- clear_partition(partition)#
Remove all files from a specific partition of the folder.
- Parameters:
partition (string) – identifier of the partition to clear
- clear_path(path)#
Remove a file or directory from the managed folder.
Caution
Deprecated. Use
delete_path()
instead.
- Parameters:
path (string) – path inside the folder to the file or directory to delete
- delete_path(path)#
Remove a file or directory from the managed folder.
- Parameters:
path (string) – path inside the folder to the file or directory to delete
- get_path_details(path='/')#
Get details about a specific path (file or directory) in the folder.
- Parameters:
path (string) – path inside the folder to the file or directory
- Returns:
information about the file or folder at path, as a dict. Fields are:
exists : whether there is a file or folder at path
directory : True if path denotes a directory in the managed folder, False if it’s a file
fullPath : path inside the folder
size : if a file, the size in bytes of the file
lastModified : last modification time of the file or directory at path, in milliseconds since epoch
mimeType : for files, the detected MIME type
children : for directories, a list of the contents of the directory, each element having the present structure (not recursive)
- Return type:
dict
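Since lastModified is expressed in milliseconds since epoch, a small conversion is usually needed before display. A sketch with a hypothetical `mtime_str` helper (not part of the API); the commented lines assume a running DSS instance and a hypothetical folder name and path.

```python
from datetime import datetime, timezone

def mtime_str(details):
    """Format the lastModified field (milliseconds since epoch) as a UTC string."""
    ts = details["lastModified"] / 1000
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Against a running DSS instance (folder name and path are hypothetical):
# folder = dataiku.Folder("my_folder_name")
# details = folder.get_path_details("/some/file.csv")
# if details["exists"] and not details["directory"]:
#     print(details["fullPath"], details["size"], mtime_str(details))
```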
- get_download_stream(path)#
Get a file-like object that allows you to read a single file from this folder.
Usage example:
with folder.get_download_stream("myfile") as stream:
    data = stream.readline()
    print("First line of myfile is: {}".format(data))
Note
The file-like returned by this method is not seekable.
- Parameters:
path (string) – path inside the managed folder
- Returns:
the data of the file at path inside the managed folder
- Return type:
file-like
- upload_stream(path, f)#
Upload the content of a file-like object to a specific path in the managed folder.
If the file already exists, it will be replaced.
# This copies a local file to the managed folder
with open("local_file_to_upload") as f:
    folder.upload_stream("name_of_file_in_folder", f)
- Parameters:
path (string) – Target path of the file to write in the managed folder
f (stream) – file-like object open for reading
- upload_file(path, file_path)#
Upload a local file to a specific path in the managed folder.
If the file already exists, it will be replaced.
- Parameters:
path (string) – Target path of the file to write in the managed folder
file_path (string) – Absolute path to a local file
- upload_data(path, data)#
Upload binary data to a specific path in the managed folder.
If the file already exists, it will be replaced.
- Parameters:
path (string) – Target path of the file to write in the managed folder
data (bytes) – the data to upload
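Since upload_data expects bytes, in-memory objects are typically serialized and encoded first. A minimal sketch; the `to_upload_bytes` helper is not part of the API, and the commented DSS calls assume a running instance with hypothetical folder and file names.

```python
import json

def to_upload_bytes(obj):
    """Serialize a Python object to UTF-8 JSON bytes suitable for upload_data."""
    return json.dumps(obj, sort_keys=True).encode("utf-8")

# Against a running DSS instance (folder name and path are hypothetical):
# folder = dataiku.Folder("my_folder_name")
# folder.upload_data("/config/settings.json", to_upload_bytes({"threshold": 0.5}))
```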
- get_writer(path)#
Get a writer object to write incrementally to a specific path in the managed folder.
If the file already exists, it will be replaced.
- Parameters:
path (string) – Target path of the file to write in the managed folder
- Return type:
dataiku.core.managed_folder.ManagedFolderWriter
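A writer is useful for producing a file incrementally, for example streaming rows out as CSV. A sketch under stated assumptions: the `csv_lines` helper is not part of the API (and does no quoting or escaping), and the commented DSS usage assumes a running instance and hypothetical names.

```python
def csv_lines(rows):
    """Yield naive CSV byte lines for tuples of values (no quoting/escaping)."""
    for row in rows:
        yield (",".join(str(v) for v in row) + "\n").encode("utf-8")

# Against a running DSS instance (folder name and path are hypothetical):
# folder = dataiku.Folder("my_folder_name")
# with folder.get_writer("/exports/data.csv") as writer:
#     for line in csv_lines([("id", "score"), (1, 0.9), (2, 0.4)]):
#         writer.write(line)
```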
- get_last_metric_values(partition='')#
Get the set of last values of the metrics on this folder.
- Parameters:
partition (string) – (optional) a partition identifier to get metrics for. If not set, returns the metrics of a non-partitioned folder, and the metrics of the whole managed folder for a partitioned managed folder
- Return type:
- get_metric_history(metric_lookup, partition='')#
Get the set of all values a given metric took on this folder.
- Parameters:
metric_lookup (string) – metric name or unique identifier
partition (string) – (optional) a partition identifier to get metrics for. If not set, returns the metrics of a non-partitioned folder, and the metrics of the whole managed folder for a partitioned managed folder
- Returns:
an object containing the values of the metric_lookup metric, cast to the appropriate type (double, boolean, …). Top-level fields are:
metricId : identifier of the metric
metric : dict of the metric’s definition
valueType : type of the metric values in the values array
lastValue : most recent value, as a dict of:
time : timestamp of the value computation
value : value of the metric at time
values : list of values, each one a dict of the same structure as lastValue
- Return type:
dict
- save_external_metric_values(values_dict, partition='')#
Save metrics on this folder. The metrics are saved with the type “external”.
- Parameters:
values_dict (dict) – the values to save, as a dict. The keys of the dict are used as metric names
partition (string) – (optional), the partition for which to save the values. On partitioned folders, the partition value to use for saving metrics on the whole dataset (ie. all partitions) is ALL
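External metrics are plain name-to-value pairs, so they can be computed with ordinary Python before saving. A hedged sketch: the `size_metrics` helper is not part of the API, and the commented DSS calls assume a running instance with a hypothetical folder name.

```python
def size_metrics(sizes):
    """Build a metric-name -> value dict from a list of file sizes in bytes."""
    return {
        "file_count": len(sizes),
        "total_bytes": sum(sizes),
        "max_bytes": max(sizes) if sizes else 0,
    }

# Against a running DSS instance (folder name is hypothetical):
# folder = dataiku.Folder("my_folder_name")
# sizes = [folder.get_path_details(p)["size"] for p in folder.list_paths_in_partition()]
# folder.save_external_metric_values(size_metrics(sizes))
```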
- get_last_check_values(partition='')#
Get the set of last values of the checks on this folder, as a
dataiku.core.metrics.ComputedChecks
object.
- Parameters:
partition (string) – (optional), the partition for which to fetch the values. On partitioned folders, the partition value to use for accessing metrics on the whole dataset (ie. all partitions) is ALL
- Return type:
dataiku.core.metrics.ComputedChecks
- save_external_check_values(values_dict, partition='')#
Save checks on this folder. The checks are saved with the type “external”.
- Parameters:
values_dict (dict) – the values to save, as a dict. The keys of the dict are used as check names
partition (string) – (optional), the partition for which to save the values. On partitioned folders, the partition value to use for saving checks on the whole dataset (ie. all partitions) is ALL
dataikuapi package#
Preferably use this class outside of DSS.
- class dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, odb_id)#
A handle to interact with a managed folder on the DSS instance.
Important
Do not create this class directly, instead use
dataikuapi.dss.project.DSSProject.get_managed_folder()
- property id#
Returns the internal identifier of the managed folder, an 8-character random string, not to be confused with the managed folder’s name.
- Return type:
string
- delete()#
Delete the managed folder from the flow, along with the objects using it (recipes or labeling tasks)
Attention
This call doesn’t delete the managed folder’s contents
- get_definition()#
Get the definition of this managed folder. The definition contains name, description, checklists, tags, connection and path parameters, metrics and checks setup.
Caution
Deprecated. Please use
get_settings()
- Returns:
the managed folder definition.
- Return type:
dict
- set_definition(definition)#
Set the definition of this managed folder.
Caution
Deprecated. Please use
get_settings()
then save()
Note
the fields id and projectKey can’t be modified
Usage example:
folder_definition = folder.get_definition()
folder_definition['tags'] = ['tag1','tag2']
folder.set_definition(folder_definition)
- Parameters:
definition (dict) – the new state of the definition for the folder. You should only set a definition object that has been retrieved using the
get_definition()
call.
- Returns:
a message upon successful completion of the definition update. Only contains one msg field
- Return type:
dict
- get_settings()#
Returns the settings of this managed folder as a
DSSManagedFolderSettings
object.
You must use
save()
on the returned object to make your changes effective on the managed folder.
Usage example:
# Example: activating discrete partitioning
folder = project.get_managed_folder("my_folder_id")
settings = folder.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()
- Returns:
the settings of the managed folder
- Return type:
DSSManagedFolderSettings
- list_contents()#
Get the list of files in the managed folder
Usage example:
for content in folder.list_contents()['items']:
    last_modified_seconds = content["lastModified"] / 1000
    last_modified_str = datetime.fromtimestamp(last_modified_seconds).strftime("%Y-%m-%d %H:%M:%S")
    print("size=%s mtime=%s %s" % (content["size"], last_modified_str, content["path"]))
- Returns:
the list of files, in the items field. Each item has fields:
path : path of the file inside the folder
size : size of the file in bytes
lastModified : last modification time, in milliseconds since epoch
- Return type:
dict
- get_file(path)#
Get a file from the managed folder
Usage example:
with folder.get_file("/kaggle_titanic_train.csv") as fd:
    df = pandas.read_csv(fd.raw)
- Parameters:
path (string) – the path of the file to read within the folder
- Returns:
the HTTP request to stream the data from
- Return type:
requests.models.Response
- delete_file(path)#
Delete a file from the managed folder
- Parameters:
path (string) – the path of the file to delete within the folder
Note
No error is raised if the file doesn’t exist
- put_file(path, f)#
Upload the file to the managed folder. If the file already exists in the folder, it is overwritten.
Usage example:
with open("./some_local.csv") as fd:
    uploaded = folder.put_file("/target.csv", fd).json()
    print("Uploaded %s bytes" % uploaded["size"])
- Parameters:
path (string) – the path of the file to write within the folder
f (file) – a file-like
Note
if using a string for the f parameter, the string itself is taken as the file content to upload
- Returns:
information on the file uploaded to the folder, as a dict of:
path : path of the file inside the folder
size : size of the file in bytes
lastModified : last modification time, in milliseconds since epoch
- Return type:
dict
- upload_folder(path, folder)#
Upload the content of a folder to a managed folder.
Note
upload_folder(“/some/target”, “./a/source/”) will result in “target” containing the contents of “source”, but not the “source” folder being a child of “target”
- Parameters:
path (str) – the destination path of the folder in the managed folder
folder (str) – local path (absolute or relative) of the source folder to upload
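It can help to preview which target paths an upload will create before calling upload_folder. A hedged sketch: the `destination_paths` helper is not part of the API, and the commented DSS calls assume a running instance with hypothetical names.

```python
import os

def destination_paths(local_root, target_root):
    """Preview the folder paths that uploading local_root under target_root produces."""
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            rel = os.path.relpath(local_path, local_root)
            mapping[local_path] = target_root.rstrip("/") + "/" + rel.replace(os.sep, "/")
    return mapping

# Against a running DSS instance (names are hypothetical):
# folder = project.get_managed_folder("my_folder_id")
# folder.upload_folder("/some/target", "./a/source/")
```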
- compute_metrics(metric_ids=None, probes=None)#
Compute metrics on this managed folder.
Usage example:
future_resp = folder.compute_metrics()
future = DSSFuture(client, future_resp.get("jobId", None), future_resp)
metrics = future.wait_for_result()
print("Computed in %s ms" % (metrics["endTime"] - metrics["startTime"]))
for computed in metrics["computed"]:
    print("Metric %s = %s" % (computed["metricId"], computed["value"]))
- Parameters:
metric_ids (list[string]) – (optional) identifiers of metrics to compute, among the metrics defined on the folder
probes (dict) – (optional) definition of metrics probes to use, in place of the ones defined on the folder. The current set of probes on the folder is the probes field in the dict returned by
get_definition()
- Returns:
a future as dict representing the task of computing the probes
- Return type:
dict
- get_last_metric_values()#
Get the last values of the metrics on this managed folder.
- Returns:
a handle on the values of the metrics
- Return type:
- get_metric_history(metric)#
Get the history of the values of a metric on this managed folder.
Usage example:
history = folder.get_metric_history("basic:COUNT_FILES")
for value in history["values"]:
    time_str = datetime.fromtimestamp(value["time"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
    print("%s : %s" % (time_str, value["value"]))
- Parameters:
metric (string) – identifier of the metric to get values of
- Returns:
an object containing the values of the metric, cast to the appropriate type (double, boolean,…). The identifier of the metric is in a metricId field.
- Return type:
dict
- get_zone()#
Get the flow zone of this managed folder.
- Returns:
a flow zone
- Return type:
- move_to_zone(zone)#
Move this object to a flow zone.
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object, or its identifier
- share_to_zone(zone)#
Share this object to a flow zone.
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to share the object, or its identifier
- unshare_from_zone(zone)#
Unshare this object from a flow zone.
- Parameters:
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
from where to unshare the object, or its identifier
- get_usages()#
Get the recipes referencing this folder.
Usage example:
for usage in folder.get_usages():
    if usage["type"] == 'RECIPE_INPUT':
        print("Used as input of %s" % usage["objectId"])
- Returns:
a list of usages, each one a dict of:
type : the type of usage, either “RECIPE_INPUT” or “RECIPE_OUTPUT”
objectId : name of the recipe
objectProjectKey : project of the recipe
- Return type:
list[dict]
- get_object_discussions()#
Get a handle to manage discussions on the managed folder.
- Returns:
the handle to manage discussions
- Return type:
- copy_to(target, write_mode='OVERWRITE')#
Copy the data of this folder to another folder.
- Parameters:
target (object) – a
dataikuapi.dss.managedfolder.DSSManagedFolder
representing the target location of this copy
- Returns:
a DSSFuture representing the operation
- Return type:
- create_dataset_from_files(dataset_name)#
Create a new dataset of type ‘FilesInFolder’, taking its files from this managed folder, and return a handle to interact with it.
The created dataset does not have its format and schema initialized, it is recommended to use
autodetect_settings()
on the returned object.
- Parameters:
dataset_name (str) – the name of the dataset to create. Must not already exist
- Returns:
A dataset handle
- Return type:
- class dataikuapi.dss.managedfolder.DSSManagedFolderSettings(folder, settings)#
Base settings class for a DSS managed folder. Do not instantiate this class directly, use
DSSManagedFolder.get_settings()
Use
save()
to save your changes.
- get_raw()#
Get the managed folder settings.
- Returns:
the settings, as a dict. The definition of the actual location of the files in the managed folder is a params sub-dict.
- Return type:
dict
- get_raw_params()#
Get the type-specific (S3/ filesystem/ HDFS/ …) params as a dict.
- Returns:
the type-specific params. Each type defines a set of fields; commonly found fields are:
connection : name of the connection used by the managed folder
path : root of the managed folder within the connection
bucket or container : the bucket/container name on cloud storages
- Return type:
dict
- property type#
Get the type of filesystem that the managed folder uses.
- Return type:
string
- save()#
Save the changes to the settings on the managed folder.
Usage example:
folder = project.get_managed_folder("my_folder_id")
settings = folder.get_settings()
settings.set_connection_and_path("some_S3_connection", None)
settings.get_raw_params()["bucket"] = "some_S3_bucket"
settings.save()
- remove_partitioning()#
Make the managed folder non-partitioned.
- add_discrete_partitioning_dimension(dim_name)#
Add a discrete partitioning dimension.
- Parameters:
dim_name (string) – name of the partitioning dimension
- add_time_partitioning_dimension(dim_name, period='DAY')#
Add a time partitioning dimension.
- Parameters:
dim_name (string) – name of the partitioning dimension
period (string) – granularity of the partitioning dimension (YEAR, MONTH, DAY (default), HOUR)
- set_partitioning_file_pattern(pattern)#
Set the partitioning pattern of the folder. The pattern indicates which paths inside the folder belong to which partition. Partition dimensions are written with:
%{dim_name} for discrete dimensions
%Y (=year), %M (=month), %D (=day) and %H (=hour) for time dimensions
Besides the %… variables for injecting the partition dimensions, the pattern is a regular expression.
Usage example:
# partition a managed folder by month
folder = project.get_managed_folder("my_folder_id")
settings = folder.get_settings()
settings.add_time_partitioning_dimension("my_date", "MONTH")
settings.set_partitioning_file_pattern("/year=%Y/month=%M/.*")
settings.save()
- Parameters:
pattern (string) – the partitioning pattern
- set_connection_and_path(connection, path)#
Change the managed folder connection and/or path.
Note
When changing the connection or path, the folder’s files aren’t moved or copied to the new location
Attention
When changing the connection to a connection of a different type, for example going from an S3 connection to an Azure Blob Storage connection, only the managed folder type is changed. Type-specific fields are not converted. In the example of an S3 to Azure conversion, the S3 bucket isn’t converted to a storage account container.
- Parameters:
connection (string) – the name of a file-based connection. If None, the connection of the managed folder is left unchanged
path (string) – a path relative to the connection root. If None, the path of the managed folder is left unchanged