Using external libraries for projects#
Prerequisites#
- Dataiku >= 14
- Access to a Dataiku instance with a personal API key
- Access to an existing project with the following permissions:
  - “Read project content”
  - “Write project content”
- A GitHub account with a public SSH key. You need this to download a Python file from the Dataiku Academy Samples repository using SSH.
Note
Visit GitHub Docs to learn how to sign up for a GitHub account. For more information about adding a public SSH key to your account, visit GitHub Docs: Connecting to GitHub with SSH.
Introduction#
Developers benefit from collective knowledge when they code with others developing projects on the same Dataiku instance. One of the most common ways to access and share code in Dataiku is through project libraries. When the code you need is available in a Git repository, you can import it into your project library and share that library with other projects for maximum reusability. To learn more about Git and Dataiku, visit Working with Git and Importing code from Git in project libraries.
Unlike global shared Python code, a library imported directly from Git can be versioned efficiently and makes conflict resolution easier. It also keeps each project's library dedicated to that project, so the project is not cluttered with irrelevant libraries.
In this tutorial, you will create a single project library to share among projects.
You will use the Shared Code starter projects, Project A and Project B, as starting templates. To create them, click the New project button, select Learning projects, and choose Project A or Project B from the Developer subtype. If you prefer, you can download them from the provided links and install them manually.
Note
All steps in this tutorial that require code and API usage can be performed within Dataiku using a notebook or directly with the UI. For the latter, please see Concept | Introduction to shared code. For clarity, most of the steps and code assume you are interacting with the platform from an external IDE.
Creating a code library in Project A#
In this section, you will add the monthly_total_transactions Python function to the shared code library in Project A by cloning a library from the Dataiku Academy samples repository.
This function applies a group-by to a dataset and returns a new one.
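To give a sense of what such a function might look like, here is a minimal sketch of a monthly group-by in pandas. It is not the code from the Dataiku Academy samples repository: the function name monthly_totals_sketch and the column names order_date and amount are hypothetical, chosen only for illustration.

```python
import pandas as pd


def monthly_totals_sketch(df, date_col="order_date", amount_col="amount"):
    """Sum amount_col per calendar month of date_col (hypothetical column names)."""
    out = df.copy()
    # Parse the date column and reduce each timestamp to its month, e.g. "2024-01"
    out["month"] = pd.to_datetime(out[date_col]).dt.strftime("%Y-%m")
    # Group by month and sum the amounts, returning a new DataFrame
    return (
        out.groupby("month", as_index=False)[amount_col]
        .sum()
        .rename(columns={amount_col: "total_amount"})
    )


# Small in-memory example; a real project would read a Dataiku dataset instead
sample_df = pd.DataFrame(
    {
        "order_date": ["2024-01-05", "2024-01-20", "2024-02-03"],
        "amount": [10.0, 5.0, 7.5],
    }
)
monthly_df = monthly_totals_sketch(sample_df)
```

The input DataFrame is copied first, so the caller's data is left unchanged and the result is a new DataFrame with one row per month.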
The first thing to do is to connect to your instance and get the project.
Note
This repository is public, so it is fetched over HTTPS.
import dataiku
DSS_HOST = "" # Fill in your DSS instance's host and port, without the "http://" prefix
API_KEY = "" # Fill in your personal API key to access that instance
# Connect to your instance
dataiku.set_remote_dss(f"http://{DSS_HOST}", API_KEY)
client = dataiku.api_client()
# Work with the provided Project A
PROJECT_KEY_A = "DKU_TUT_SHARECODE_A" # Replace with the correct project key
project_a = client.get_project(PROJECT_KEY_A)
project_git = project_a.get_project_git()
# Add an external library to those already present
REPO_URL = 'https://github.com/dataiku/academy-samples.git'
GIT_PATH = 'shared-code'
project_git.add_library(REPO_URL, 'python', 'main', path_in_git_repository=GIT_PATH, as_type='object')
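The connection code above hardcodes the host and the API key. When working from an external IDE, one possible alternative is to read them from environment variables so the key stays out of source files. The sketch below is only an illustration: the variable names DSS_HOST and DSS_API_KEY and the helpers build_dss_url and load_dss_settings are hypothetical and do not come from Dataiku. The URL helper also accepts a host value that already includes a scheme.

```python
import os


def build_dss_url(host, default_scheme="http"):
    """Return a base URL, adding default_scheme only if host has no scheme."""
    host = host.strip().rstrip("/")
    if not host:
        raise ValueError("DSS host is empty")
    if "://" in host:
        return host
    return f"{default_scheme}://{host}"


def load_dss_settings(env=os.environ):
    """Read the instance URL and API key from (hypothetical) environment variables."""
    url = build_dss_url(env.get("DSS_HOST", ""))
    api_key = env.get("DSS_API_KEY", "")
    if not api_key:
        raise ValueError("DSS_API_KEY is not set")
    return url, api_key
```

The returned pair can then be passed to dataiku.set_remote_dss(url, api_key) in place of the hardcoded values.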
If you are working in a Jupyter notebook, note that you will need to force a reload to pick up the new library.
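Restarting the notebook kernel is the simplest way to do this. As an alternative, the reload can be sketched with the standard importlib module. The helper name force_reimport is hypothetical, and the sketch assumes the library folder is already on the Python path of the running session.

```python
import importlib
import sys


def force_reimport(module_name):
    """Import module_name, reloading it if the session already imported a copy."""
    # Drop cached directory listings so newly added files are visible to the import system
    importlib.invalidate_caches()
    if module_name in sys.modules:
        # A copy is already loaded: re-execute it from its current source
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)
```

For example, force_reimport("total_transactions") would refresh that module if an older copy had been imported earlier in the session. If the module cannot be found at all, a kernel restart is likely needed.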
You can then handle the project dataset and apply the new function.
# Import data manipulation packages
import pandas as pd
import numpy as np
# Import the function from the new library
from total_transactions import monthly_total_transactions
# Read recipe inputs
ecommerce_transactions = dataiku.Dataset("ecommerce_transactions")
ecommerce_transactions_df = ecommerce_transactions.get_dataframe()
# Apply new functions to our dataset
monthly_transactions_df = monthly_total_transactions(ecommerce_transactions_df)
Wrapping up#
This example showed how to use the API to import a library from a Git repository into a project and apply one of its functions to a project dataset. Because every step is scripted, the same approach can be used to share code across multiple projects, which makes that code easier to reuse within the platform.
Complete code#
Reference documentation#
Classes#
dataikuapi.DSSClient | Entry point for the DSS API client.
dataikuapi.dss.project.DSSProjectGit | Handle to manage the git repository of a DSS project (fetch, push, pull, ...).
dataikuapi.dss.project.DSSProject | A handle to interact with a project on the DSS instance.
dataikuapi.dss.projectlibrary.DSSLibrary | A handle to manage the library of a project. It saves a local copy of the taxonomy to help navigate the library. All modifications made through this object and related library items are applied both locally and remotely.
Functions#
add_library | Add a new external library to the project and pull it.
get_dataframe | Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.
get_file | Retrieves a file in the library.
get_library | Get a handle to manage the project library.
get_project | Get a handle to interact with a specific project.
get_project_git | Gets a handle to perform operations on the project's git repository.
