Leveraging MOSTLY AI for Synthetic Data Generation

This tutorial shows how to use MOSTLY AI within Dataiku to train a generator on an existing dataset and create a synthetic dataset from it.

Prerequisites

  • A MOSTLY AI account with an API Key. You can sign up for a free account at MOSTLY.ai.

  • Python 3.9

  • A code environment with the following packages:

    mostlyai
    

Note

To learn more about the mostlyai package and how to install it, see the mostlyai documentation.

Who is MOSTLY AI?

MOSTLY AI pioneered the creation of synthetic data for AI model development. Datasets generated by the MOSTLY AI platform look as real as a company’s original customer data with just as many details, but without the original personal data points – helping companies comply with privacy protection regulations such as GDPR and CCPA and ensuring that models are fair and unbiased.

Creating a Session

First, initialize a MOSTLY AI client session. Later, you will use the session to train a generator and create a synthetic dataset.

# initialize the MOSTLY client
from mostlyai import MostlyAI

mostly = MostlyAI(
    api_key='REPLACE_WITH_YOUR_MOSTLY_API_KEY', 
    base_url='https://app.mostly.ai'
)
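
Hardcoding the key in a notebook or recipe makes it easy to leak. As a minimal sketch, assuming the key is stored in an environment variable (the name MOSTLY_API_KEY and the helper read_api_key are illustrative choices, not part of this tutorial), the key can be read at runtime, failing early if it is missing:

```python
import os


def read_api_key(var_name="MOSTLY_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use
    whatever name your own deployment defines.
    """
    # Treat an unset variable and a blank value the same way.
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set or is empty")
    return key
```

The returned value can then be passed as api_key=read_api_key() when constructing the MostlyAI client.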

Loading data into a DataFrame

Load the dataset that you will use to train the generator. You can use any dataset available in your Dataiku project. This tutorial uses a US Census Income dataset.

# load a Dataiku dataset as a Pandas DataFrame
import dataiku

us_census_income = dataiku.Dataset("us_census_income")
us_census_income_df = us_census_income.get_dataframe()

Training a Generator

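Train a generator on the DataFrame. The next step uses this generator to create synthetic data. With start=True and wait=True, the call starts training immediately and returns only once training has finished.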

# train a generator - for model creators
g = mostly.train(data=us_census_income_df, start=True, wait=True)

Generating a Synthetic Dataset

Use the generator trained in the previous step to create a synthetic dataset. Here, size=200 requests 200 synthetic records, which are then retrieved as a Pandas DataFrame.

# use generator to create a synthetic dataset - for data consumers
sd = mostly.generate(g, size=200)
synth_df = sd.data()
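
Before writing the result anywhere, it can be worth checking that the synthetic data has the same columns as the training data. The helper below is a hypothetical sketch (compare_columns is not part of mostlyai or Dataiku). It only compares column names, so it says nothing about how closely the synthetic values resemble the original ones:

```python
def compare_columns(original_columns, synthetic_columns):
    """Report column names present in one dataset but not the other.

    Accepts any iterable of column names, for example a list or the
    .columns attribute of a Pandas DataFrame.
    """
    original = list(original_columns)
    synthetic = list(synthetic_columns)
    return {
        # Columns the generator was trained on but that did not come back.
        "missing_in_synthetic": [c for c in original if c not in synthetic],
        # Columns in the synthetic data that were not in the training data.
        "extra_in_synthetic": [c for c in synthetic if c not in original],
    }
```

In this tutorial the call would be compare_columns(us_census_income_df.columns, synth_df.columns). Two empty lists in the result mean the column names match.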

Writing a DataFrame to a Dataiku dataset

Lastly, write the DataFrame to a dataset to be used within your Dataiku Flow.

# write the synthetic df to a Dataiku dataset
output = dataiku.Dataset("synth_census_data")
output.write_with_schema(synth_df)