Leveraging MOSTLY AI for Synthetic Data Generation
This tutorial shows how to generate a synthetic dataset with MOSTLY AI from within Dataiku.
Prerequisites
A MOSTLY AI account with an API key. You can sign up for a free account at MOSTLY.ai.
Python 3.9
A code environment with the following packages:
mostlyai
Note
To learn more about mostlyai and its installation, please see mostlyai.
Who is MOSTLY AI?
MOSTLY AI pioneered the creation of synthetic data for AI model development. Datasets generated by the MOSTLY AI platform look as real as a company’s original customer data with just as many details, but without the original personal data points – helping companies comply with privacy protection regulations such as GDPR and CCPA and ensuring that models are fair and unbiased.
Creating a Session
First, initialize a MOSTLY AI client session. Later, you will use the session to train a generator and create a synthetic dataset.
# initialize the MOSTLY client
from mostlyai import MostlyAI
mostly = MostlyAI(
    api_key='REPLACE_WITH_YOUR_MOSTLY_API_KEY',
    base_url='https://app.mostly.ai'
)
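A key hard-coded in a notebook or recipe is easy to leak when the code is shared. As a minimal sketch of one alternative, the helper below reads the key from an environment variable. Both the helper name `read_api_key` and the variable name `MOSTLY_API_KEY` are hypothetical, chosen for illustration only; neither is defined by MOSTLY AI or Dataiku.

```python
import os


def read_api_key(env_var: str = "MOSTLY_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The default variable name is an assumption made for this sketch.
    """
    # Treat a missing variable and a blank value the same way.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set or is empty."
        )
    return api_key
```

The returned value can then replace the literal string in the client call, for example `MostlyAI(api_key=read_api_key(), base_url='https://app.mostly.ai')`.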
Loading data into a DataFrame
Load the dataset that you will use to train a generator. You can use any dataset that you have available in your Dataiku project. Here, we use a US Census Income dataset.
# load a Dataiku dataset as a Pandas DataFrame
import dataiku

us_census_income = dataiku.Dataset("us_census_income")
us_census_income_df = us_census_income.get_dataframe()
Training a Generator
Train a generator on the dataset. The next step uses this generator to produce synthetic data.
# train a generator - for model creators
g = mostly.train(data=us_census_income_df, start=True, wait=True)
Generating a Synthetic Dataset
Use the generator trained in the previous step to create a synthetic dataset.
# use generator to create a synthetic dataset - for data consumers
sd = mostly.generate(g, size=200)
synth_df = sd.data()
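Before writing the result anywhere, it can be worth confirming that the synthetic DataFrame has the same columns and types as the original. The following is a minimal sketch of such a check, not part of the MOSTLY AI or Dataiku APIs: `compare_schemas` is a hypothetical helper that works on plain `{column name: dtype name}` mappings, and the toy schemas below are invented for illustration.

```python
def compare_schemas(original: dict, synthetic: dict) -> dict:
    """Compare two {column name: dtype name} mappings."""
    # Columns present in the original but absent from the synthetic data.
    missing = sorted(set(original) - set(synthetic))
    # Columns present only in the synthetic data.
    unexpected = sorted(set(synthetic) - set(original))
    # Shared columns whose dtype name differs: (original, synthetic).
    changed = {
        col: (original[col], synthetic[col])
        for col in original
        if col in synthetic and original[col] != synthetic[col]
    }
    return {
        "missing": missing,
        "unexpected": unexpected,
        "dtype_changed": changed,
    }


# Toy schemas, assumed for illustration only.
original_schema = {"age": "int64", "education": "object", "income": "object"}
synthetic_schema = {"age": "int64", "education": "object", "income": "object"}
report = compare_schemas(original_schema, synthetic_schema)
```

With pandas DataFrames, each mapping can be built with `df.dtypes.astype(str).to_dict()`, for example from `us_census_income_df` and `synth_df`. An empty report means the two schemas match; it says nothing about the statistical quality of the synthetic values.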
Writing a DataFrame to a Dataiku dataset
Finally, write the synthetic DataFrame to a dataset so that it can be used in your Dataiku Flow.
# write the synthetic df to a Dataiku dataset
output = dataiku.Dataset("synth_census_data")
output.write_with_schema(synth_df)