Load and re-use a Hugging Face model#
Machine learning use cases can involve a lot of input data and compute-heavy, therefore expensive, model training. It is common to download pre-trained models from remote repositories and use them instead. Hugging Face hosts a well-known repository with models ranging from text generation to image embedding to reasoning. In this tutorial, you will see how you can leverage Dataiku functionality to download and save a pre-trained model. You will then re-use that model to predict a masked word in a sentence.
Loading a model leveraging the model cache (recommended)#
In this section, you will use Dataiku’s Model Cache to download, save, and retrieve your Hugging Face model.
Downloading the pre-trained model#
The first step is to download the required assets for your pre-trained model. Run this snippet of code anywhere in DSS (for example in a Python recipe, in a Notebook, or even in an initialization script of a code environment):
from dataiku.core.model_provider import download_model_to_cache
download_model_to_cache("distilbert/distilbert-base-uncased")
This script retrieves a DistilBERT model from Hugging Face and stores it in the Dataiku Instance model cache.
Note that the model must have been enabled by your Dataiku admin in a HuggingFace connection. For gated models, you must pass a connection name so that its HuggingFace authentication token is used.
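For instance, here is a hedged sketch of what downloading a gated model could look like. The model name and connection name below are placeholders, and the exact keyword argument used to pass the connection is an assumption, so check the dataiku.core.model_provider API reference for your Dataiku version:
from dataiku.core.model_provider import download_model_to_cache

# Hypothetical example: both values below are placeholders, and the connection
# keyword argument is an assumption -- verify the exact signature in the
# dataiku.core.model_provider API reference before relying on it.
download_model_to_cache(
    "meta-llama/Llama-3.1-8B-Instruct",           # example of a gated model
    connection_name="my-huggingface-connection",  # HuggingFace connection holding the token
)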
Using the pre-trained model for inference#
Prerequisites#
Dataiku >= 14.2.0
Python >= 3.9
A Code Environment with the following packages:
transformers  # tested with 4.51.3
torch  # tested with 2.7.0
You can now re-use this pre-trained model in your Dataiku Project’s Python Recipe or notebook. Here is an example, adapted from a sample in the model repository, that fills the masked part of a sentence with the most likely words:
import os
from transformers import pipeline
from dataiku.core.model_provider import get_model_from_cache
model = get_model_from_cache("distilbert/distilbert-base-uncased")
# predict masked output
unmask = pipeline("fill-mask", model=model)
input_sentence = "Lend me your ears and I'll sing you a [MASK]"
resp = unmask(input_sentence)
for r in resp:
    print(f"{r['sequence']} ({r['score']})")
Running this code should give you an output similar to this:
lend me your ears and i ' ll sing you a lullaby (0.29884061217308044)
lend me your ears and i ' ll sing you a tune (0.10296323150396347)
lend me your ears and i ' ll sing you a song (0.10061406344175339)
lend me your ears and i ' ll sing you a hymn (0.09704922884702682)
lend me your ears and i ' ll sing you a cappella (0.034581173211336136)
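The fill-mask pipeline also accepts a list of sentences, which is convenient when you have several records to score in one go. Here is a minimal sketch reusing the unmask pipeline created above; the top_k argument limits the number of candidates returned per sentence:
# Score several masked sentences in a single call; the result is one list of
# candidate completions per input sentence.
sentences = [
    "Lend me your ears and I'll sing you a [MASK]",
    "Paris is the capital of [MASK].",
]
results = unmask(sentences, top_k=3)
for sentence, candidates in zip(sentences, results):
    print(sentence)
    for c in candidates:
        print(f"  {c['sequence']} ({c['score']})")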
Loading a model using code environment resources#
In this section, you will use Dataiku’s Code Environment Resources feature to download and save a pre-trained model from Hugging Face. Be careful when using large models in code environment resources: the full model weights are packaged with your environment, which can easily represent dozens of GB.
Prerequisites#
Python >= 3.9
A Code Environment with the following packages:
transformers  # tested with 4.54.1
torch  # tested with 2.7.1
Downloading the pre-trained model#
The first step is to download the required assets for your pre-trained model. To do so, in the Resources screen of your Code Environment, input the following initialization script then click on Update:
## Base imports
import os
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
# Clears all environment variables defined by previously run script
clear_all_env_vars()
## Hugging Face
# Set HuggingFace cache directory
set_env_path("HF_HOME", "huggingface")
set_env_path("TRANSFORMERS_CACHE", "huggingface/transformers")
hf_home_dir = os.getenv("HF_HOME")
transformers_home_dir = os.getenv("TRANSFORMERS_CACHE")
# Import Hugging Face's transformers
import transformers
# Download pre-trained models
model_name = "distilbert-base-uncased"
MODEL_REVISION = "1c4513b2eedbda136f57676a34eea67aba266e5c"
model = transformers.DistilBertModel.from_pretrained(model_name, revision=MODEL_REVISION)
unmasker = transformers.DistilBertForMaskedLM.from_pretrained(model_name, revision=MODEL_REVISION)
tokenizer = transformers.DistilBertTokenizer.from_pretrained(model_name, revision=MODEL_REVISION)
# Grant everyone read access to pre-trained models in the HF_HOME folder
# (by default, only readable by the owner)
grant_permissions(hf_home_dir)
grant_permissions(transformers_home_dir)
This script retrieves a DistilBERT model from Hugging Face and stores it in the Dataiku Instance.
Note that it only needs to run once; after that, all users allowed to use the Code Environment will be able to leverage the pre-trained model without re-downloading it.
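As a quick check, you can list the cache directory from any notebook or recipe running on this Code Environment to confirm that the model files are present. A minimal sketch:
import os

# HF_HOME is set by the code environment resources initialization script;
# its "transformers" sub-folder should contain the cached DistilBERT model.
hf_home_dir = os.getenv("HF_HOME")
print(os.listdir(os.path.join(hf_home_dir, "transformers")))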
Using the pre-trained model for inference#
You can now re-use this pre-trained model in your Dataiku Project’s Python Recipe or notebook. Here is an example, adapted from a sample in the model repository, that fills the masked part of a sentence with the most likely words:
import os
from transformers import pipeline
from transformers import DistilBertTokenizer, DistilBertForMaskedLM
# Define which pre-trained model to use
model = {"name": "distilbert-base-uncased",
"revision": "1c4513b2eedbda136f57676a34eea67aba266e5c"}
# Load pre-trained model
hf_home_dir = os.getenv("HF_HOME")
model_path = os.path.join(hf_home_dir,
                          f"transformers/models--{model['name']}/snapshots/{model['revision']}")
unmasker = DistilBertForMaskedLM.from_pretrained(model_path, local_files_only=True)
tokenizer = DistilBertTokenizer.from_pretrained(model_path, local_files_only=True)
# predict masked output
unmask = pipeline("fill-mask", model=unmasker, tokenizer=tokenizer)
input_sentence = "Lend me your ears and I'll sing you a [MASK]"
resp = unmask(input_sentence)
for r in resp:
    print(f"{r['sequence']} ({r['score']})")
Running this code should give you an output similar to this:
lend me your ears and i'll sing you a lullaby (0.29883989691734314)
lend me your ears and i'll sing you a tune (0.10296259075403214)
lend me your ears and i'll sing you a song (0.10061296075582504)
lend me your ears and i'll sing you a hymn (0.09704853594303131)
lend me your ears and i'll sing you a cappella (0.034581124782562256)
Wrapping up#
Your pre-trained model is now operational! From there you can easily reuse it, e.g. to process multiple text records stored in a Managed Folder or within a text column of a Dataset.
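For example, here is a minimal sketch that scores a text column of a Dataset with the fill-mask pipeline from the first section. The dataset names (reviews, scored_reviews) and the column name (text) are placeholders for your own project, and each sentence in the column is assumed to contain a [MASK] token:
import dataiku
from transformers import pipeline
from dataiku.core.model_provider import get_model_from_cache

# Placeholders: replace "reviews", "scored_reviews" and "text" with the names
# used in your own project.
input_dataset = dataiku.Dataset("reviews")
output_dataset = dataiku.Dataset("scored_reviews")

model = get_model_from_cache("distilbert/distilbert-base-uncased")
unmask = pipeline("fill-mask", model=model)

df = input_dataset.get_dataframe()
# Each sentence in the "text" column must contain a [MASK] token;
# keep only the top prediction for each of them.
df["prediction"] = df["text"].apply(lambda s: unmask(s, top_k=1)[0]["sequence"])

output_dataset.write_with_schema(df)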
