Load and re-use a Hugging Face model#

Prerequisites#

Introduction#

Machine learning use cases can involve large amounts of input data and compute-heavy, thus expensive, model training. It is therefore common to download pre-trained models from remote repositories and use them instead. Hugging Face hosts a well-known repository of models for image and text processing. In this tutorial, you will use Dataiku’s Code Environment resources feature to download and save a pre-trained masked language model from Hugging Face. You will then re-use that model to predict a masked word in a sentence.

Downloading the pre-trained model#

The first step is to download the required assets for your pre-trained model. To do so, go to the Resources screen of your Code Environment, input the following initialization script, then click on Update:

## Base imports
import os

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clear all environment variables defined by a previously run script
clear_all_env_vars()

## Hugging Face
# Set Hugging Face cache directories
set_env_path("HF_HOME", "huggingface")
set_env_path("TRANSFORMERS_CACHE", "huggingface/transformers")
hf_home_dir = os.getenv("HF_HOME")
transformers_home_dir = os.getenv("TRANSFORMERS_CACHE")

# Import Hugging Face's transformers
import transformers

# Download pre-trained models
model_name = "distilbert-base-uncased"
MODEL_REVISION = "1c4513b2eedbda136f57676a34eea67aba266e5c"
model = transformers.DistilBertModel.from_pretrained(model_name, revision=MODEL_REVISION)
unmasker = transformers.DistilBertForMaskedLM.from_pretrained(model_name, revision=MODEL_REVISION)
tokenizer = transformers.DistilBertTokenizer.from_pretrained(model_name, revision=MODEL_REVISION)

# Grant everyone read access to pre-trained models in the HF_HOME folder
# (by default, only readable by the owner)
grant_permissions(hf_home_dir)
grant_permissions(transformers_home_dir)

This script will retrieve a DistilBERT model from Hugging Face and store it on the Dataiku instance.

Note that it only needs to run once; after that, all users allowed to use the Code Environment will be able to leverage the pre-trained model without having to re-download it.
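
If you want to double-check that the assets were properly downloaded, you can, for example, list the contents of the cache from a notebook using that Code Environment. This is only a quick sanity check; the exact directory layout may vary across transformers versions:

import os

# HF_HOME is set by the initialization script above
hf_home_dir = os.getenv("HF_HOME")

# Print every file cached under the transformers subdirectory
for root, _, files in os.walk(os.path.join(hf_home_dir, "transformers")):
    for file_name in files:
        print(os.path.join(root, file_name))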

Using the pre-trained model for inference#

You can now re-use this pre-trained model in a Python Recipe or notebook within your Dataiku Project. Here is an example, adapted from a sample in the model repository, that fills the masked part of a sentence with an appropriate word:

import os

from transformers import pipeline
from transformers import DistilBertTokenizer, DistilBertForMaskedLM


# Define which pre-trained model to use
model = {"name": "distilbert-base-uncased",
         "revision": "1c4513b2eedbda136f57676a34eea67aba266e5c"}

# Load pre-trained model
hf_home_dir = os.getenv("HF_HOME")
model_path = os.path.join(hf_home_dir,
                          f"transformers/models--{model['name']}/snapshots/{model['revision']}")
unmasker = DistilBertForMaskedLM.from_pretrained(model_path, local_files_only=True)
tokenizer = DistilBertTokenizer.from_pretrained(model_path, local_files_only=True)

# Predict the most likely completions for the masked token
unmask = pipeline("fill-mask", model=unmasker, tokenizer=tokenizer)
input_sentence = "Lend me your ears and I'll sing you a [MASK]"
resp = unmask(input_sentence)
for r in resp:
    print(f"{r['sequence']} ({r['score']})")

Running this code should give you an output similar to this:

lend me your ears and i'll sing you a lullaby (0.29883989691734314)
lend me your ears and i'll sing you a tune (0.10296259075403214)
lend me your ears and i'll sing you a song (0.10061296075582504)
lend me your ears and i'll sing you a hymn (0.09704853594303131)
lend me your ears and i'll sing you a cappella (0.034581124782562256)
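
If you only need the single best completion, the fill-mask pipeline accepts a top_k argument in recent transformers versions (older releases named it topk), for instance:

# Keep only the most likely completion
top_prediction = unmask(input_sentence, top_k=1)
print(top_prediction[0]["sequence"])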

Wrapping up#

Your pre-trained model is now operational! From there, you can easily reuse it, for example to process multiple text records stored in a Managed Folder or in a text column of a Dataset.
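
As an illustration, here is a minimal sketch of a Python Recipe that applies the unmasking pipeline to a text column of an input Dataset and writes the most likely completion to an output Dataset. The Dataset names (masked_sentences, unmasked_sentences) and the column name (text) are placeholders to adapt to your own Project:

import os

import dataiku
from transformers import pipeline
from transformers import DistilBertTokenizer, DistilBertForMaskedLM

# Rebuild the fill-mask pipeline from the local cache, as above
model = {"name": "distilbert-base-uncased",
         "revision": "1c4513b2eedbda136f57676a34eea67aba266e5c"}
model_path = os.path.join(os.getenv("HF_HOME"),
                          f"transformers/models--{model['name']}/snapshots/{model['revision']}")
unmask = pipeline("fill-mask",
                  model=DistilBertForMaskedLM.from_pretrained(model_path, local_files_only=True),
                  tokenizer=DistilBertTokenizer.from_pretrained(model_path, local_files_only=True))

# Read the input Dataset into a pandas DataFrame
df = dataiku.Dataset("masked_sentences").get_dataframe()

# Keep only the most likely completion for each sentence
df["prediction"] = df["text"].apply(lambda s: unmask(s)[0]["sequence"])

# Write the enriched DataFrame to the output Dataset
dataiku.Dataset("unmasked_sentences").write_with_schema(df)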