Load and re-use a SentenceTransformers word embedding model

Prerequisites

- Dataiku version >= 10.0.0.
- A Python >= 3.9 Code Environment with the following package:
  - sentence-transformers==2.2.2

Introduction

Natural Language Processing (NLP) use cases typically involve converting text to word embeddings. Training your own word embeddings on large text corpora is costly, so a popular option is to download pre-trained word embedding models and re-train them as needed.

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. It is built on PyTorch and Transformers and offers a large collection of pre-trained models.

In this tutorial, you will use Dataiku’s Code Environment resources feature to download and save pre-trained word embedding models from SentenceTransformers. You will then use one of those models to map a few sentences to embeddings.

Downloading the pre-trained word embedding model

The first step is to download the required assets for your pre-trained models. To do so, open the Resources screen of your Code Environment, enter the following initialization script, then click Update:

######################## Base imports #################################
import logging
import os
import shutil

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import grant_permissions
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
from dataiku.code_env_resources import update_models_meta

# Set-up logging
logging.basicConfig()
logger = logging.getLogger("code_env_resources")
logger.setLevel(logging.INFO)

# Clear all environment variables defined by a previously run script
clear_all_env_vars()

# Optionally restrict the GPUs this code environment can use (it can use all by default)
# set_env_var("CUDA_VISIBLE_DEVICES", "") # Hide all GPUs
# set_env_var("CUDA_VISIBLE_DEVICES", "0") # Allow only cuda:0
# set_env_var("CUDA_VISIBLE_DEVICES", "0,1") # Allow only cuda:0 & cuda:1

######################## Sentence Transformers #################################
# Set sentence_transformers cache directory
set_env_path("SENTENCE_TRANSFORMERS_HOME", "sentence_transformers")

import sentence_transformers

# Download pretrained models
MODELS_REPO_AND_REVISION = [
    ("DataikuNLP/average_word_embeddings_glove.6B.300d", "52d892b217016f53b6c717839bf62c746a658933"), 
    # Add other models you wish to download and make available as shown below (removing the # to uncomment):
    # ("bert-base-uncased", "b96743c503420c0858ad23fca994e670844c6c05"),
]

sentence_transformers_cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME")
for (model_repo, revision) in MODELS_REPO_AND_REVISION:
    logger.info("Loading pretrained SentenceTransformer model: {}".format(model_repo))
    model_path = os.path.join(sentence_transformers_cache_dir, model_repo.replace("/", "_"))

    # Skip the download if the model folder already exists.
    # Note: this also skips a model that is already present with a different revision.
    if not os.path.exists(model_path):
        model_path_tmp = sentence_transformers.util.snapshot_download(
            repo_id=model_repo,
            revision=revision,
            cache_dir=sentence_transformers_cache_dir,
            library_name="sentence-transformers",
            library_version=sentence_transformers.__version__,
            ignore_files=["flax_model.msgpack", "rust_model.ot", "tf_model.h5",],
        )
        os.rename(model_path_tmp, model_path)
    else:
        logger.info("Model already downloaded, skipping")

# Add text embedding models to the code-envs models meta-data
# (ensure that they are properly displayed in the feature handling)
update_models_meta()

# Grant everyone read access to pretrained models in sentence_transformers/ folder
# (by default, sentence transformers makes them only readable by the owner)
grant_permissions(sentence_transformers_cache_dir)

This script retrieves pre-trained models from SentenceTransformers and stores them on the Dataiku instance. To download more models, add them to the list along with their revision, which is how the model repository versions these models.

Note that the script will only need to run once. After that, all users allowed to use the Code Environment will be able to leverage the pre-trained models without having to re-download them.
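
The initialization script stores each model in a folder named after its repository, with "/" replaced by "_", under the directory that SENTENCE_TRANSFORMERS_HOME points to. As a minimal sketch under that assumption, the helper below rebuilds this path and fails with an explicit message when the folder is missing. The name resolve_model_path is hypothetical and not part of Dataiku or SentenceTransformers.

```python
import os


def resolve_model_path(repo_id, cache_dir=None):
    """Return the local folder of a model downloaded by the initialization script.

    Hypothetical helper: it assumes the script's naming convention, where "/"
    in the repository id is replaced with "_" under SENTENCE_TRANSFORMERS_HOME.
    """
    if cache_dir is None:
        cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME")
    if not cache_dir:
        raise RuntimeError(
            "SENTENCE_TRANSFORMERS_HOME is not set; check that the code "
            "environment resources script has run"
        )

    model_path = os.path.join(cache_dir, repo_id.replace("/", "_"))
    if not os.path.isdir(model_path):
        raise FileNotFoundError(
            "No model folder found at {}; add the model to the "
            "initialization script and update the code environment".format(model_path)
        )
    return model_path
```

In a recipe or notebook that runs in this Code Environment, the returned path could then be passed to SentenceTransformer, for example with resolve_model_path("DataikuNLP/average_word_embeddings_glove.6B.300d").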

Converting sentences to embeddings using your pre-trained model

You can now use those pre-trained models in a Python recipe or notebook of your Dataiku project. The following example uses the average_word_embeddings_glove.6B.300d model to map each sentence in a list to a vector in a 300-dimensional dense vector space.

import os
from sentence_transformers import SentenceTransformer

# Load pre-trained model
sentence_transformer_home = os.getenv('SENTENCE_TRANSFORMERS_HOME')
model_path = os.path.join(sentence_transformer_home, 'DataikuNLP_average_word_embeddings_glove.6B.300d')
model = SentenceTransformer(model_path)

sentences = ["I really like Ice cream", "Brussels sprouts are okay too"]

# Compute one embedding per sentence
embeddings = model.encode(sentences)
print(embeddings.shape)

Running this code should produce a NumPy array of shape (2, 300): one 300-dimensional vector of numerical values per sentence.
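
A common next step with such vectors is to compare two sentences through the cosine similarity of their embeddings. The sketch below is an illustration and not part of the original tutorial: cosine_similarity is a hypothetical pure-Python helper, and the toy 3-dimensional vectors are made-up values standing in for rows of embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for two rows of `embeddings`
# (hypothetical values, not real model output).
toy_a = [1.0, 0.0, 1.0]
toy_b = [1.0, 1.0, 0.0]
score = cosine_similarity(toy_a, toy_b)
```

With the array computed above, the same helper could be called as cosine_similarity(embeddings[0], embeddings[1]). Values close to 1 indicate vectors pointing in similar directions.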