Load and re-use a spaCy named-entity recognition model#

Prerequisites#

Introduction#

Named-entity recognition (NER) locates named entities mentioned in unstructured text and classifies them into pre-defined categories such as person names, organizations, and locations. Training an NER model can be costly. Fortunately, you can rely on pre-trained models to perform this task.

In this tutorial, you will use Dataiku’s Code Environment resources to create a code environment with a spaCy pre-trained NER model.

Loading the pre-trained NER model#

After creating your Python code environment with the required spaCy package (see beginning of tutorial), you will download the required assets for your pre-trained model. To do so, in the Resources screen of your Code Environment, enter the following initialization script, then click Update:

## Base imports
from dataiku.code_env_resources import clear_all_env_vars

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## spaCy
# Import spaCy
import spacy

# Download the model: spaCy manages the download and installs the
# pipeline as a Python package.
spacy.cli.download("en_core_web_sm")

This script will download the spaCy English pipeline en_core_web_sm and store it on the Dataiku Instance. This pipeline contains the pre-trained NER model, among other NLP tools.

Note that the script only needs to run once. After that, all users allowed to use the Code Environment can leverage the NER model without re-downloading it.
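Because the download installs the pipeline as a regular Python package, one way to confirm that the resource script did its job is to check that the package can be imported from the code environment. The following is a minimal sketch using only the standard library. The helper name `pipeline_is_installed` is hypothetical and is not part of spaCy or Dataiku, and the check assumes the pipeline was installed under its own name.

```python
import importlib.util


def pipeline_is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported.

    Assumption: the spaCy pipeline was installed as a Python package
    under its own name (for example "en_core_web_sm").
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    name = "en_core_web_sm"
    status = "available" if pipeline_is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Run from a notebook that uses the code environment, this should report the pipeline as available once the initialization script has completed. spaCy's own `spacy.util.is_package` offers a comparable check if you prefer to stay within the library.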

Performing NER using your pre-trained model#

You can now use your pre-trained model in your Dataiku Project’s Python Recipe or notebook to perform NER on some text. Here is an example:

import spacy

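# Load the pipeline installed through the code environment resources
nlp = spacy.load("en_core_web_sm")

# Run the pipeline on the text; the result is a spaCy Doc
doc = nlp("""
A call for American independence from Britain,
the Virginia Declaration of Rights was drafted
by George Mason in May 1776""")

# doc.ents holds the entity spans found by the NER component
for ent in doc.ents:
    print(f"{ent.text} --> {ent.label_}")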

Running this code should give you an output similar to this:

American --> NORP
Britain --> GPE
the Virginia Declaration of Rights --> ORG
George Mason --> PERSON
May 1776 --> DATE
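
When the entities feed a downstream step rather than a print statement, it can help to collect them by label. The sketch below assumes only that the object passed in exposes `ents`, whose items have `text` and `label_` attributes, as a spaCy `Doc` does. The helper name `entities_by_label` is hypothetical.

```python
from collections import defaultdict


def entities_by_label(doc):
    """Group entity texts by label, preserving document order.

    `doc` is expected to behave like a spaCy Doc: it has an `ents`
    iterable whose items carry `text` and `label_` attributes.
    """
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    # Return a plain dict so missing labels raise KeyError as usual
    return dict(grouped)
```

In a recipe or notebook, this would be called as `entities_by_label(nlp(some_text))`, where `nlp` is the loaded `en_core_web_sm` pipeline and `some_text` is any string you want to analyze.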