Load and re-use an NLTK tokenizer#

Prerequisites#

To follow this tutorial, you need access to a Dataiku instance, the permission to create code environments on it, and a Python code environment that includes the nltk package (its creation is covered at the beginning of the tutorial).

Introduction#

The Natural Language Toolkit (NLTK) is a Python package for performing a variety of operations on text data. Many of its features rely on pre-trained artifacts, such as word embeddings or tokenizers, that are not available out of the box when you install the package: by default, you have to download them manually in your code.
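
For reference, outside of Dataiku this manual step typically looks like the following snippet, with each user or script downloading the tokenizer into its own local directory (usually ~/nltk_data):

import nltk

# Download punkt into the default per-user directory (usually
# ~/nltk_data); the download is skipped if it is already present.
nltk.download('punkt')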

In this tutorial, you will use Dataiku’s code environment resources to create a code environment and add the punkt sentence tokenizer to it.

Loading the tokenizer#

After creating your Python code environment with the required NLTK package (see the beginning of the tutorial), go to the Resources screen of your Code Environment, input the following initialization script, then click Update:

## Base imports
import os
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path

# Clear all environment variables defined by a previously run script
clear_all_env_vars()

## NLTK
# Set NLTK data directory
set_env_path("NLTK_DATA", "nltk_data")

# Import NLTK
import nltk

# Download the model: the download is managed by NLTK and is skipped
# if the model is already present in NLTK_DATA.
nltk.download('punkt', download_dir=os.environ["NLTK_DATA"])

This script downloads the punkt tokenizer and stores it on the Dataiku instance. It only needs to run once: after it has run successfully, all users allowed to use the code environment can leverage the tokenizer without having to re-download it.
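
If you want to confirm that the resource is visible from code using this environment, you can resolve it through NLTK's data path. This optional check (not part of the initialization script) raises a LookupError if punkt is missing:

import nltk

# nltk.data.path includes the NLTK_DATA directory set by the
# initialization script; find() raises LookupError if punkt is absent.
print(nltk.data.find('tokenizers/punkt'))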

Using the tokenizer in your code#

You can now use the tokenizer in a Python recipe or notebook within your Dataiku project. Here is an example:

import nltk

text = '''
Dataiku integrates with your existing infrastructure — on-premises or in the cloud. It takes advantage of 
each technology’s native storage and computational layers. Additionally, Dataiku provides 
a fully hosted SaaS option built for the modern cloud data stack. With fully 
managed elastic AI powered by Spark and Kubernetes, you can achieve maximum performance 
and efficiency on large workloads.
'''

# Load the punkt English sentence tokenizer, resolved through NLTK_DATA
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Collapse newlines into spaces, split the text into sentences, and
# print them separated by a visual marker
print('\n-----\n'.join(sent_detector.tokenize(text.replace('\n', ' ').strip())))

Running this code should give you an output similar to this:

Dataiku integrates with your existing infrastructure — on-premises or in the cloud.
-----
It takes advantage of  each technology’s native storage and computational layers.
-----
Additionally, Dataiku provides  a fully hosted SaaS option built for the modern cloud data stack.
-----
With fully  managed elastic AI powered by Spark and Kubernetes, you can achieve maximum performance  and efficiency on large workloads.
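
Note that the same punkt model also backs NLTK's sent_tokenize convenience function, so as a variant of the example above you can split sentences without loading the pickle explicitly:

import nltk

text = "Dataiku integrates with your existing infrastructure. It also provides a fully hosted SaaS option."

# sent_tokenize loads the punkt English model behind the scenes,
# resolving it through nltk.data.path (which honors NLTK_DATA).
for sentence in nltk.sent_tokenize(text):
    print(sentence)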

Following the same process, you can fetch and reuse any other artifact that NLTK requires for your text-processing tasks.
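
For example, to bundle NLTK's stopword lists alongside punkt, you could add one line to the same resources initialization script (stopwords is another standard NLTK resource; the rest of the script stays unchanged):

# Added to the resources script above: fetch the stopword lists
# into NLTK_DATA alongside punkt.
nltk.download('stopwords', download_dir=os.environ["NLTK_DATA"])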