Load and re-use an NLTK tokenizer#
Natural Language Toolkit (NLTK) is a Python package for performing a variety of operations on text data. It relies on several pre-trained artifacts, such as word embeddings or tokenizers, that are not available out-of-the-box when you install the package: by default, you have to download them manually in your code.
Loading the tokenizer#
After creating your Python code environment with the required NLTK package (see the beginning of the tutorial), go to the Resources screen of your code environment, enter the following initialization script, then click on Update:
```python
## Base imports
import os
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## NLTK
# Set NLTK data directory
set_env_path("NLTK_DATA", "nltk_data")

# Import NLTK
import nltk

# Download model: automatically managed by NLTK, does not download
# anything if model is already in NLTK_DATA.
nltk.download('punkt', download_dir=os.environ["NLTK_DATA"])
```
This script will download the punkt tokenizer and store it on the Dataiku instance.
Note that the script only needs to run once. Once it has run successfully, all users allowed to use the code environment will be able to leverage the tokenizer without having to re-download it.
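This run-once behavior comes from NLTK checking its data directory before fetching anything. The caching logic can be illustrated with a small, self-contained sketch (plain Python, not the actual NLTK internals; `needs_download` is a hypothetical helper):

```python
import os
import tempfile

# Hypothetical sketch (not the real NLTK API): mimics how nltk.download()
# skips work by checking whether the artifact already sits in NLTK_DATA.
nltk_data_dir = tempfile.mkdtemp()
os.environ["NLTK_DATA"] = nltk_data_dir

punkt_dir = os.path.join(nltk_data_dir, "tokenizers", "punkt")

def needs_download(artifact_dir):
    """Return True when the artifact is absent and must be fetched."""
    return not os.path.isdir(artifact_dir)

print(needs_download(punkt_dir))   # True: empty cache, so the first run downloads
os.makedirs(punkt_dir)             # simulate a successful first download
print(needs_download(punkt_dir))   # False: later runs reuse the cached copy
```

Because the artifact lives under `NLTK_DATA` on the instance, every subsequent user of the code environment hits the second case.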
Using the tokenizer in your code#
You can now use your tokenizer in your Dataiku project’s Python recipe or notebook. Here is an example:
```python
import nltk

text = '''
Dataiku integrates with your existing infrastructure — on-premises or in the cloud.
It takes advantage of each technology’s native storage and computational layers.
Additionally, Dataiku provides a fully hosted SaaS option built for the modern cloud data stack.
With fully managed elastic AI powered by Spark and Kubernetes, you can achieve maximum performance and efficiency on large workloads.
'''

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n-----\n'.join(sent_detector.tokenize(text.replace('\n', ' ').strip())))
```
Running this code should give you an output similar to this:
```
Dataiku integrates with your existing infrastructure — on-premises or in the cloud.
-----
It takes advantage of each technology’s native storage and computational layers.
-----
Additionally, Dataiku provides a fully hosted SaaS option built for the modern cloud data stack.
-----
With fully managed elastic AI powered by Spark and Kubernetes, you can achieve maximum performance and efficiency on large workloads.
```
Using the same process, you can easily fetch and reuse any other artifact that NLTK requires for your text-processing tasks.