GPT-based zero-shot text classification with the OpenAI API#

Tip

As of version 12.3, Dataiku’s LLM mesh features enhance the user experience by providing oversight, governance and centralization of LLM-powered capabilities. Please refer to this tutorial for an LLM-mesh-oriented example of a zero-shot classification problem.

The OpenAI API offers powerful tools that enable data scientists to integrate cutting-edge natural language processing (NLP) capabilities into their applications. In particular, it exposes its latest large language models (LLM) from the GPT family to be easily queried. Combining these tools with the coder-oriented features of Dataiku further empowers the platform users to configure and run NLP tasks in a project.

In this tutorial, you will cover the basics of using the OpenAI API within Dataiku and apply a GPT model to a text classification problem on movie reviews.

Prerequisites#

  • Dataiku >= 11.4

  • “Use” permission on a code environment using Python >= 3.9 with the following packages:

    • openai (tested with version 0.27.7)

  • Access to an existing project with the following permissions:

    • “Read project content”

    • “Write project content”

  • A valid OpenAI secret key

Setting up the OpenAI API client#

The code environment you will be using comes with the official Python client for the OpenAI API. This package provides a convenient set of helpers on top of OpenAI’s REST APIs, avoiding the need for writing low-level HTTP calls.

Authentication#

The OpenAI API requires authentication through a secret key that you can retrieve here once logged in to your account. Once you have copied the key, it is time to configure your environment to use it.

Instead of using environment variables, as explained in OpenAI’s documentation, you will rely on a valuable feature of Dataiku called user secrets. You can set it up with the public API by following the instructions here and read more about it in the reference documentation.

Configuration & initial tests#

In this part, you will populate your project library with essential building blocks, making your code modular and easily reusable.

Go to your project library, and under python/, create a new directory called gpt_utils. Inside that directory, create two files:

  • __init__.py that should be left empty

  • auth.py that will fetch the OpenAI API secret key from the user secrets:

auth.py#
import dataiku

def get_api_key(secret_name: str = "openai_api_key") -> str:
    """Fetch the OpenAI API key from the current user's secrets."""
    client = dataiku.api_client()
    auth_info = client.get_auth_info(with_secrets=True)
    secret_value = None
    for secret in auth_info["secrets"]:
        if secret["key"] == secret_name:
            secret_value = secret["value"]
            break
    if not secret_value:
        raise Exception("OpenAI secret key not found")
    return secret_value

You can move on to the fun part: calling the GPT model! In practice, you will use OpenAI’s Chat Completions API, which allows you to design advanced prompts with the latest available models. In short, your input will consist of a list of messages, each with a specific role:

  • system messages define how the model should behave

  • user messages contain the requests or comments provided by the user to which the model should respond

From the gpt_utils directory, create a new file called chat.py with the following code:

chat.py#
import json
from typing import Dict, List

import openai

from .auth import get_api_key

DEFAULT_TEMPERATURE = 0
DEFAULT_MAX_TOKENS = 500
DEFAULT_MODEL = "gpt-3.5-turbo"

openai.api_key = get_api_key("openai_api_key")

def send_prompt(prompt: str, model: str = DEFAULT_MODEL) -> str:
    messages = [
        {"role": "user", "content": prompt}
    ]
    resp = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=DEFAULT_TEMPERATURE,
        max_tokens=DEFAULT_MAX_TOKENS
    )
    answer = resp.choices[0].message["content"]
    return answer

This first simple function will allow you to ask questions to the GPT model as if it were a generic assistant. You can try it in a notebook:

from gpt_utils.chat import send_prompt
question = "When was the movie Citizen Kane released?"
answer = send_prompt(question)
print(answer)
# > 'The movie Citizen Kane was released on September 5, 1941.'

To gain more flexibility over the model input, you can generalize the send_prompt() function and pass the list of messages directly. In practice, this means providing additional context to the model about what it should know and how it should respond. To do so, add the following function to chat.py:

chat.py#
def send_prompt_with_context(messages: List[Dict], model: str = DEFAULT_MODEL) -> str:
    resp = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=DEFAULT_TEMPERATURE,
        max_tokens=DEFAULT_MAX_TOKENS
    )
    answer = resp.choices[0].message["content"]
    return answer

You can try a more playful variant of the previous question in your notebook:

from gpt_utils.chat import send_prompt_with_context

question = "When was the movie Citizen Kane released?"

system_msg = """You are an expert in the history of American cinema.
You always answer questions with a lot of passion and enthusiasm.
"""
messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": f"Question: {question}"}
]
other_answer = send_prompt_with_context(messages)
print(other_answer)
# > "Oh, Citizen Kane! What a masterpiece! It was released on September 5, 1941. Directed by Orson Welles, 
# it is widely considered one of the greatest films ever made and a landmark achievement in American cinema. 
# The film's innovative storytelling techniques, stunning cinematography, and powerful performances have 
# influenced countless filmmakers and continue to captivate audiences to this day."

This example, while being fun, also unveils the potential of such models: with the proper instructions and context, they can perform a wide variety of tasks based on natural language! In the next section, you will use this versatility to customize your prompt and turn the GPT model into a text classifier.

Classifying movie reviews#

The following example will rely on an extract from the Large Movie Review Dataset. Download the file here and use it to create a dataset in your project called reviews. In this dataset, there are two columns of interest:

  • text contains the reviews to be analyzed

  • polarity reflects the review sentiment: 0 for negative, 1 for positive

Next, go back to the project library, and in chat.py add the following function:

chat.py#
def predict_and_explain_review_sentiment(review: str) -> Dict[str, str]:
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment. \
    Respond in JSON format with the keys: gpt_sentiment and gpt_explanation. \
    The value for gpt_sentiment should only be either pos or neg without punctuation: pos if the review is positive, neg otherwise. \
    The value for gpt_explanation should be a very short explanation for the sentiment.
    """
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": f"Review: {review}"}
    ]
    pred = send_prompt_with_context(messages)
    return json.loads(pred)

Note that the system message was carefully customized to align the model with the task at hand, telling it exactly what to do and how to format the output. Crafting and iteratively adjusting the model’s input to guide it toward the desired response is known as prompt engineering.
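Because the model replies with free text, the json.loads() call in predict_and_explain_review_sentiment() will fail if the reply is not exactly the requested JSON. Below is a defensive sketch you could adapt; the parse_sentiment_output() helper and its fallback logic are illustrative and not part of the tutorial’s project library:

```python
import json

ALLOWED_SENTIMENTS = {"pos", "neg"}

def parse_sentiment_output(raw: str) -> dict:
    """Parse the model's reply and validate the expected keys/values.

    Falls back to extracting the first {...} block, in case the model
    wrapped the JSON in extra text.
    """
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError(f"No JSON object found in model output: {raw!r}")
        out = json.loads(raw[start:end + 1])
    if out.get("gpt_sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"Unexpected sentiment value: {out.get('gpt_sentiment')!r}")
    return out
```

A check like this lets the recipe fail (or retry) on a single malformed reply instead of writing a bad row.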

In order to test your function, you will run it on a small sample of the reviews dataset. For that, create a Python recipe that outputs a single dataset called reviews_sample_gpt_scored with the following code:

recipe.py#
import dataiku

from gpt_utils.chat import predict_and_explain_review_sentiment

SSIZE = 10  # number of reviews to sample and score

input_dataset = dataiku.Dataset("reviews")
new_cols = [
    {"type": "string", "name": "gpt_sentiment"},
    {"type": "string", "name": "gpt_explanation"}
]
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("reviews_sample_gpt_scored")
output_dataset.write_schema(output_schema)

cnt = 0
with output_dataset.get_writer() as w:
    for r in input_dataset.iter_rows():
        gpt_out = predict_and_explain_review_sentiment(r.get("text"))
        w.write_row_dict({**dict(r), **gpt_out})
        cnt += 1
        if cnt == SSIZE:
            break

This recipe will read the input dataset line by line and iteratively send each review text to GPT to retrieve:

  • the inferred sentiment (pos or neg)

  • a short explanation of why the review is good or bad

Once the output dataset is built, you can compare the values of the polarity and gpt_sentiment columns, which should match closely: your classifier is doing well! The gpt_explanation column should also give you quick insight into how the model understood each review.
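To go beyond eyeballing the comparison, you can compute the agreement between the two columns. Here is a minimal sketch; the accuracy() helper and LABEL_MAP are illustrative, assuming polarity values in {0, 1} and gpt_sentiment values in {pos, neg} as described above:

```python
# Map the dataset's numeric polarity onto the labels requested from GPT.
LABEL_MAP = {0: "neg", 1: "pos"}

def accuracy(polarities, gpt_sentiments):
    """Fraction of rows where the GPT label matches the ground-truth polarity."""
    pairs = list(zip(polarities, gpt_sentiments))
    hits = sum(1 for pol, pred in pairs if LABEL_MAP[pol] == pred)
    return hits / len(pairs)
```

In a notebook, you could load the scored dataset with dataiku.Dataset("reviews_sample_gpt_scored").get_dataframe() and pass the two columns to this helper.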

This technique is called zero-shot classification since it relies on the model’s ability to understand relationships between words and concepts without being specifically trained on labeled data.

Warning

While LLMs show promising capabilities to understand and generate human-like text, they can also sometimes create outputs with pieces of information or details that aren’t accurate or factual. These mistakes are known as hallucinations and can arise due to the following:

  • limitations and biases in the model’s training data

  • the model’s inherent tendency to reproduce statistical patterns rather than perform genuine language understanding or reasoning

To mitigate their impact, you should always review any model output that would be part of a critical decision-making process.

Wrapping up#

Congratulations! You have completed this tutorial and gained valuable insights into basic coding features in Dataiku and the OpenAI API. By understanding the basic concepts of language-based generative AI and the relevant tools in Dataiku to leverage them, you are now ready to tackle more complex use cases.

If you want to further experiment beyond this tutorial you can, for example:

  • Increase the sample size in the recipe by changing the value of SSIZE. By doing so, you should be able to get a decent-sized scored dataset on which you can adequately evaluate the predictive performance of your classifier with metrics such as accuracy, precision or F1-score

  • Tweak the prompt to improve performance or get more specific explanations

If you want a high-level introduction to LLMs in the context of Dataiku, check out this guide.

Here are the complete versions of the code presented in this tutorial:

auth.py
import dataiku

def get_api_key(secret_name: str = "openai_api_key") -> str:
    """Fetch the OpenAI API key from the current user's secrets."""
    client = dataiku.api_client()
    auth_info = client.get_auth_info(with_secrets=True)
    secret_value = None
    for secret in auth_info["secrets"]:
        if secret["key"] == secret_name:
            secret_value = secret["value"]
            break
    if not secret_value:
        raise Exception("OpenAI secret key not found")
    return secret_value
chat.py
import json
from typing import Dict, List

import openai

from .auth import get_api_key

DEFAULT_TEMPERATURE = 0
DEFAULT_MAX_TOKENS = 500
DEFAULT_MODEL = "gpt-3.5-turbo"

openai.api_key = get_api_key("openai_api_key")

def send_prompt(prompt: str, model: str = DEFAULT_MODEL) -> str:
    messages = [
        {"role": "user", "content": prompt}
    ]
    resp = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=DEFAULT_TEMPERATURE,
        max_tokens=DEFAULT_MAX_TOKENS
    )
    answer = resp.choices[0].message["content"]
    return answer

def send_prompt_with_context(messages: List[Dict], model: str = DEFAULT_MODEL) -> str:
    resp = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=DEFAULT_TEMPERATURE,
        max_tokens=DEFAULT_MAX_TOKENS
    )
    answer = resp.choices[0].message["content"]
    return answer

def predict_and_explain_review_sentiment(review: str) -> Dict[str, str]:
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment. \
    Respond in JSON format with the keys: gpt_sentiment and gpt_explanation. \
    The value for gpt_sentiment should only be either pos or neg without punctuation: pos if the review is positive, neg otherwise. \
    The value for gpt_explanation should be a very short explanation for the sentiment.
    """
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": f"Review: {review}"}
    ]
    pred = send_prompt_with_context(messages)
    return json.loads(pred)

recipe.py
import dataiku

from gpt_utils.chat import predict_and_explain_review_sentiment

SSIZE = 10  # number of reviews to sample and score

input_dataset = dataiku.Dataset("reviews")
new_cols = [
    {"type": "string", "name": "gpt_sentiment"},
    {"type": "string", "name": "gpt_explanation"}
]
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("reviews_sample_gpt_scored")
output_dataset.write_schema(output_schema)

cnt = 0
with output_dataset.get_writer() as w:
    for r in input_dataset.iter_rows():
        gpt_out = predict_and_explain_review_sentiment(r.get("text"))
        w.write_row_dict({**dict(r), **gpt_out})
        cnt += 1
        if cnt == SSIZE:
            break