Using LLM Mesh to benchmark zero-shot classification models#

Prerequisites#

  • Dataiku >= 12.4

  • Access to at least 2 LLM connections (see the reference documentation for configuration details). This tutorial uses OpenAI’s GPT-3.5-turbo and GPT-4, but you can easily swap them with other providers/models.

  • Access to an existing project with the following permissions:

    • “Read project content”

    • “Write project content”

LLM Mesh API basics: prompting a model#

With Dataiku’s LLM Mesh capabilities, you can leverage the power of various LLM types through a unified programmatic interface provided by the public API. More specifically, the Python API client lets you easily manipulate the query and response objects you send to and receive from your LLM.

Authentication is fully handled under the hood by the LLM connection settings, so you can focus solely on the essential parts of your code.

As a first example, let’s see how to query a GPT-3.5-turbo model. Run the following code from a notebook:

basic_query.py#
import dataiku

GPT_35_LLM_ID = "" # Fill with your gpt-3.5-turbo LLM id

client = dataiku.api_client()
project = client.get_default_project()
llm = project.get_llm(GPT_35_LLM_ID)

compl = llm.new_completion()
q = compl.with_message("Write a one-sentence positive review for the Lord of The Rings movie trilogy.")
resp = q.execute()
if resp.success:
    print(resp.text)
else:
    raise Exception("LLM inference failed!")

# The Lord of the Rings is a thrilling epic adventure that follows a group of
# unlikely heroes as they journey through dangerous lands in order to destroy
# a powerful ring and save their world from eternal darkness.

Here is what happens under the hood with this code snippet:

  1. A DSSLLM object is instantiated using the LLM connection associated with the specified LLM ID. If you don’t have your LLM ID at hand, you can use the list_llms() method to list all available LLMs within the project, as shown in the snippet after this list.

  2. From this DSSLLM object, a DSSLLMCompletionQuery object is created to serve as the foundation for the prompt. The prompt is built by adding one (or more) messages with the with_message() method.

  3. Once the prompt is built, the query is executed. It returns a DSSLLMCompletionResponse object that you can use to check whether the model inference ran successfully and to retrieve the model’s output.
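
For instance, here is a minimal sketch for looking up the available LLM IDs from a notebook:

list_llms.py#
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Print the ID of every LLM exposed to this project
for llm in project.list_llms():
    print(f"- {llm.description} (id: {llm.id})")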

Note also that this code doesn’t rely on any external dependency: for this simple use case, the Python API client is all you need.

Now for the interesting part: if you want to swap the GPT-3.5-turbo model for another LLM, you only need to change the LLM ID. The rest of the code remains exactly the same. This is one of the main strengths of the LLM Mesh: it allows developers to write provider-agnostic code when prompting models.

Classifying movie reviews#

Let’s look at a more elaborate use case: movie review classification. To perform this task, you will need to build a richer prompt that will:

  • align with the task at hand,

  • generate standardized outputs.

To do so, you can rely on system messages, whose role is to define the model’s behavior. In practice, you describe this behavior in a message added with the with_message() method and the role="system" parameter. Here is an example:

basic_review.py#
import dataiku

GPT_35_LLM_ID = "" # Fill with your gpt-3.5-turbo LLM id

client = dataiku.api_client()
project = client.get_default_project()
llm = project.get_llm(GPT_35_LLM_ID)

reviews = [
    {
        "sentiment": 0,
        "text": "This movie was horrible: bad actor performance, poor scenario and ugly CGI effects."
    },
    {
        "sentiment": 1,
        "text": "Beautiful movie with an original storyline and top-notch actor performance."
    }
]

sys_msg = """
You are a movie review expert.
Your task is to classify movie review sentiments in two categories: 0 for negative reviews,
1 for positive reviews. Do not answer with anything other than 0 or 1.
"""

for r in reviews:
    # Create a fresh query for each review: with_message() appends to the same
    # query object, so reusing it would accumulate messages across iterations
    compl = llm.new_completion()
    q = compl \
        .with_message(role="system", message=sys_msg) \
        .with_message(f"Classify this movie review: {r['text']}")
    resp = q.execute()
    if resp.success:
        print(f"{r['text']}\n Inference: {resp.text}\n Ground truth: {r['sentiment']}\n{20*'---'}")
    else:
        raise Exception("LLM inference failed!")

To make this code more composable, you can wrap it into a function and place it in your project library. Go to the project library and, under python/, create a new directory called review_code. Inside that directory, create two files:

  • __init__.py, which should be left empty,

  • models.py, which will contain our helper function.

Add the following code to models.py:

models.py#
from typing import Dict
from dataikuapi.dss.llm import DSSLLM


def zshot_clf(model: DSSLLM, row: Dict[str, str]) -> Dict[str, str]:
    """Classify a movie review's sentiment with a zero-shot prompt.

    Returns a copy of the input row enriched with the model's prediction
    and the ID of the LLM that produced it.
    """
    sys_msg = """
    You are a movie review expert.
    Your task is to classify movie review sentiments in two categories: 0 for negative reviews,
    1 for positive reviews. Answer only with one character that is either 0 or 1.
    """

    compl = model.new_completion()
    q = compl \
        .with_message(role="system", message=sys_msg) \
        .with_message(f"Classify this movie review: {row['text']}")
    resp = q.execute()
    if resp.success:
        out_row = dict(row)
        out_row["prediction"] = str(resp.text)
        out_row["llm_id"] = model.llm_id
        return out_row
    else:
        raise Exception(f"LLM inference failed for input row:\n{row}")
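
Before wiring this helper into a recipe, you can sanity-check it from a notebook. Here is a minimal sketch, assuming the project library is importable (which is the case for notebooks running inside the project); the sample review is only an illustration:

test_zshot.py#
import dataiku
from review_code.models import zshot_clf

GPT_35_LLM_ID = ""  # Fill with your gpt-3.5-turbo LLM id

client = dataiku.api_client()
project = client.get_default_project()
llm = project.get_llm(GPT_35_LLM_ID)

test_row = {"text": "A dull, overlong film with no redeeming qualities."}
print(zshot_clf(llm, test_row))
# e.g. {'text': '...', 'prediction': '0', 'llm_id': '...'}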

Your next task is to gather the input dataset containing the movie reviews. The dataset you will work on is an extract from the Large Movie Review Dataset. Download the file here and use it to create a dataset in your project called reviews. In this dataset, there are two columns of interest:

  • text contains the reviews to be analyzed,

  • polarity reflects the review sentiment: 0 for negative, 1 for positive.

Next, from the reviews dataset, create a Python recipe with a single output dataset called reviews_scored with the following code:

recipe_zshot.py#
import dataiku
from review_code.models import zshot_clf

GPT_35_LLM_ID = ""  # Fill with your gpt-3.5-turbo LLM id
N_MAX_OUTPUT_ROWS = 100  # Change this value to increase the number of rows to use

# Create new schema for the output dataset
reviews = dataiku.Dataset("reviews")
schema = reviews.read_schema()
reviews_scored = dataiku.Dataset("reviews_scored")
for c in ["prediction", "llm_id"]:
    schema.append({"name": c, "type": "string"})
reviews_scored.write_schema(schema)


# Retrieve the LLM handle
client = dataiku.api_client()
project = client.get_default_project()
llm = project.get_llm(GPT_35_LLM_ID)

# Iteratively classify reviews
with reviews_scored.get_writer() as w_out:
    for i, row in enumerate(reviews.iter_rows()):
        # Stop once N_MAX_OUTPUT_ROWS rows have been written
        if i >= N_MAX_OUTPUT_ROWS:
            break
        w_out.write_row_dict(zshot_clf(llm, row))

After running this recipe, your reviews_scored dataset should be populated with the predicted sentiments. The final step is to compute the performance of your zero-shot classifier since you have the ground truth at your disposal.

To compute the accuracy of your model, run the following code:

acc_single_model.py#
import dataiku
from sklearn.metrics import accuracy_score

review_scored_df = dataiku.Dataset("reviews_scored").get_dataframe()
acc = accuracy_score(review_scored_df["polarity"],
                     review_scored_df["prediction"])
print(f"ACC = {acc:.2f}")

You should get a decent accuracy value, but what if you wanted to see how this model compares to another one?

Benchmarking multiple LLMs#

In this section, you will see how to easily run the same operation as before over two different models. In practice, you will compare the performance of GPT-3.5-turbo with GPT-4.

Create a new Python recipe with:

  • reviews as input dataset,

  • reviews_2_models_scored as a new output dataset.

Add the following code:

recipe_2_models_zshot.py#
import dataiku
from review_code.models import zshot_clf

GPT_35_LLM_ID = ""  # Fill with your gpt-3.5-turbo LLM id
GPT_4_LLM_ID = ""  # Fill with your gpt-4 LLM id
N_MAX_OUTPUT_ROWS = 100  # Change this value to increase the number of rows to use

# Create new schema for the output dataset
reviews = dataiku.Dataset("reviews")
schema = reviews.read_schema()
reviews_2_models_scored = dataiku.Dataset("reviews_2_models_scored")
for c in [f"pred_{GPT_35_LLM_ID}", f"pred_{GPT_4_LLM_ID}"]:
    schema.append({"name": c, "type": "string"})
reviews_2_models_scored.write_schema(schema)

# Retrieve the LLM handles
client = dataiku.api_client()
project = client.get_default_project()
gpt_35 = project.get_llm(GPT_35_LLM_ID)
gpt_4 = project.get_llm(GPT_4_LLM_ID)

# Iteratively classify reviews with both models
with reviews_2_models_scored.get_writer() as w_out:
    for i, row in enumerate(reviews.iter_rows()):
        # Stop once N_MAX_OUTPUT_ROWS rows have been written
        if i >= N_MAX_OUTPUT_ROWS:
            break
        print(i)  # Crude progress indicator
        preds = {}
        preds[f"pred_{GPT_35_LLM_ID}"] = zshot_clf(gpt_35, row).get("prediction")
        preds[f"pred_{GPT_4_LLM_ID}"] = zshot_clf(gpt_4, row).get("prediction")
        w_out.write_row_dict({**row, **preds})

Note that the code barely changes even when you introduce a new model: you just have to create a new handle and call the scoring function a second time, nothing more! The same pattern extends to any number of models, as in the sketch below.
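
Here is a minimal sketch of that generalization, looping over an arbitrary list of LLM IDs and writing to a hypothetical reviews_n_models_scored output dataset:

recipe_n_models_zshot.py#
import dataiku
from review_code.models import zshot_clf

LLM_IDS = ["", ""]  # Fill with as many LLM ids as you want to benchmark
N_MAX_OUTPUT_ROWS = 100

# Extend the input schema with one prediction column per model
reviews = dataiku.Dataset("reviews")
schema = reviews.read_schema()
output = dataiku.Dataset("reviews_n_models_scored")
for llm_id in LLM_IDS:
    schema.append({"name": f"pred_{llm_id}", "type": "string"})
output.write_schema(schema)

# One handle per benchmarked model
project = dataiku.api_client().get_default_project()
llms = {llm_id: project.get_llm(llm_id) for llm_id in LLM_IDS}

# Score each review with every model
with output.get_writer() as w_out:
    for i, row in enumerate(reviews.iter_rows()):
        if i >= N_MAX_OUTPUT_ROWS:
            break
        preds = {f"pred_{llm_id}": zshot_clf(llm, row).get("prediction")
                 for llm_id, llm in llms.items()}
        w_out.write_row_dict({**row, **preds})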

To compare how both models are doing in terms of performance, you can run the following code:

acc_2_models.py#
import dataiku
from sklearn.metrics import accuracy_score

GPT_35_LLM_ID = ""  # Fill with your gpt-3.5-turbo LLM id
GPT_4_LLM_ID = ""  # Fill with your gpt-4 LLM id

acc = []

review_scored_df = dataiku.Dataset("reviews_2_models_scored").get_dataframe()
for m in [GPT_35_LLM_ID, GPT_4_LLM_ID]:
    acc.append({"llm_id": m,
                "accuracy": accuracy_score(review_scored_df["polarity"],
                                           review_scored_df[f"pred_{m}"])})

for ac in acc:
    print(f"ACC({ac.get('llm_id')}) = {ac.get('accuracy'):.2f}")

You will get relatively similar performances for a small value of N_MAX_OUTPUT_ROWS, but as you increase the number of scored records, you should see GPT-4 performing a bit better. Do not hesitate to try other models among those at your disposal: as you have seen, doing so requires only a few changes to the code!

Wrapping up#

Congratulations on finishing this tutorial! You now have a good overview of the LLM Mesh completion query capabilities in Dataiku. If you want to go further, you can try:

  • tweaking the prompt,

  • running comparisons with more than two models,

  • leveraging Dataiku’s experiment tracking abilities to better log parameters, prompt variations, and resulting performances (see the sketch after this list).
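
As a starting point for that last item, here is a minimal sketch that logs one run per model with Dataiku’s MLflow-based experiment tracking. It assumes a managed folder named experiments already exists in the project to store the runs, and it reuses the acc list built in acc_2_models.py; adapt the names to your setup:

track_benchmark.py#
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Assumption: a managed folder named "experiments" exists in the project
mf_id = next(f["id"] for f in project.list_managed_folders()
             if f["name"] == "experiments")
managed_folder = project.get_managed_folder(mf_id)

# `acc` is the list of {"llm_id": ..., "accuracy": ...} dicts
# built in acc_2_models.py
with project.setup_mlflow(managed_folder=managed_folder) as mlflow:
    mlflow.set_experiment("zero-shot-benchmark")
    for ac in acc:
        with mlflow.start_run(run_name=ac["llm_id"]):
            mlflow.log_param("llm_id", ac["llm_id"])
            mlflow.log_metric("accuracy", ac["accuracy"])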
