Few-shot classification with the LLM Mesh#
In the zero-shot classification tutorial, you learned how to build a classifier that can perform reasonably well on a task it has not been explicitly trained for. However, more difficult classification tasks require additional steps.
In this tutorial, you will learn about few-shot learning, a convenient way to improve the model’s performance by showing relevant examples without retraining or fine-tuning it. You will also dive deeper into understanding prompt tokens and see how LLMs can be evaluated.
Prerequisites#
Dataiku >= 13.1
“Use” permission on a code environment using Python >= 3.9 with the following packages:
langchain (tested with version 0.2.0)
transformers (tested with version 4.43.3)
scikit-learn (tested with version 1.2.2)
Access to an existing project with the following permissions:
“Read project content”
“Write project content”
An LLM Mesh connection (to get a valid LLM ID, please refer to this documentation)
Reading and implementing the tutorial on zero-shot classification
Preparing the data#
For this tutorial, you will work on a different dataset to predict more than two classes.
This dataset is part of the Amazon Review Dataset,
and contains an extract of the “Magazine subscriptions” category.
Download the data file here
and create a dataset called reviews_magazines
with it in your project.
Counting tokens#
Language models process text as tokens, which are recurring sequences of characters, to understand the statistical links between them and infer the most probable next ones. Counting tokens in a text input is helpful because:
It is the main pricing unit for managed LLM services (more details on OpenAI’s dedicated page),
It lets you enforce minimum/maximum thresholds on input and output lengths.
You will use the method get_num_tokens()
to enrich your dataset with the number of tokens per review.
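As a quick illustration (run in a notebook with a valid LLM_ID; the review text is made up and the resulting count depends on the model's tokenizer):

import dataiku

LLM_ID = ""  # Fill with your LLM-Mesh id
lc_llm = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_llm()

# Number of tokens the underlying model counts for this string
print(lc_llm.get_num_tokens("Great magazine, arrived on time!"))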
Binning ratings and splitting the data#
Next, select the reviews_magazines dataset, create a Python recipe with two outputs called reviews_mag_train and reviews_mag_test, and copy Code 1 as the recipe's content.
import dataiku
import random
LLM_ID = "" # Fill with your LLM-Mesh id
lc_llm = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_llm()
def bin_score(x):
    return 'pos' if float(x) >= 4.0 else ('ntr' if float(x) == 3.0 else 'neg')
random.seed(1337)
input_dataset = dataiku.Dataset("reviews_magazines")
output_schema = [
{"type": "string", "name": "reviewText"},
{"type": "int", "name": "nb_tokens"},
{"type": "string", "name": "sentiment"}
]
train_dataset = dataiku.Dataset("reviews_mag_train")
train_dataset.write_schema(output_schema)
w_train = train_dataset.get_writer()
test_dataset = dataiku.Dataset("reviews_mag_test")
test_dataset.write_schema(output_schema)
w_test = test_dataset.get_writer()
for r in input_dataset.iter_rows():
    text = r.get("reviewText")
    if len(text) > 0:
        out_row = {
            "reviewText": text,
            "nb_tokens": lc_llm.get_num_tokens(text),
            "sentiment": bin_score(r.get("overall"))
        }
        rnd = random.random()
        if rnd < 0.5:
            w_train.write_row_dict(out_row)
        else:
            w_test.write_row_dict(out_row)
w_train.close()
w_test.close()
This code bins the rating scores from the overall column into a new categorical column called sentiment where the values can be either:

pos (positive) for ratings \(\geq 4\),
ntr (neutral) for ratings \(= 3\),
neg (negative) otherwise.

It also removes useless columns to keep only sentiment and reviewText (the content of the user review) and randomly dispatches output rows between:

reviews_mag_test, on which you'll run and evaluate your classifier,
reviews_mag_train, whose role will be explained later.
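As a quick sanity check of the binning logic defined in Code 1:

# bin_score() exactly as defined in Code 1
def bin_score(x):
    return 'pos' if float(x) >= 4.0 else ('ntr' if float(x) == 3.0 else 'neg')

assert bin_score(5.0) == 'pos'  # ratings >= 4 are positive
assert bin_score(3.0) == 'ntr'  # a rating of 3 is neutral
assert bin_score(1.5) == 'neg'  # everything else is negative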
Building a zero-shot-based baseline#
Establish a baseline using a zero-shot classification prompt on the test dataset.
Defining the prompt#
Create a directory named utils
in your project library and create chat.py
with the content shown in
Code 2.
from dataikuapi.dss.langchain import DKUChatModel
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
import json
from typing import Dict, List
def predict_sentiment(chat: DKUChatModel, review: str):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
        HumanMessage(content=f"""Review: {review}""")
    ]
    resp = chat.invoke(messages)
    return json.loads(resp.content)
predict_sentiment() defines the multi-class classification task by telling the model which classes to expect (pos, ntr, neg) and how to format the output.
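Before wiring the function into a recipe, you can sanity-check it from a notebook; the review below is a made-up example and the exact output depends on your model:

import dataiku
from utils.chat import predict_sentiment

LLM_ID = ""  # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)

print(predict_sentiment(chat, "The magazine arrived two months late and half the pages were torn."))
# Expected shape: {'llm_sentiment': 'neg'}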
From there you can write the code for the zero-shot run.
Running and evaluating the model#
Create and run a Python recipe using reviews_mag_test
as input
and a new dataset called test_zs_scored
as output, then add the following code:
import dataiku
from utils.chat import predict_sentiment
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
{"type": "int", "name": "nb_tokens"}
]
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_zs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        print(f"{i+1}")
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        out_row.update(predict_sentiment(chat=chat,
                                         review=r.get("reviewText")))
        w.write_row_dict(out_row)
That recipe iterates over the test dataset and writes each review's inferred sentiment to the llm_sentiment column.
Once the test_zs_scored
dataset is built, you can evaluate your classifier’s performance:
in your project library, create a new file under utils
called evaluate.py
with
Code 4.
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from typing import Dict
def get_classif_metrics(df: pd.DataFrame,
                        pred_col: str,
                        truth_col: str) -> Dict[str, float]:
    metrics = {
        "precision": precision_score(y_pred=df[pred_col],
                                     y_true=df[truth_col],
                                     average="macro"),
        "recall": recall_score(y_pred=df[pred_col],
                               y_true=df[truth_col],
                               average="macro")
    }
    return metrics
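Note that average="macro" computes the unweighted mean of the per-class scores, so the three classes count equally regardless of how many reviews each contains. For precision, for instance, \(P_{macro} = \frac{1}{3}(P_{pos} + P_{ntr} + P_{neg})\), and the same holds for recall.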
You can now use the get_classif_metrics()
function to compute the precision and recall scores on the test dataset
by running Code 5 in a notebook.
import warnings
warnings.filterwarnings('ignore')
###
import dataiku
from utils.evaluate import get_classif_metrics
df = dataiku.Dataset("test_zs_scored") \
.get_dataframe(columns=["sentiment", "llm_sentiment"])
metrics_zs = get_classif_metrics(df, "llm_sentiment", "sentiment")
print(metrics_zs)
###
{'precision': 0.61, 'recall': 0.71}
Implementing few-shot learning#
Next, you’ll attempt to improve the baseline model’s performance using few-shot learning, which supplements the model with training examples via the prompt without requiring retraining.
There are many ways to identify relevant training examples; in this tutorial, you will use a relatively intuitive approach:
Start by running a zero-shot classification on the training dataset,
Flag a subset of the resulting misclassified reviews (where the prediction disagrees with the true label) to add to your prompt at evaluation time.
Retrieving relevant examples#
Create and run a Python recipe using reviews_mag_train
as input and a new dataset called train_fs_examples
as output
with Code 6 as its content.
import dataiku
from utils.chat import predict_sentiment
SIZE_EX_MIN = 20
SIZE_EX_MAX = 100
NB_EX_MAX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_train")
input_schema = input_dataset.read_schema()
output_dataset = dataiku.Dataset("train_fs_examples")
new_cols = [
{"type": "string", "name": "llm_sentiment"}
]
output_schema = input_schema + new_cols
output_dataset.write_schema(output_schema)
nb_ex = 0
with output_dataset.get_writer() as w:
    for r in input_dataset.iter_rows():
        # Check token-based length
        nb_tokens = r.get("nb_tokens")
        if nb_tokens > SIZE_EX_MIN and nb_tokens < SIZE_EX_MAX:
            pred = predict_sentiment(chat, r.get("reviewText"))
            # Keep prediction only if it was mistaken
            if pred["llm_sentiment"] != r.get("sentiment"):
                out_row = dict(r)
                out_row["llm_sentiment"] = pred["llm_sentiment"]
                w.write_row_dict(out_row)
                nb_ex += 1
                if nb_ex == NB_EX_MAX:
                    break
This code iterates over the training data, filters out the reviews whose (token-based) length is not between SIZE_EX_MIN and SIZE_EX_MAX, and then writes a prediction to the output dataset only when it is mistaken. There is also a limit of NB_EX_MAX (10) examples to make sure that, at evaluation time, the augmented prompts do not increase the model's cost and execution time too much.
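Before moving on, you can optionally peek at the selected examples in a notebook:

import dataiku

# Inspect the misclassified reviews that will serve as few-shot examples
df_ex = dataiku.Dataset("train_fs_examples").get_dataframe()
print(df_ex[["reviewText", "sentiment", "llm_sentiment"]].head(10))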
Running and evaluating the new model#
The next step is incorporating those examples into your prompt before re-running the classification process.
Let’s start by updating the prompt:
When calling the LLM, examples are added as user/assistant message exchanges right after the system message.
To apply this, in your project library’s chat.py
file, add the following functions:
def build_example_msg(rec: Dict) -> List[Dict]:
    example = [
        {"Review": rec['reviewText'], "llm_sentiment": rec['sentiment']}
    ]
    return example
The build_example_msg() function turns a record's raw data into a small example dict; predict_sentiment_fs(), defined next, converts these dicts into LangChain human/assistant messages.
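For example, with a made-up record:

record = {"reviewText": "Subscription renewed without notice.", "sentiment": "neg"}
print(build_example_msg(record))
# [{'Review': 'Subscription renewed without notice.', 'llm_sentiment': 'neg'}]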
def predict_sentiment_fs(chat: DKUChatModel, review: str, examples: List):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
    ]
    for ex in examples:
        messages.append(HumanMessage(ex.get('Review')))
        messages.append(AIMessage(ex.get('llm_sentiment')))
    messages.append(HumanMessage(content=f"""Review: {review}"""))
    resp = chat.invoke(messages)
    return {'llm_sentiment': resp.content}
The predict_sentiment_fs()
function is a modified version of predict_sentiment()
that adds a new examples
argument
to enrich the prompt for few-shot learning.
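As a quick illustration in a notebook (the example reviews below are made up, and chat is the DKUChatModel built as in the previous recipes):

from utils.chat import predict_sentiment_fs

examples = [
    {"Review": "Never received a single issue.", "llm_sentiment": "neg"},
    {"Review": "Great value, arrives on time every month.", "llm_sentiment": "pos"},
]
# chat: DKUChatModel obtained via as_langchain_chat_model(temperature=0)
print(predict_sentiment_fs(chat, "The articles are fine but delivery is unreliable.", examples))
# e.g. {'llm_sentiment': 'ntr'}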
With these new tools, you can execute a few-shot learning run on the test dataset! Create a Python recipe with train_fs_examples and reviews_mag_test as inputs, a new output dataset called test_fs_scored, and Code 9 as its content.
import dataiku
from utils.chat import predict_sentiment_fs
from utils.chat import build_example_msg
MIN_EX_LEN = 5
MAX_EX_LEN = 200
MAX_NB_EX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
]
# Retrieve a few examples from the training dataset
examples_dataset = dataiku.Dataset("train_fs_examples")
ex_to_add = []
tot_tokens = 0
for r in examples_dataset.iter_rows():
    nb_tokens = r.get("nb_tokens")
    if (nb_tokens > MIN_EX_LEN and nb_tokens < MAX_EX_LEN):
        ex_to_add += build_example_msg(dict(r))
        tot_tokens += nb_tokens
    if len(ex_to_add) == MAX_NB_EX:
        print(f"Total tokens = {tot_tokens}")
        break
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_fs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        result = predict_sentiment_fs(chat=chat,
                                      review=r.get("reviewText"),
                                      examples=ex_to_add)
        if result:
            out_row.update(result)
        w.write_row_dict(out_row)
You can now finally assess the benefits of few-shot learning by comparing the classifier’s performance with and without the examples! To do so, run this code in a notebook:
###
import dataiku
from utils.evaluate import get_classif_metrics
df = dataiku.Dataset("test_fs_scored") \
.get_dataframe(columns=["sentiment", "llm_sentiment"])
metrics_fs = get_classif_metrics(df, "llm_sentiment", "sentiment")
print(metrics_fs)
###
{'precision': 0.48, 'recall': 0.54}
Your Flow has reached its final form and should look like this:
Wrapping up#
Congratulations on finishing this (lengthy!) tutorial on few-shot learning! You now have a better overview of how to enrich a prompt to improve the behavior of an LLM on a classification task. Feel free to play with the prompt and the various parameters to see how they can influence the model’s performance! You can also explore other leads to improve the tutorial’s code:
From an ML perspective, the datasets suffer from class imbalance since there are many more positive reviews than negative or neutral ones. You can mitigate that by resampling the initial dataset or by setting up class weights. You can also adjust the number and classes of the few-shot examples to help classify data points belonging to the minority classes.
From a tooling perspective, you can make prompt building even more modular by relying on libraries such as Langchain or Guidance that offer rich prompt templating features; a minimal sketch is shown below.
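For instance, here is a minimal sketch (not used in the rest of this tutorial) of how the few-shot prompt could be rebuilt with Langchain's ChatPromptTemplate and FewShotChatMessagePromptTemplate; the two example records are made up:

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

# Made-up examples; in this tutorial they would come from train_fs_examples
examples = [
    {"review": "Never received a single issue.", "label": "neg"},
    {"review": "Great value, arrives on time every month.", "label": "pos"},
]

# One human/assistant exchange per example
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "Review: {review}"),
    ("ai", '{{"llm_sentiment": "{label}"}}'),
])
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that classifies reviews according to their sentiment."),
    few_shot_prompt,
    ("human", "Review: {review}"),
])

# chat is the DKUChatModel used throughout the tutorial:
# (final_prompt | chat).invoke({"review": "..."}) returns the model's answer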
Finally, you will find below the complete versions of the code presented in this tutorial.
Happy prompt engineering!
chat.py
from dataikuapi.dss.langchain import DKUChatModel
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
import json
from typing import Dict, List
def predict_sentiment(chat: DKUChatModel, review: str):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
        HumanMessage(content=f"""Review: {review}""")
    ]
    resp = chat.invoke(messages)
    return json.loads(resp.content)
def build_example_msg(rec: Dict) -> List[Dict]:
    example = [
        {"Review": rec['reviewText'], "llm_sentiment": rec['sentiment']}
    ]
    return example
def predict_sentiment_fs(chat: DKUChatModel, review: str, examples: List):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
    ]
    for ex in examples:
        messages.append(HumanMessage(ex.get('Review')))
        messages.append(AIMessage(ex.get('llm_sentiment')))
    messages.append(HumanMessage(content=f"""Review: {review}"""))
    resp = chat.invoke(messages)
    return {'llm_sentiment': resp.content}
compute_reviews_mag_train.py
import dataiku
import random
LLM_ID = "" # Fill with your LLM-Mesh id
lc_llm = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_llm()
def bin_score(x):
    return 'pos' if float(x) >= 4.0 else ('ntr' if float(x) == 3.0 else 'neg')
random.seed(1337)
input_dataset = dataiku.Dataset("reviews_magazines")
output_schema = [
{"type": "string", "name": "reviewText"},
{"type": "int", "name": "nb_tokens"},
{"type": "string", "name": "sentiment"}
]
train_dataset = dataiku.Dataset("reviews_mag_train")
train_dataset.write_schema(output_schema)
w_train = train_dataset.get_writer()
test_dataset = dataiku.Dataset("reviews_mag_test")
test_dataset.write_schema(output_schema)
w_test = test_dataset.get_writer()
for r in input_dataset.iter_rows():
    text = r.get("reviewText")
    if len(text) > 0:
        out_row = {
            "reviewText": text,
            "nb_tokens": lc_llm.get_num_tokens(text),
            "sentiment": bin_score(r.get("overall"))
        }
        rnd = random.random()
        if rnd < 0.5:
            w_train.write_row_dict(out_row)
        else:
            w_test.write_row_dict(out_row)
w_train.close()
w_test.close()
compute_test_fs_scored.py
import dataiku
from utils.chat import predict_sentiment_fs
from utils.chat import build_example_msg
MIN_EX_LEN = 5
MAX_EX_LEN = 200
MAX_NB_EX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
]
# Retrieve a few examples from the training dataset
examples_dataset = dataiku.Dataset("train_fs_examples")
ex_to_add = []
tot_tokens = 0
for r in examples_dataset.iter_rows():
    nb_tokens = r.get("nb_tokens")
    if (nb_tokens > MIN_EX_LEN and nb_tokens < MAX_EX_LEN):
        ex_to_add += build_example_msg(dict(r))
        tot_tokens += nb_tokens
    if len(ex_to_add) == MAX_NB_EX:
        print(f"Total tokens = {tot_tokens}")
        break
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_fs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        result = predict_sentiment_fs(chat=chat,
                                      review=r.get("reviewText"),
                                      examples=ex_to_add)
        if result:
            out_row.update(result)
        w.write_row_dict(out_row)
compute_test_zs_scored.py
import dataiku
from utils.chat import predict_sentiment
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
{"type": "int", "name": "nb_tokens"}
]
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_zs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        print(f"{i+1}")
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        out_row.update(predict_sentiment(chat=chat,
                                         review=r.get("reviewText")))
        w.write_row_dict(out_row)
compute_train_fs_examples.py
import dataiku
from utils.chat import predict_sentiment
SIZE_EX_MIN = 20
SIZE_EX_MAX = 100
NB_EX_MAX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_train")
input_schema = input_dataset.read_schema()
output_dataset = dataiku.Dataset("train_fs_examples")
new_cols = [
{"type": "string", "name": "llm_sentiment"}
]
output_schema = input_schema + new_cols
output_dataset.write_schema(output_schema)
nb_ex = 0
with output_dataset.get_writer() as w:
    for r in input_dataset.iter_rows():
        # Check token-based length
        nb_tokens = r.get("nb_tokens")
        if nb_tokens > SIZE_EX_MIN and nb_tokens < SIZE_EX_MAX:
            pred = predict_sentiment(chat, r.get("reviewText"))
            # Keep prediction only if it was mistaken
            if pred["llm_sentiment"] != r.get("sentiment"):
                out_row = dict(r)
                out_row["llm_sentiment"] = pred["llm_sentiment"]
                w.write_row_dict(out_row)
                nb_ex += 1
                if nb_ex == NB_EX_MAX:
                    break
evaluate.py
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from typing import Dict
def get_classif_metrics(df: pd.DataFrame,
                        pred_col: str,
                        truth_col: str) -> Dict[str, float]:
    metrics = {
        "precision": precision_score(y_pred=df[pred_col],
                                     y_true=df[truth_col],
                                     average="macro"),
        "recall": recall_score(y_pred=df[pred_col],
                               y_true=df[truth_col],
                               average="macro")
    }
    return metrics