Few-shot classification with the LLM Mesh#
In the zero-shot classification tutorial, you learned how to build a classifier that can perform reasonably well on a task it has not been explicitly trained for. However, more difficult classification tasks require additional steps.
In this tutorial, you will learn about few-shot learning, a convenient way to improve the model’s performance by showing relevant examples without retraining or fine-tuning it. You will also dive deeper into understanding prompt tokens and see how LLMs can be evaluated.
Prerequisites#
Dataiku >= 13.1
“Use” permission on a code environment using Python >= 3.9 with the following packages:
langchain (tested with version 0.2.0)
transformers (tested with version 4.43.3)
scikit-learn (tested with version 1.2.2)
Access to an existing project with the following permissions:
“Read project content”
“Write project content”
An LLM Mesh connection (to get a valid LLM ID, please refer to this documentation)
Reading and implementing the tutorial on zero-shot classification
Preparing the data#
For this tutorial, you will work on a different dataset to predict more than two classes.
This dataset is part of the Amazon Review Dataset,
and contains an extract of the “Magazine subscriptions” category.
Download the data file here
and create a dataset called reviews_magazines
with it in your project.
Counting tokens#
Language models process text as tokens, which are recurring sequences of characters, to understand the statistical links between them and infer the most probable next ones. Counting tokens in a text input is helpful because:
It is the main pricing unit for managed LLM services (more details on OpenAI’s dedicated page),
It lets you enforce minimum/maximum thresholds on input and output lengths.
You will use the method get_num_tokens()
to enrich your dataset with the number of tokens per review.
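As a quick illustration (run in a notebook with a valid LLM_ID; the review text is made up and the resulting count depends on the model's tokenizer):

import dataiku

LLM_ID = ""  # Fill with your LLM-Mesh id
lc_llm = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_llm()

# Number of tokens the underlying model counts for this string
print(lc_llm.get_num_tokens("Great magazine, arrived on time!"))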
Binning ratings and splitting the data#
Next, select the reviews_magazines dataset, create a Python recipe with two outputs called reviews_mag_train and reviews_mag_test, and copy Code 1 as the recipe's content.
import dataiku
import random
LLM_ID = "" # Fill with your LLM-Mesh id
lc_llm = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_llm()
def bin_score(x):
    return 'pos' if float(x) >= 4.0 else ('ntr' if float(x) == 3.0 else 'neg')
random.seed(1337)
input_dataset = dataiku.Dataset("reviews_magazines")
output_schema = [
{"type": "string", "name": "reviewText"},
{"type": "int", "name": "nb_tokens"},
{"type": "string", "name": "sentiment"}
]
train_dataset = dataiku.Dataset("reviews_mag_train")
train_dataset.write_schema(output_schema)
w_train = train_dataset.get_writer()
test_dataset = dataiku.Dataset("reviews_mag_test")
test_dataset.write_schema(output_schema)
w_test = test_dataset.get_writer()
for r in input_dataset.iter_rows():
    text = r.get("reviewText")
    if len(text) > 0:
        out_row = {
            "reviewText": text,
            "nb_tokens": lc_llm.get_num_tokens(text),
            "sentiment": bin_score(r.get("overall"))
        }
        rnd = random.random()
        if rnd < 0.5:
            w_train.write_row_dict(out_row)
        else:
            w_test.write_row_dict(out_row)
w_train.close()
w_test.close()
This code bins the rating scores from the overall column into a new categorical column called sentiment where the values can be either:

pos (positive) for ratings \(\geq 4\),
ntr (neutral) for ratings \(= 3\),
neg (negative) otherwise.

It also removes useless columns to keep only sentiment and reviewText (the content of the user review) and randomly dispatches output rows between:

reviews_mag_test, on which you'll run and evaluate your classifier,
reviews_mag_train, whose role will be explained later.
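As a quick sanity check of the binning logic defined in Code 1:

# bin_score() exactly as defined in Code 1
def bin_score(x):
    return 'pos' if float(x) >= 4.0 else ('ntr' if float(x) == 3.0 else 'neg')

assert bin_score(5.0) == 'pos'  # ratings >= 4 are positive
assert bin_score(3.0) == 'ntr'  # a rating of 3 is neutral
assert bin_score(1.5) == 'neg'  # everything else is negative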
Building a zero-shot-based baseline#
Establish a baseline using a zero-shot classification prompt on the test dataset.
Defining the prompt#
Create a directory named utils
in your project library and create chat.py
with the content shown in
Code 2.
from dataikuapi.dss.langchain import DKUChatModel
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
import json
from typing import Dict, List
def predict_sentiment(chat: DKUChatModel, review: str):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
        HumanMessage(content=f"""Review: {review}""")
    ]
    resp = chat.invoke(messages)
    return json.loads(resp.content)
predict_sentiment() defines the multi-class classification task by telling the model which classes to expect (pos, ntr, neg) and how to format the output.
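Before wiring the function into a recipe, you can sanity-check it from a notebook; the review below is a made-up example and the exact output depends on your model:

import dataiku
from utils.chat import predict_sentiment

LLM_ID = ""  # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)

print(predict_sentiment(chat, "The magazine arrived two months late and half the pages were torn."))
# Expected shape: {'llm_sentiment': 'neg'}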
From there you can write the code for the zero-shot run.
Running and evaluating the model#
Create and run a Python recipe using reviews_mag_test
as input
and a new dataset called test_zs_scored
as output, then add the following code:
import dataiku
from utils.chat import predict_sentiment
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
{"type": "int", "name": "nb_tokens"}
]
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_zs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        print(f"{i+1}")
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        out_row.update(predict_sentiment(chat=chat,
                                         review=r.get("reviewText")))
        w.write_row_dict(out_row)
That recipe iterates over the test dataset and writes each review's inferred sentiment to the llm_sentiment column.
Once the test_zs_scored
dataset is built, you can evaluate your classifier’s performance:
in your project library, create a new file under utils
called evaluate.py
with
Code 4.
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from typing import Dict
def get_classif_metrics(df: pd.DataFrame,
                        pred_col: str,
                        truth_col: str) -> Dict[str, float]:
    metrics = {
        "precision": precision_score(y_pred=df[pred_col],
                                     y_true=df[truth_col],
                                     average="macro"),
        "recall": recall_score(y_pred=df[pred_col],
                               y_true=df[truth_col],
                               average="macro")
    }
    return metrics
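Note that average="macro" computes the unweighted mean of the per-class scores, so the three classes count equally regardless of how many reviews each contains. For precision, for instance, \(P_{macro} = \frac{1}{3}(P_{pos} + P_{ntr} + P_{neg})\), and the same holds for recall.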
You can now use the get_classif_metrics()
function to compute the precision and recall scores on the test dataset
by running Code 5 in a notebook.
import warnings
warnings.filterwarnings('ignore')
###
import dataiku
from utils.evaluate import get_classif_metrics
df = dataiku.Dataset("test_zs_scored") \
.get_dataframe(columns=["sentiment", "llm_sentiment"])
metrics_zs = get_classif_metrics(df, "llm_sentiment", "sentiment")
print(metrics_zs)
###
{'precision': 0.61, 'recall': 0.71}
Implementing few-shot learning#
Next, you’ll attempt to improve the baseline model’s performance using few-shot learning, which supplements the model with training examples via the prompt without requiring retraining.
There are many ways to identify relevant training examples; in this tutorial, you will use a relatively intuitive approach:
Start by running a zero-shot classification on the training dataset,
Flag a subset of the resulting misclassified reviews (where the prediction disagrees with the true label) to add to your prompt at evaluation time.
Retrieving relevant examples#
Create and run a Python recipe using reviews_mag_train
as input and a new dataset called train_fs_examples
as output
with Code 6 as its content.
import dataiku
from utils.chat import predict_sentiment
SIZE_EX_MIN = 20
SIZE_EX_MAX = 100
NB_EX_MAX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_train")
input_schema = input_dataset.read_schema()
output_dataset = dataiku.Dataset("train_fs_examples")
new_cols = [
{"type": "string", "name": "llm_sentiment"}
]
output_schema = input_schema + new_cols
output_dataset.write_schema(output_schema)
nb_ex = 0
with output_dataset.get_writer() as w:
    for r in input_dataset.iter_rows():
        # Check token-based length
        nb_tokens = r.get("nb_tokens")
        if nb_tokens > SIZE_EX_MIN and nb_tokens < SIZE_EX_MAX:
            pred = predict_sentiment(chat, r.get("reviewText"))
            # Keep prediction only if it was mistaken
            if pred["llm_sentiment"] != r.get("sentiment"):
                out_row = dict(r)
                out_row["llm_sentiment"] = pred["llm_sentiment"]
                w.write_row_dict(out_row)
                nb_ex += 1
                if nb_ex == NB_EX_MAX:
                    break
This code iterates over the training data, filters out the reviews whose (token-based) length is not between SIZE_EX_MIN and SIZE_EX_MAX, and then writes a prediction to the output dataset only when it is mistaken. There is also a limit of NB_EX_MAX (10) examples to make sure that, at evaluation time, the augmented prompts do not increase the model's cost and execution time too much.
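Before moving on, you can optionally peek at the selected examples in a notebook:

import dataiku

# Inspect the misclassified reviews that will serve as few-shot examples
df_ex = dataiku.Dataset("train_fs_examples").get_dataframe()
print(df_ex[["reviewText", "sentiment", "llm_sentiment"]].head(10))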
Running and evaluating the new model#
The next step is incorporating those examples into your prompt before re-running the classification process.
Let’s start by updating the prompt:
When calling the LLM, examples are added as user/assistant message exchanges right after the system message.
To apply this, in your project library’s chat.py
file, add the following functions:
def build_example_msg(rec: Dict) -> List[Dict]:
    example = [
        {"Review": rec['reviewText'], "llm_sentiment": rec['sentiment']}
    ]
    return example
The build_example_msg() function turns a record's raw data into a small example dict; predict_sentiment_fs(), defined next, converts these dicts into LangChain human/assistant messages.
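For example, with a made-up record:

record = {"reviewText": "Subscription renewed without notice.", "sentiment": "neg"}
print(build_example_msg(record))
# [{'Review': 'Subscription renewed without notice.', 'llm_sentiment': 'neg'}]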
def predict_sentiment_fs(chat: DKUChatModel, review: str, examples: List):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
    ]
    for ex in examples:
        messages.append(HumanMessage(ex.get('Review')))
        messages.append(AIMessage(ex.get('llm_sentiment')))
    messages.append(HumanMessage(content=f"""Review: {review}"""))
    resp = chat.invoke(messages)
    return {'llm_sentiment': resp.content}
The predict_sentiment_fs()
function is a modified version of predict_sentiment()
that adds a new examples
argument
to enrich the prompt for few-shot learning.
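As a quick illustration in a notebook (the example reviews below are made up, and chat is the DKUChatModel built as in the previous recipes):

from utils.chat import predict_sentiment_fs

examples = [
    {"Review": "Never received a single issue.", "llm_sentiment": "neg"},
    {"Review": "Great value, arrives on time every month.", "llm_sentiment": "pos"},
]
# chat: DKUChatModel obtained via as_langchain_chat_model(temperature=0)
print(predict_sentiment_fs(chat, "The articles are fine but delivery is unreliable.", examples))
# e.g. {'llm_sentiment': 'ntr'}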
With these new tools, you can execute a few-shot learning run on the test dataset! Create a Python recipe with train_fs_examples and reviews_mag_test as inputs, a new output dataset called test_fs_scored, and Code 9 as its content.
import dataiku
from utils.chat import predict_sentiment_fs
from utils.chat import build_example_msg
MIN_EX_LEN = 5
MAX_EX_LEN = 200
MAX_NB_EX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
]
# Retrieve a few examples from the training dataset
examples_dataset = dataiku.Dataset("train_fs_examples")
ex_to_add = []
tot_tokens = 0
for r in examples_dataset.iter_rows():
    nb_tokens = r.get("nb_tokens")
    if (nb_tokens > MIN_EX_LEN and nb_tokens < MAX_EX_LEN):
        ex_to_add += build_example_msg(dict(r))
        tot_tokens += nb_tokens
    if len(ex_to_add) == MAX_NB_EX:
        print(f"Total tokens = {tot_tokens}")
        break
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_fs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        result = predict_sentiment_fs(chat=chat,
                                      review=r.get("reviewText"),
                                      examples=ex_to_add)
        if result:
            out_row.update(result)
        w.write_row_dict(out_row)
You can now finally assess the benefits of few-shot learning by comparing the classifier’s performance with and without the examples! To do so, run this code in a notebook:
###
import dataiku
from utils.evaluate import get_classif_metrics
df = dataiku.Dataset("test_fs_scored") \
.get_dataframe(columns=["sentiment", "llm_sentiment"])
metrics_fs = get_classif_metrics(df, "llm_sentiment", "sentiment")
print(metrics_fs)
###
{'precision': 0.48, 'recall': 0.54}
Your Flow has reached its final form and should look like this:
Wrapping up#
Congratulations on finishing this (lengthy!) tutorial on few-shot learning! You now have a better overview of how to enrich a prompt to improve the behavior of an LLM on a classification task. Feel free to play with the prompt and the various parameters to see how they can influence the model’s performance! You can also explore other leads to improve the tutorial’s code:
From an ML perspective, the datasets suffer from class imbalance since there are many more positive reviews than negative or neutral ones. You can mitigate that by resampling the initial dataset or by setting up class weights. You can also adjust the number and classes of the few-shot examples to help classify data points belonging to the minority classes.
From a tooling perspective, you can make prompt building even more modular by relying on libraries such as Langchain or Guidance that offer rich prompt templating features; a minimal sketch is shown below.
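For instance, here is a minimal sketch (not used in the rest of this tutorial) of how the few-shot prompt could be rebuilt with Langchain's ChatPromptTemplate and FewShotChatMessagePromptTemplate; the two example records are made up:

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

# Made-up examples; in this tutorial they would come from train_fs_examples
examples = [
    {"review": "Never received a single issue.", "label": "neg"},
    {"review": "Great value, arrives on time every month.", "label": "pos"},
]

# One human/assistant exchange per example
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "Review: {review}"),
    ("ai", '{{"llm_sentiment": "{label}"}}'),
])
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that classifies reviews according to their sentiment."),
    few_shot_prompt,
    ("human", "Review: {review}"),
])

# chat is the DKUChatModel used throughout the tutorial:
# (final_prompt | chat).invoke({"review": "..."}) returns the model's answer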
Finally, you will find below the complete versions of the code presented in this tutorial.
Happy prompt engineering!
chat.py
from dataikuapi.dss.langchain import DKUChatModel
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
import json
from typing import Dict, List
def predict_sentiment(chat: DKUChatModel, review: str):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
        HumanMessage(content=f"""Review: {review}""")
    ]
    resp = chat.invoke(messages)
    return json.loads(resp.content)
def build_example_msg(rec: Dict) -> List[Dict]:
    example = [
        {"Review": rec['reviewText'], "llm_sentiment": rec['sentiment']}
    ]
    return example
def predict_sentiment_fs(chat: DKUChatModel, review: str, examples: List):
    system_msg = """
    You are an assistant that classifies reviews according to their sentiment.
    Respond strictly with this JSON format: {"llm_sentiment": "xxx"} where xxx should only be either:
    pos if the review is positive
    ntr if the review is neutral or does not contain enough information
    neg if the review is negative
    No other value is allowed.
    """
    messages = [
        SystemMessage(content=system_msg),
    ]
    for ex in examples:
        messages.append(HumanMessage(ex.get('Review')))
        messages.append(AIMessage(ex.get('llm_sentiment')))
    messages.append(HumanMessage(content=f"""Review: {review}"""))
    resp = chat.invoke(messages)
    return {'llm_sentiment': resp.content}
compute_reviews_mag_train.py
import dataiku
import random
LLM_ID = "" # Fill with your LLM-Mesh id
lc_llm = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_llm()
def bin_score(x):
    return 'pos' if float(x) >= 4.0 else ('ntr' if float(x) == 3.0 else 'neg')
random.seed(1337)
input_dataset = dataiku.Dataset("reviews_magazines")
output_schema = [
{"type": "string", "name": "reviewText"},
{"type": "int", "name": "nb_tokens"},
{"type": "string", "name": "sentiment"}
]
train_dataset = dataiku.Dataset("reviews_mag_train")
train_dataset.write_schema(output_schema)
w_train = train_dataset.get_writer()
test_dataset = dataiku.Dataset("reviews_mag_test")
test_dataset.write_schema(output_schema)
w_test = test_dataset.get_writer()
for r in input_dataset.iter_rows():
    text = r.get("reviewText")
    if len(text) > 0:
        out_row = {
            "reviewText": text,
            "nb_tokens": lc_llm.get_num_tokens(text),
            "sentiment": bin_score(r.get("overall"))
        }
        rnd = random.random()
        if rnd < 0.5:
            w_train.write_row_dict(out_row)
        else:
            w_test.write_row_dict(out_row)
w_train.close()
w_test.close()
compute_test_fs_scored.py
import dataiku
from utils.chat import predict_sentiment_fs
from utils.chat import build_example_msg
MIN_EX_LEN = 5
MAX_EX_LEN = 200
MAX_NB_EX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
]
# Retrieve a few examples from the training dataset
examples_dataset = dataiku.Dataset("train_fs_examples")
ex_to_add = []
tot_tokens = 0
for r in examples_dataset.iter_rows():
    nb_tokens = r.get("nb_tokens")
    if (nb_tokens > MIN_EX_LEN and nb_tokens < MAX_EX_LEN):
        ex_to_add += build_example_msg(dict(r))
        tot_tokens += nb_tokens
    if len(ex_to_add) == MAX_NB_EX:
        print(f"Total tokens = {tot_tokens}")
        break
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_fs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        result = predict_sentiment_fs(chat=chat,
                                      review=r.get("reviewText"),
                                      examples=ex_to_add)
        if result:
            out_row.update(result)
        w.write_row_dict(out_row)
compute_test_zs_scored.py
import dataiku
from utils.chat import predict_sentiment
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_test")
new_cols = [
{"type": "string", "name": "llm_sentiment"},
{"type": "int", "name": "nb_tokens"}
]
output_schema = input_dataset.read_schema() + new_cols
output_dataset = dataiku.Dataset("test_zs_scored")
output_dataset.write_schema(output_schema)
# Run prompts on test dataset
with output_dataset.get_writer() as w:
    for i, r in enumerate(input_dataset.iter_rows()):
        print(f"{i+1}")
        out_row = {}
        # Keep columns from input dataset
        out_row.update(dict(r))
        # Add LLM output
        out_row.update(predict_sentiment(chat=chat,
                                         review=r.get("reviewText")))
        w.write_row_dict(out_row)
compute_train_fs_examples.py
import dataiku
from utils.chat import predict_sentiment
SIZE_EX_MIN = 20
SIZE_EX_MAX = 100
NB_EX_MAX = 10
LLM_ID = "" # Fill with your LLM-Mesh id
chat = dataiku.api_client().get_default_project().get_llm(LLM_ID).as_langchain_chat_model(temperature=0)
input_dataset = dataiku.Dataset("reviews_mag_train")
input_schema = input_dataset.read_schema()
output_dataset = dataiku.Dataset("train_fs_examples")
new_cols = [
{"type": "string", "name": "llm_sentiment"}
]
output_schema = input_schema + new_cols
output_dataset.write_schema(output_schema)
nb_ex = 0
with output_dataset.get_writer() as w:
    for r in input_dataset.iter_rows():
        # Check token-based length
        nb_tokens = r.get("nb_tokens")
        if nb_tokens > SIZE_EX_MIN and nb_tokens < SIZE_EX_MAX:
            pred = predict_sentiment(chat, r.get("reviewText"))
            # Keep prediction only if it was mistaken
            if pred["llm_sentiment"] != r.get("sentiment"):
                out_row = dict(r)
                out_row["llm_sentiment"] = pred["llm_sentiment"]
                w.write_row_dict(out_row)
                nb_ex += 1
                if nb_ex == NB_EX_MAX:
                    break
evaluate.py
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from typing import Dict
def get_classif_metrics(df: pd.DataFrame,
                        pred_col: str,
                        truth_col: str) -> Dict[str, float]:
    metrics = {
        "precision": precision_score(y_pred=df[pred_col],
                                     y_true=df[truth_col],
                                     average="macro"),
        "recall": recall_score(y_pred=df[pred_col],
                               y_true=df[truth_col],
                               average="macro")
    }
    return metrics