Comparing zero-shot learning and few-shot learning using Dolly for efficient text classification#

Prerequisites#

  • Dataiku >= 11.4

  • “Use” permission on a code environment using Python >= 3.9 with the following packages:

    • torch==1.13.1

    • transformers==4.30.2

    • scikit-learn==1.0.2

    • accelerate==0.19.0

  • Access to an existing project with the following permissions:

    • “Read project content” on an existing or new DSS project

    • “Write project content”

  • Basic natural language processing (NLP) domain knowledge (tokens, embeddings, etc.)

  • [HIGHLY RECOMMENDED] GPU-enabled code environment

Introduction#

Large Language Models (LLMs) like Dolly and the GPT family allow for zero-shot learning, i.e., they can perform a task without any task-specific training data. They can also be adapted to a specific task with only a handful of examples supplied directly in the prompt, a technique called few-shot learning that requires no fine-tuning. This tutorial walks you through using an LLM for few-shot learning on a text classification task and compares the results to zero-shot learning.

Getting the data#

This tutorial uses the Amazon Review Dataset. The provided Python script downloads one of its category subsets (Luxury Beauty), creates a small sample of reviews, and uploads it as a dataset named beauty_product_reviews. To use this script, create a Python recipe from the Flow with an output dataset named beauty_product_reviews and copy the complete code (provided at the end of this tutorial) into the recipe's code editor. The script maps each review's star rating to a sentiment label:

labeling sentiment#
        # "overall" holds the review's star rating (1 to 5)
        sentiment = record.get("overall", "")
        if sentiment in [1, 2]:
            sentiment = "negative"
        elif sentiment == 3:
            sentiment = "neutral"
        elif sentiment in [4, 5]:
            sentiment = "positive"
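
After labeling each review, the script samples SAMPLE_SIZE reviews and writes them to the beauty_product_reviews dataset. Here is the relevant excerpt from the complete code at the end of this tutorial:

# Get a random sample of SAMPLE_SIZE records
sample_data = random.sample(data, SAMPLE_SIZE)

# Get the dataset object and define its schema
dataset = dataiku.Dataset(DATASET_NAME)
schema = [{"name": "text", "type": "string"},
          {"name": "product_category", "type": "string"},
          {"name": "sentiment", "type": "string"}]
dataset.write_schema(schema)

# Write the rows to the dataset
with dataset.get_writer() as writer:
    for row in sample_data:
        writer.write_row_dict(row)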

Note

The code specifies two recommended sample sizes: 256 with a GPU and 32 without one. Keep in mind that even with a GPU, the number of samples that can feasibly be processed varies widely depending on the GPU configuration (type and memory, among others). Determining the appropriate sample size may require some experimentation.

Model & tokens#

In this tutorial, you will use a pre-trained model to predict a classification label for these reviews. After loading the relevant libraries (see the complete code), the provided script, which can be run as a recipe or in a notebook, loads a model and the corresponding tokenizer. The transformers library provides a simple interface for loading and using pre-trained models and tokenizers for various natural language processing tasks. Loading pre-trained models from remote repositories like Hugging Face is common practice, rather than training a model from scratch for each NLP task.

Here, you’ll be using the databricks/dolly-v2-3b model. It is trained on a diverse range of English-language text from the internet and similar sources and fine-tuned using an instruction corpus. Here is the code for loading the model and the tokenizer:

loading model#
tokenizer = AutoTokenizer.from_pretrained(
    "databricks/dolly-v2-3b",
    padding_side="left"
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    device_map="auto",
    torch_dtype=torch.bfloat16
).eval()

Caution

Loading the model and tokenizer once and sharing them across multiple uses is good practice because it saves time and resources. Pre-trained models and tokenizers can be quite large (this one is over five gigabytes!), and loading them repeatedly can be memory-intensive and slow, depending on the available bandwidth.
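
One way to follow this advice, sketched below with a hypothetical cached getter (not part of the tutorial's code), is to load the objects once and let every caller reuse them:

from functools import lru_cache

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@lru_cache(maxsize=1)
def get_model_and_tokenizer():
    # Load Dolly and its tokenizer once; subsequent calls return the cached objects
    tokenizer = AutoTokenizer.from_pretrained(
        "databricks/dolly-v2-3b",
        padding_side="left"
    )
    model = AutoModelForCausalLM.from_pretrained(
        "databricks/dolly-v2-3b",
        device_map="auto",
        torch_dtype=torch.bfloat16
    ).eval()
    return model, tokenizer

# Callers share the same objects instead of reloading the 5+ GB weights
model, tokenizer = get_model_and_tokenizer()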

You load a small set of product reviews from the dataset (beauty_product_reviews) already in the project. Then, specify the target labels and their corresponding token ids using the tokenizer's encode() method. These are the main steps for preparing data and objects for the text generation pipeline.
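
These steps use the tokenizer loaded above and look like the following excerpt from the complete code:

DATASET_NAME = "beauty_product_reviews"
df = dataiku.Dataset(DATASET_NAME).get_dataframe()

# Target labels and the id of each label's first token
target_tokens = ['positive', 'neutral', 'negative']
target_token_ids = [tokenizer.encode(k)[0] for k in target_tokens]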

Once these assets are loaded, you can set BATCH_SIZE to a constant, in this case 8. This is the number of reviews sent through the model at a time during inference. The batch size mainly affects speed and memory use: a larger batch is usually faster on a GPU, but it requires more memory and may not fit on smaller devices.
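
In the complete code, the batch size and target device are defined up front:

BATCH_SIZE = 8
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")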

This tutorial compares two text classification techniques. Zero-shot learning leverages a pre-trained model's understanding of word and concept relationships without further training. Few-shot learning, by contrast, supplies the model with a handful of labeled examples directly in the prompt, still without any retraining. You'll walk through a simple implementation of each to see that even a few examples are better than none. In practice, you might choose one or the other depending on the use case.

Zero-shot learning for text classification#

First, perform zero-shot learning on the dataset using the model. Define a prompt asking the model to decide whether a given product review is positive, neutral, or negative, without providing any examples. Using the pre-trained model and tokenizer, the code iterates over the dataset in batches, tokenizing the prompts and computing the probability of each label as the next token. The predicted sentiment (positive vs. neutral vs. negative) is then concatenated to the original dataset, along with the probabilities for each sentiment. With that, you have used a pre-trained language model for zero-shot learning on a dataset.
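
The zero-shot prompt is a simple template with a placeholder for each review (taken from the complete code):

# Build the zero-shot prompt
prompt0 = "Decide whether the following product review's sentiment is positive, neutral, or negative.\n\nProduct review:\n{}\nSentiment:"

The batched loop below fills in this template for each review and computes the probability of each label token: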

for i in range(0, len(df), BATCH_SIZE):
    print(i)
    # Instantiate the prompts
    prompts = [prompt0.format(txt) for txt in df["text"][i:i+BATCH_SIZE]]
    
    # Tokenize the prompts and compute the next token probabilities with the model
    input_ids = tokenizer(prompts, return_tensors="pt", padding=True).input_ids
    with torch.no_grad():
        outputs = model(input_ids.to(DEVICE))
    result = torch.nn.Softmax(dim=-1)(outputs.logits[:, -1, target_token_ids])
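
In the complete code, the per-batch probabilities are accumulated into results0. They are then turned into predictions and joined back onto the original dataframe:

predicted_token_ids = torch.argmax(results0, axis=1)
predictions0 = [target_tokens[i] for i in predicted_token_ids]

scores0_df = pd.DataFrame(
    results0.float().cpu().numpy(),
    columns=[f"proba_{k}" for k in target_tokens]
)

df_zeroshot = pd.concat([df, pd.Series(predictions0, name='prediction'), scores0_df], axis=1)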

Few-shot learning for text classification#

For few-shot learning, start by defining a prompt that asks the model to decide whether a given product review is positive, neutral, or negative. This time, though, you’ll include a few examples with it. Here is how the examples look based on the code snippet:

“I love my new chess board!” is positive.

“Not what I expected but I guess it’ll do” is neutral.

“I’m so disappointed; the product seemed much better on the website” is negative.


# Build the prompt with examples
prompt = "Decide whether the following product reviews' sentiment is positive, neutral, or negative."
examples = [
    (
        "I love my new chess board!",
        "positive"
    ),
    (
        "Not what I expected but I guess it'll do",
        "neutral"
    ),
    (
        "I'm so disappointed. The product seemed much better on the website",
        "negative"
    )
]

# Append each example, then leave a final slot for the review to classify
for example in examples:
    prompt += f"\n\nProduct review:\n{example[0]}\nSentiment:\n{example[1]}"
prompt += "\n\nProduct review:\n{}\nSentiment:\n"

Again, the predicted sentiment and the probabilities for each label (positive vs. neutral vs. negative) are concatenated to the original dataset; the batched loop is the same as in the zero-shot case, only with the new prompt. You have now used the same pre-trained language model for few-shot learning as well!
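
In the complete code, the few-shot predictions and probabilities are assembled just like the zero-shot ones:

predicted_token_ids = torch.argmax(results, axis=1)
predictions = [target_tokens[i] for i in predicted_token_ids]
scores_df = pd.DataFrame(
    results.float().cpu().numpy(),
    columns=[f"proba_{k}" for k in target_tokens]
)

df_fewshot = pd.concat([df, pd.Series(predictions, name='prediction'), scores_df], axis=1)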

Conclusion#

The model metrics (accuracy, F1-score) demonstrate how few-shot learning can improve a classification task's performance. By defining a few-shot prompt with examples and computing the next token probabilities with a pre-trained language model, you can perform classification on a dataset with very little labeled data. When evaluated against the ground truth, few-shot learning will typically achieve noticeably higher accuracy and F1 scores than zero-shot learning, demonstrating the gains from adding examples to the prompt.

from sklearn.metrics import accuracy_score, f1_score

acc_zeroshot = accuracy_score(df_zeroshot["sentiment"], df_zeroshot["prediction"])
acc_fewshot = accuracy_score(df_fewshot["sentiment"], df_fewshot["prediction"])
f1_zeroshot = f1_score(df_zeroshot["sentiment"], df_zeroshot["prediction"], average="weighted")
f1_fewshot = f1_score(df_fewshot["sentiment"], df_fewshot["prediction"], average="weighted")

print("Zero-shot accuracy:", acc_zeroshot)
print("Few-shot accuracy:", acc_fewshot)
print("Zero-shot F1 score:", f1_zeroshot)
print("Few-shot F1 score:", f1_fewshot)

You can find two more end-to-end examples using both methods with the OpenAI API instead (GPT-based zero-shot text classification with the OpenAI API and GPT-based few-shot classification with the OpenAI API).

Complete code#

Python recipe for data prep
import dataiku

import requests
import gzip
import json
import csv
import random

# URL & filenames to download & create
URL = 'http://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Luxury_Beauty_5.json.gz'
FILE_NAME = 'Luxury_Beauty_5.json.gz'
FILE_UNZIP = 'Luxury_Beauty_5.json'
PROD_CATEGORY = "Luxury Beauty"
SAMPLE_SIZE = 256  # with a GPU provisioned
SAMPLE_SIZE = 32   # without a GPU; keep only the assignment that applies
DATASET_NAME = "beauty_product_reviews"

response = requests.get(URL)

with open(FILE_NAME, 'wb') as f:
    f.write(response.content)

# Unzip the archive
with gzip.open(FILE_NAME, 'rb') as gz_file:
     with open(FILE_UNZIP, "wb") as f_out:
        f_out.write(gz_file.read())

with open(FILE_UNZIP, "r", encoding="utf-8") as f:
    data = []
    for line in f:
        record = json.loads(line)
        text = record.get("reviewText", "")
        category = PROD_CATEGORY
        sentiment = record.get("overall", "")
        if sentiment in [1, 2]:
            sentiment = "negative"
        elif sentiment == 3:
            sentiment = "neutral"
        elif sentiment in [4, 5]:
            sentiment = "positive"
        data.append({"text": text, "product_category": category, "sentiment": sentiment})
        


# Get a random sample of SAMPLE_SIZE records
sample_data = random.sample(data, SAMPLE_SIZE)

# Get the dataset object
dataset = dataiku.Dataset(DATASET_NAME)

# Define the schema for the dataset
schema = [{"name": "text", "type": "string"},
          {"name": "product_category", "type": "string"},
          {"name": "sentiment", "type": "string"}]

# Write the schema to the dataset
dataset.write_schema(schema)

# Write the rows to the dataset
with dataset.get_writer() as writer:
    for row in sample_data:
        writer.write_row_dict(row)
Python recipe or notebook for performance comparison
import dataiku
import pandas as pd

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
print("Running script ...\n\n\n")
BATCH_SIZE = 8
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"DEVICE: {DEVICE}, processing batches of {BATCH_SIZE}\n\n")
print("Loading model ..\n\n" )

tokenizer = AutoTokenizer.from_pretrained(
    "databricks/dolly-v2-3b",
    padding_side="left"
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    device_map="auto",
    torch_dtype=torch.bfloat16
).eval()

DATASET_NAME = "beauty_product_reviews"
df = dataiku.Dataset(DATASET_NAME).get_dataframe()

print(df.info())

## Zero-shot
print("\n\nZero-shot \n")

# Build the zero-shot prompt
prompt0 = "Decide whether the following product review's sentiment is positive, neutral, or negative.\n\nProduct review:\n{}\nSentiment:"

target_tokens = ['positive', 'neutral', 'negative']
target_token_ids = [tokenizer.encode(k)[0] for k in target_tokens]

# target_tokens, target_token_ids

results0 = None

for i in range(0, len(df), BATCH_SIZE):
    print(i)
    # Instantiate the prompts
    prompts = [prompt0.format(txt) for txt in df["text"][i:i+BATCH_SIZE]]
    
    # Tokenize the prompts and compute the next token probabilities with the model
    input_ids = tokenizer(prompts, return_tensors="pt", padding=True).input_ids
    with torch.no_grad():
        outputs = model(input_ids.to(DEVICE))
    result = torch.nn.Softmax(dim=-1)(outputs.logits[:, -1, target_token_ids])
    
    if results0 is None:
        results0 = result
    else:
        results0 = torch.cat((results0, result), axis=0)

predicted_token_ids = torch.argmax(results0, axis=1)
predictions0 = [target_tokens[i] for i in predicted_token_ids]

scores0_df = pd.DataFrame(
    results0.float().cpu().numpy(),
    columns=[f"proba_{k}" for k in target_tokens]
)

df_zeroshot = pd.concat([df, pd.Series(predictions0, name='prediction'), scores0_df], axis=1)

## Few-shot
print("\n\nFew-shot \n")

# Build the prompt with examples
prompt = "Decide whether the following product reviews' sentiment is positive, neutral, or negative."
examples = [
    (
        "I love my new chess board!",
        "positive"
    ),
    (
        "Not what I expected but I guess it'll do",
        "neutral"
    ),
    (
        "I'm so disappointed. The product seemed much better on the website",
        "negative"
    )
]
for example in examples:
    prompt += f"\n\nProduct review:\n{example[0]}\nSentiment:\n{example[1]}"
prompt += "\n\nProduct review:\n{}\nSentiment:\n"


results = None

for i in range(0, len(df), BATCH_SIZE):
    print(i)
    # Instantiate the prompts
    prompts = [prompt.format(txt) for txt in df["text"][i:i+BATCH_SIZE]]
    # Tokenize the prompts and compute the next token probabilities with the model
    input_ids = tokenizer(prompts, return_tensors="pt", padding=True).input_ids
    with torch.no_grad():
        outputs = model(input_ids.to(DEVICE))
    result = torch.nn.Softmax(dim=-1)(outputs.logits[:, -1, target_token_ids])
    if results is None:
        results = result
    else:
        results = torch.cat((results, result), axis=0)

predicted_token_ids = torch.argmax(results, axis=1)
predictions = [target_tokens[i] for i in predicted_token_ids]
scores_df = pd.DataFrame(
    results.float().cpu().numpy(),
    columns=[f"proba_{k}" for k in target_tokens]
)

df_fewshot = pd.concat([df, pd.Series(predictions, name='prediction'), scores_df], axis=1)

from sklearn.metrics import accuracy_score
acc_zeroshot = accuracy_score(df_zeroshot["sentiment"], df_zeroshot["prediction"])
acc_fewshot = accuracy_score(df_fewshot["sentiment"], df_fewshot["prediction"])
acc_zeroshot, acc_fewshot
print("Classification completed! And now calculating the results you've been waiting for ..")

print("ALL DONE\n\n")

from sklearn.metrics import f1_score
f1_zeroshot = f1_score(df_zeroshot["sentiment"], df_zeroshot["prediction"], average="weighted")
f1_fewshot = f1_score(df_fewshot["sentiment"], df_fewshot["prediction"], average="weighted")
print("Zero-shot accuracy:", acc_zeroshot)
print("Few-shot accuracy:", acc_fewshot)
print("Zero-shot F1 score:", f1_zeroshot)
print("Few-shot F1 score:", f1_fewshot)