Advanced model alignment: RLHF, RLAIF, and RLVR in Dataiku#

The process of creating Large Language Models (LLMs) involves several distinct stages. While the pre-training phase creates a model that can predict the next token, it is the alignment phase that truly refines how it generates tokens to better suit its role.

While Supervised Fine-Tuning (SFT) is usually used to teach a model to follow instructions, alignment ensures the model’s behavior matches human preferences, follows specific reasoning paths, or adheres to objective truth. These techniques transition a general next-token predictor into a helpful assistant.

In this tutorial, we will explore advanced model alignment in Dataiku, with three primary methods used in the industry today: Reinforcement Learning from Human Feedback (RLHF), AI Feedback (RLAIF), and Verifiable Rewards (RLVR).

Prerequisites#

To follow along with the code examples in this tutorial, you will need:

Dataiku >= 14.1
Python >= 3.10
A code environment with containerized GPU support (container runtime addition GPU support for Torch 2) and the following packages:
```
torch
transformers
datasets
trl
peft
bitsandbytes
accelerate
```
An LLM Mesh connection to HuggingFace.

Reinforcement Learning from Human Feedback (RLHF)#

RLHF is the industry standard for aligning models to human preferences. It uses preference-based optimization to train a model on what a human considers a “good” versus “bad” response.

Dataiku provides a complete environment for RLHF. For data collection, users can leverage AgentHub and native labeling features to create preference datasets directly from human feedback.

Typically, this data is formatted into three columns:

Prompt: The initial instruction or question.
Chosen: The response preferred by the human labeler.
Rejected: The less favorable response.

RLHF in Dataiku#

Once the data is collected, you can align your model using the trl (Transformer Reinforcement Learning) library and the Dataiku API. Using the Direct Preference Optimization (DPO) algorithm is often preferred over traditional Proximal Policy Optimization (PPO) as it is more stable and computationally efficient. The trl provides the DPOTrainer for that.

Note

For a complete code example of implementing DPO in Dataiku, please refer to our dedicated Fine-Tuning Code Samples section in the Developer Guide.

Once fine-tuned, use the create_finetuned_llm_version() method with your HuggingFace connection name as input. This saves the aligned model as a Fine-tuned LLM Saved Model version, inheriting from the connection’s configuration.

Reinforcement Learning from AI Feedback (RLAIF)#

While RLHF is highly effective, gathering human preference data is time-consuming and expensive. RLAIF solves this bottleneck by leveraging a “judge” LLM to generate synthetic feedback, drastically reducing the need for manual human labeling.

Orchestrating RLAIF in Dataiku#

Dataiku enables RLAIF by utilizing the LLM Mesh to orchestrate critique and revision workflows.

Synthetic Generation: Use a Visual Prompt recipe to generate multiple candidate responses from your target model (here, it is the `` model from the HuggingFace connection).
AI Judging: Use a more powerful, aligned model (e.g., GPT-5 or Claude 4 Sonnet) via the LLM Mesh to evaluate and rank the candidates based on specific criteria (e.g., helpfulness, lack of toxicity, etc).
Alignment: The resulting dataset is then used to align the target model via DPO, exactly as you would with RLHF.

Reinforcement Learning with Verifiable Rewards (RLVR)#

For complex reasoning tasks, such as solving mathematical equations, writing functional code, or applying strict formatting rules, preference ranking is insufficient. These tasks require objective verification.

The RLVR method allows users to define strict, hard-coded reward functions to score a model, making it the right alignment method for complex & verifiable tasks. Instead of a human or an AI assessing if a response is good, a rule-based logic system (like a code compiler or a math verifier) objectively scores the output.

Recommended Dataset: GSM8k#

To test RLVR, we recommend the openai/gsm8k dataset, a standard benchmark for mathematical reasoning available on the HuggingFace Hub. It contains thousands of grade-school math problems.

If you look at the raw GSM8k dataset, the answer column contains both the step-by-step reasoning and the final numerical answer, separated by #### (e.g., ...Natalia sold 48+24 = 72 clips. #### 72).

Counterintuitively, for RLVR, we actually want to ignore the human reasoning text.

In traditional Supervised Fine-Tuning (SFT), you force the model to mimic the exact reasoning steps provided in the dataset. You teach the model what to think.
In RLVR, you only reward the final, verifiable outcome. You let the model figure out how to think.

By only rewarding the final correct answer (the number after the ####), the model is free to explore different logical paths during training. To ensure the model still takes time to reason, we apply a secondary reward function that strictly enforces formatting, requiring the model to use <think>...</think> tags before outputting its final <answer>...</answer>.

Assuming you have synced the raw GSM8k data into a Dataiku dataset called gsm8k_raw, here is the Python snippet to automatically format the data (strip out the human reasoning, inject system instructions) and split it. We will create a Python recipe that outputs three new Dataiku datasets: gsm8k_train (85%), gsm8k_val (10%), and gsm8k_test (5%).

Code 1: data_formatting.py#

import dataiku
from datasets import Dataset
import pandas as pd

# 1. Load raw GSM8k data from Dataiku
raw_df = dataiku.Dataset("gsm8k_raw").get_dataframe()
raw_dataset = Dataset.from_pandas(raw_df)

SYSTEM_PROMPT = """
Respond in the following format:
<think>
...
</think>
<answer>
...
</answer>
"""

def prepare_gsm8k(example):
    """
    Extracts the final numeric answer (ignoring human reasoning), 
    removes commas for clean string matching, and injects the system prompt.
    """
    # Extract the number and remove any commas (e.g., "36,000" -> "36000")
    raw_gt = str(example["answer"].split("####")[-1].strip())

    # Removing commas from some groundtruth values, because an LLM would answer without comma and be penalized in the reward function.
    clean_gt = raw_gt.replace(",", "")  
    
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT}, 
        {"role": "user", "content": example["question"]}
    ]
    return {"prompt": prompt, "ground_truth": clean_gt}

# 2. Apply the formatting to the dataset
prepared_dataset = raw_dataset.map(prepare_gsm8k)

# 3. Split the data into Train (85%), Validation (10%), and Test (5%)
# First split: 85% train, 15% temp
split_1 = prepared_dataset.train_test_split(test_size=0.15, seed=42)
train_dataset = split_1['train']

# Second split: Divide the 15% temp into 2/3 (10% overall) and 1/3 (5% overall)
split_2 = split_1['test'].train_test_split(test_size=0.33, seed=42)
val_dataset = split_2['train']
test_dataset = split_2['test']

# 4. Write back to Dataiku Datasets
dataiku.Dataset("gsm8k_train").write_with_schema(pd.DataFrame(train_dataset))
dataiku.Dataset("gsm8k_val").write_with_schema(pd.DataFrame(val_dataset))
dataiku.Dataset("gsm8k_test").write_with_schema(pd.DataFrame(test_dataset))

After this transformation, your train_dataset holds the exact structure required by the TRL library for verifiable reasoning:

prompt	ground_truth
[{“role”: “system”, “content”: “Respond in the following format…”}, {“role”: “user”, “content”: “Natalia sold clips to 48 of her friends…”}]	72

Implementing RLVR with GRPO in Dataiku#

To train models on complex reasoning chains, we can use the Group Relative Policy Optimization (GRPO) algorithm via the trl library and Dataiku API. This algorithm was introduced in 2024 (Shao et al., 2024), and later popularized by the DeepSeek-R1 reasoning models (DeepSeek-AI, Guo et al., 2025).

Below is an example of how to implement an RLVR workflow in a Dataiku Python recipe. The inputs are gsm8k_train and gsm8k_val, we pass both datasets to the trainer so we can track evaluation metrics during the run. The output is a Fine-tuned LLM Saved Model named rlvr_aligned_model.

We’ll break this down into three steps: preparing the model, defining the rewards, and executing the training loop. A fourth section simply explains how to visually test the alignment using a Prompt recipe.

1. Model & tokenizer preparation#

First, we load our base model. Here, we are using Qwen/Qwen2.5-1.5B-Instruct, a small model, to illustrate how it works. We are also using 4-bit quantization to fit comfortably on a single GPU.

Code 2: model_preparation.py#

import dataiku
from datasets import Dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
connection_name = "a_huggingface_connection_name"

# Load Datasets from the Flow
train_dataset = Dataset.from_pandas(dataiku.Dataset("gsm8k_train").get_dataframe())
val_dataset = Dataset.from_pandas(dataiku.Dataset("gsm8k_val").get_dataframe())

# The output Saved Model
saved_model = dataiku.Model("my_model_id")

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    use_cache=False 
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

2. Defining verifiable rewards#

Next, we define our rule-based logic. We create two functions: one that checks if the extracted answer exactly matches our ground_truth, and another that checks if the model successfully used the XML tags.

Code 3: verifiable_rewards.py#

import re

def extract_xml_answer(text: str) -> str:
    """Helper to extract the answer from XML tags"""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def correctness_reward_func(prompts, completions, ground_truth, **kwargs) -> list[float]:
    """Awards 2.0 points if the generated answer exactly matches the ground truth."""
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if str(r) == str(gt) else 0.0 for r, gt in zip(extracted_responses, ground_truth)]

def format_reward_func(completions, **kwargs) -> list[float]:
    """Awards 1.0 point if the model strictly follows the <think> and <answer> XML format."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]['content'] for completion in completions]
    return [1.0 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]

3. Fine-tuning the model#

Finally, we pass our dataset, model, and reward functions to the GRPOTrainer. Using the Dataiku API, we ensure the newly aligned model version is seamlessly saved back to the project, as a new Saved Model version.

Code 4: model_finetuning.py#

from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig

# Create a fine-tuned LLM version using the Dataiku API
with saved_model.create_finetuned_llm_version(connection_name) as finetuned_llm_version:

    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    
    # Define GRPO training parameters
    training_args = GRPOConfig(
        per_device_train_batch_size=4,
        num_train_epochs=1,
        evaluation_strategy="steps", # Evaluate during training
        eval_steps=50,
        output_dir=finetuned_llm_version.working_directory,
        gradient_checkpointing=True
    )

    grpo_trainer = GRPOTrainer(
        model=model,
        reward_funcs=[correctness_reward_func, format_reward_func], 
        peft_config=peft_config,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset, # Passing validation dataset
        processing_class=tokenizer,
    )

    # Fine-tune the model using verifiable rewards
    grpo_trainer.train()
    
    # Save the model and log metadata back to Dataiku
    grpo_trainer.save_model()
    config = finetuned_llm_version.config
    config["batchSize"] = grpo_trainer.state.train_batch_size
    config["eventLog"] = grpo_trainer.state.log_history

By completing the above steps, you apply the GRPO algorithm to train models on complex reasoning chains.

4: Visual evaluation on the test set#

Once your model finishes training and is registered as a fine-tuned model in the LLM Mesh, you can easily evaluate its new reasoning capabilities! This can done in code, but using Prompt recipes works just as well.

Because we safely held out gsm8k_test from the training process, we can use it to objectively compare performance:

Create a Prompt recipe using your base model (e.g., Qwen 2.5 1.5B) on the gsm8k_test dataset.
Create a second Prompt recipe using your new rlvr_aligned_model on the same gsm8k_test dataset.

By comparing the outputs side-by-side, you should clearly see how the aligned model now actively “thinks” through the math problems step-by-step and strictly outputs its final answer in the correct XML format, vastly improving its accuracy.