Using Spark RAPIDS on Dataiku#

Overview#

This guide provides step-by-step instructions for setting up and configuring the RAPIDS Accelerator for Apache Spark (i.e., Spark RAPIDS) with Dataiku on an Amazon EKS cluster with GPU support. The setup enables GPU acceleration for Spark workloads, significantly improving performance for compatible operations.

Note

This guide assumes familiarity with Dataiku administration, Docker, and basic Kubernetes concepts.

You’ll need administrative access to your Dataiku instance and AWS environment.

Prerequisites#

  • Access to AWS with permissions to create and manage EKS clusters

  • Dataiku DSS instance with administrative access

  • Docker installed and configured

  • SSH access to your Dataiku instance

EKS Cluster Setup#

Requirements#

  • EKS cluster with NVIDIA GPU support

  • Example instance types:

    • P4d EC2 instances

    • G4 EC2 instances

Setup Instructions#

  1. Follow the Dataiku EKS Cluster Setup Guide for initial cluster creation.

  2. Ensure NVIDIA GPU support is properly configured in your cluster.
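
A quick way to confirm this, assuming the NVIDIA device plugin is deployed on the cluster, is to check that the GPU nodes advertise the nvidia.com/gpu resource:

# Each GPU node should report a non-zero nvidia.com/gpu count under
# Capacity and Allocatable once the NVIDIA device plugin is running.
kubectl describe nodes | grep -i "nvidia.com/gpu"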

Building a Custom Spark Docker Image with RAPIDS#

Accessing the Dataiku Instance#

  1. SSH into your Dataiku instance using the Terminal:

    ssh <user>@<your-instance-address>
    

    Note

    If wget is not already installed on the instance, install it with sudo yum install wget before proceeding to the next step.

  2. Switch to the dataiku user and change to the Dataiku user home directory (/data/dataiku):

    sudo -su dataiku
    cd /data/dataiku
    

Downloading Required Files#

  1. Download the RAPIDS jar file:

    wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.1/rapids-4-spark_2.12-24.06.1.jar
    

    Note

    The link for the latest RAPIDS jar can be found at Download | spark-rapids

  2. Download the GPU discovery script:

    wget -O getGpusResources.sh https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
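
Optionally, verify the RAPIDS jar from step 1 against the checksum that Maven Central publishes alongside each artifact (the .sha1 URL is simply the jar URL with .sha1 appended):

wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.1/rapids-4-spark_2.12-24.06.1.jar.sha1
sha1sum rapids-4-spark_2.12-24.06.1.jar    # should print the same hash as the .sha1 file
cat rapids-4-spark_2.12-24.06.1.jar.sha1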
    

File Configuration#

  1. Copy the RAPIDS jar to the Dataiku Spark directory:

    cp rapids-4-spark_2.12-24.06.1.jar /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/jars/
    
  2. Copy the GPU discovery script to the Dataiku Spark directory:

    cp getGpusResources.sh /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/
    
  3. Make getGpusResources.sh executable:

    chmod +x /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/getGpusResources.sh
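
As a quick sanity check (the paths assume the same DSS version placeholder used above), confirm that both files are in place and that the discovery script is executable:

ls -l /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/jars/rapids-4-spark_2.12-24.06.1.jar
ls -l /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/getGpusResources.sh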
    

Building the Docker Image#

  1. Build the custom Spark image with CUDA support:

    dss_data/bin/dssadmin build-base-image \
      --type spark \
      --without-r \
      --build-from-image nvidia/cuda:12.6.2-devel-rockylinux8 \
      --tag <add_a_unique_name>
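
Once the build completes, you can check locally that an image carrying your tag exists (the tag below is whatever you passed to --tag):

docker images | grep <add_a_unique_name>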
    

Dataiku Spark Configuration#

Creating a Custom Configuration#

  1. Navigate to the Administration panel in Dataiku.

  2. Select Settings > Spark configurations.

  3. Click the + ADD ANOTHER CONFIG button.

Configuration Settings#

Configure the following settings:

Setting                       Value
Config name                   spark-rapids
Config keys                   Default spark configs on the container
Managed Spark-on-K8S          Enabled
Image registry URL            <your-registry>/spark-rapids
Image pre-push hook           Enable push to ECR
Kubernetes namespace          default
Authentication mode           Create service accounts dynamically
Base image                    <your image tag from the Building the Docker Image step>
Managed cloud credentials     Enabled
Provide AWS credential        From any AWS connection

Additional Parameters#

Spark configuration parameters such as the following can be added as config keys:

spark.rapids.sql.enabled=true
spark.rapids.sql.explain=NOT_ON_GPU
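
Beyond these two keys, GPU scheduling itself is driven by Spark's standard resource settings. The sketch below is a typical minimal set, assuming one GPU per executor; the discovery script path is a placeholder and must point to wherever getGpusResources.sh ends up inside your container image:

spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled=true
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=1
spark.executor.resource.gpu.discoveryScript=<path to getGpusResources.sh inside the image>

Exact values depend on your executor sizing; see the spark-rapids configuration page referenced in the note below for tuning details.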

Note

Additional properties can be found at Configuration | spark-rapids.

Pushing the Base Image#

  1. Click the PUSH BASE IMAGES button to push the configured base image to your registry.

Congratulations!#

You are now set up to begin using the Spark RAPIDS container.

Best Practices#

  • Regularly update the RAPIDS and Delta Lake (delta-core) jars to maintain compatibility

  • Monitor GPU utilization using NVIDIA tools (see the example after this list)

  • Back up configurations before making changes

  • Test the setup with a sample workload before production use
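
For the GPU-utilization point above, one simple approach, assuming the NVIDIA runtime exposes nvidia-smi inside the Spark executor pods, is to run it through kubectl (the pod name is a placeholder):

kubectl get pods                                   # find a running Spark executor pod
kubectl exec -it <spark-executor-pod> -- nvidia-smi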

Known Issues and Troubleshooting#

Missing GPU Resources Script#

Issue: getGpusResources.sh file not found.

Resolution:

wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
mv getGpusResources.sh /opt/dataiku/spark/conf/
chmod +x /opt/dataiku/spark/conf/getGpusResources.sh

Delta Core ClassNotFoundException#

Issue: Delta Core class not found during execution.

Resolution:

  1. Download the correct Delta Core JAR:

    wget https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar
    
  2. Move it to the Spark jars directory:

    mv delta-core_2.12-2.4.0.jar /opt/dataiku/spark/jars/
    

Shim Loader Exception#

Issue: Version mismatch in Shim Loader - java.lang.IllegalArgumentException: Could not find Spark Shim Loader for 3.4.1

Resolution: Ensure that the RAPIDS jar you deployed supports the Spark version in use; each RAPIDS release documents the Spark versions it ships shims for.
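
A rough way to see which Spark versions a given RAPIDS jar ships shims for (the file name assumes the jar downloaded earlier) is to list the per-version shim packages it contains:

# shim packages show up as spark340, spark341, etc. in the jar's entry paths
unzip -l rapids-4-spark_2.12-24.06.1.jar | grep -o 'spark[0-9]\{3\}' | sort -u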

Important

Always verify version compatibility between Spark, Rapids, and Delta Core components before deployment.

Spark Shuffle Manager#

Issue:

RapidsShuffleManager class mismatch (com.nvidia.spark.rapids.spark340.RapidsShuffleManager !=
com.nvidia.spark.rapids.spark341.RapidsShuffleManager). Check that configuration
setting spark.shuffle.manager is correct for the Spark version being used.

Resolution: Ensure that the correct shuffle manager class is configured for the Spark version you are using. The version-to-class mappings are listed in the RAPIDS Shuffle Manager documentation.
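
For example (the class name below is for Spark 3.4.1 and must match your exact Spark version), the setting takes the form:

spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager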

Additional Resources#