Using Spark RAPIDS on Dataiku#
Overview#
This guide provides step-by-step instructions for setting up and configuring the RAPIDS Accelerator for Apache Spark (i.e., Spark RAPIDS) with Dataiku on an Amazon EKS cluster with GPU support. The setup enables GPU acceleration for Spark workloads, significantly improving performance for compatible operations.
Note
This guide assumes familiarity with Dataiku administration, Docker, and basic Kubernetes concepts.
You’ll need administrative access to your Dataiku instance and AWS environment.
Prerequisites#
Access to AWS with permissions to create and manage EKS clusters
Dataiku DSS instance with administrative access
Docker installed and configured
SSH access to your Dataiku instance
EKS Cluster Setup#
Requirements#
EKS cluster with NVIDIA GPU support
Example instance types:
P4d EC2 instances
G4 EC2 instances
Setup Instructions#
Follow the Dataiku EKS Cluster Setup Guide for initial cluster creation.
Ensure NVIDIA GPU support is properly configured in your cluster.
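As a quick sanity check that GPU support is visible to Kubernetes, a sketch like the following can be run against the cluster (it assumes kubectl is pointed at the EKS cluster and that the NVIDIA device plugin is what advertises GPUs):

# Confirm the NVIDIA device plugin is running
kubectl get daemonsets -A | grep -i nvidia

# Confirm nodes advertise allocatable nvidia.com/gpu resources
kubectl describe nodes | grep -i "nvidia.com/gpu"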
Building a Custom Spark Docker Image with RAPIDS#
Accessing the Dataiku Instance#
SSH into your Dataiku instance using the Terminal:
ssh <user>@<your-instance-address>
Note

If you do not already have wget installed, install it with sudo yum install wget before proceeding to the next step.

Sudo to the dataiku user and switch to the Dataiku user home directory (/data/dataiku):

sudo -su dataiku
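Because sudo -s keeps the caller's working directory, the new shell may not start in the home directory mentioned above; if so, change into it explicitly:

cd /data/dataiku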
Downloading Required Files#
Download the RAPIDS jar file:
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.1/rapids-4-spark_2.12-24.06.1.jar
Note
The link for the latest RAPIDS jar can be found at Download | spark-rapids
Download the GPU discovery script:
wget -O getGpusResources.sh https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
File Configuration#
Copy the RAPIDS JAR to the Dataiku Spark jars directory:
cp rapids-4-spark_2.12-24.06.1.jar /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/jars/
Copy the GPU discovery script to the Dataiku Spark bin directory:
cp getGpusResources.sh /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/
Make getGpusResources.sh executable:

chmod +x /opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/getGpusResources.sh
Building the Docker Image#
Build the custom Spark image with CUDA support:
dss_data/bin/dssadmin build-base-image \
    --type spark \
    --without-r \
    --build-from-image nvidia/cuda:12.6.2-devel-rockylinux8 \
    --tag <add_a_unique_name>
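Once the build completes, you can optionally confirm that the image exists in the local Docker image list (the tag is whatever unique name you supplied above):

docker images | grep <add_a_unique_name>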
Dataiku Spark Configuration#
Creating a Custom Configuration#
Navigate to the Administration panel in Dataiku.
Select Settings > Spark configurations.
Click the + ADD ANOTHER CONFIG button.
Configuration Settings#
Configure the following settings:
| Setting | Value |
|---|---|
| Config name | |
| Config keys | |
| Managed Spark-on-K8S | |
| Image registry URL | |
| Image pre-push hook | |
| Kubernetes namespace | |
| Authentication mode | |
| Base image | |
| Managed cloud credentials | |
| Provide AWS credential | |
Additional Parameters#
Spark configuration parameters such as the following can be used:
spark.rapids.sql.enabled=true
spark.rapids.sql.explain=NOT_ON_GPU
Note
Additional properties can be found at Configuration | spark-rapids.
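As a fuller illustration, a configuration along these lines could be entered under Config keys. The property names are standard Spark and RAPIDS settings, but the GPU amounts and the discovery-script path shown here are assumptions that depend on how your executor image and cluster are set up:

spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled=true
spark.rapids.sql.explain=NOT_ON_GPU
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=0.25
spark.executor.resource.gpu.discoveryScript=/opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/getGpusResources.sh

Here spark.executor.resource.gpu.amount requests one GPU per executor, while the fractional spark.task.resource.gpu.amount lets several tasks share that GPU; adjust both to match your instance types and workload.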
Pushing the Base Image#
Click the PUSH BASE IMAGES button
Congratulations!#
You are now set up to begin using the Spark RAPIDS container.
Best Practices#
Regularly update the RAPIDS JAR and Delta Core to maintain compatibility
Monitor GPU utilization using NVIDIA tools (see the sketch after this list)
Back up configurations before making changes
Test the setup with a sample workload before production use
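A minimal sketch for spot-checking GPU utilization from the command line, assuming kubectl access to the namespace Dataiku uses for Spark-on-K8S (the namespace and pod name below are placeholders):

# Find running Spark executor pods
kubectl get pods -n <your-spark-namespace> -l spark-role=executor

# Run nvidia-smi inside an executor pod to check GPU utilization
kubectl exec -n <your-spark-namespace> <executor-pod-name> -- nvidia-smi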
Known Issues and Troubleshooting#
Missing GPU Resources Script#
Issue: getGpusResources.sh file not found.
Resolution:
wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
mv getGpusResources.sh /opt/dataiku/spark/conf/
chmod +x /opt/dataiku/spark/conf/getGpusResources.sh
Delta Core ClassNotFoundException#
Issue: Delta Core class not found during execution.
Resolution:
Download the correct Delta Core JAR:
wget https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar
Move it to the Spark jars directory:
mv delta-core_2.12-2.4.0.jar /opt/dataiku/spark/jars/
Shim Loader Exception#
Issue: Version mismatch in Shim Loader - java.lang.IllegalArgumentException: Could not find Spark Shim Loader for 3.4.1
Resolution: Ensure compatibility between the RAPIDS JAR and Spark versions.
Important
Always verify version compatibility between Spark, RAPIDS, and Delta Core components before deployment.
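One way to check which Spark versions a given RAPIDS JAR supports, assuming the multi-shim JAR layout in which shim packages carry the Spark version in their name (e.g. spark341 for Spark 3.4.1), is to list the JAR contents and compare against the Spark version in use:

# List the version-suffixed shim packages bundled in the RAPIDS jar
unzip -l rapids-4-spark_2.12-24.06.1.jar | grep -o 'spark3[0-9][0-9]' | sort -u

# Check the Spark version Dataiku runs
/opt/dataiku-dss-<YOUR VERSION>/spark-standalone-home/bin/spark-submit --version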
Spark Shuffle Manager#
Issue: RapidsShuffleManager class mismatch (com.nvidia.spark.rapids.spark340.RapidsShuffleManager != com.nvidia.spark.rapids.spark341.RapidsShuffleManager). Check that configuration setting spark.shuffle.manager is correct for the Spark version being used.
Resolution: Ensure that the correct shuffle manager is configured for the Spark version you are using. Mappings can be found in the RAPIDS Shuffle Manager documentation.
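For example, on Spark 3.4.1 the shuffle manager class carries the matching version suffix, following the pattern visible in the error message above (verify the exact class for your Spark and RAPIDS versions against the RAPIDS Shuffle Manager documentation):

spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager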