Spark on Kubernetes Python and R bindings
Version 2.4 of Spark introduces Python and R bindings for Kubernetes, along with two new images:
- spark-py: The Spark image with Python bindings (including Python 2 and 3 executables)
- spark-r: The Spark image with R bindings (including the R executable)
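The spark-py:v2.4.0 image used further down has no registry prefix, which suggests a locally built image. Assuming a Spark 2.4 distribution is at hand, the bundled docker-image-tool.sh script can build it; the v2.4.0 tag is just an example chosen to match the spark-submit command below:
$ cd $SPARK_HOME
# Builds the spark, spark-py and spark-r images from the distribution's Dockerfiles
$ ./bin/docker-image-tool.sh -t v2.4.0 build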
Databricks has published an article dedicated to the Spark 2.4 features for Kubernetes.
It’s exactly the same principle as explained in my previous article, but this time we are using:
- A different image: spark-py
- Another example: local:///opt/spark/examples/src/main/python/pi.py, once again a Pi computation :-|
- A dedicated Spark namespace: spark.kubernetes.namespace=spark
The namespace must be created first:
$ k create namespace spark
namespace "spark" created
It isolates the Spark pods from the rest of the cluster and can later be used to cap the available resources.
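As a hypothetical example (the numbers are arbitrary), a quota could cap the CPU, memory and pod count of the namespace; Spark driver and executor pods declare resource requests, so they would count against it:
$ k create quota spark-quota --hard=cpu=4,memory=8Gi,pods=10 -n spark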
$ cd $SPARK_HOME
$ ./bin/spark-submit \
--master k8s://https://localhost:6443 \
--deploy-mode cluster \
--name spark-pi \
--conf spark.executor.instances=2 \
--conf spark.driver.memory=512m \
--conf spark.executor.memory=512m \
--conf spark.kubernetes.container.image=spark-py:v2.4.0 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.namespace=spark \
local:///opt/spark/examples/src/main/python/pi.py
spark.kubernetes.pyspark.pythonVersion is an additional (and optional) property that selects the major Python version to use (it’s 2 by default). From the Spark documentation: “This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3.”
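The Pi example prints its result on the driver’s stdout, so a quick way to check that the job ran is to read the driver logs. The pod name below is the one from the listing in the next section and will be different for every run:
# The approximated value shows up as a "Pi is roughly ..." line
$ k logs spark-pi-1545987715677-driver -n spark | grep "Pi is roughly"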
Labels
An interesting feature that has nothing to do with Python is that Spark defines labels that are applied to the pods. They make it easy to identify the role of each pod.
$ k get po -L spark-app-selector,spark-role -n spark
NAME READY STATUS RESTARTS AGE SPARK-APP-SELECTOR SPARK-ROLE
spark-pi-1545987715677-driver 1/1 Running 0 12s spark-c4e28a2ef3d14cfda16c007383318c79 driver
spark-pi-1545987715677-exec-1 0/1 ContainerCreating 0 1s spark-application-1545987726694 executor
spark-pi-1545987715677-exec-2 0/1 ContainerCreating 0 1s spark-application-1545987726694 executor
You can, for example, use the spark-role label to delete all the terminated driver pods.
# Can also decide to switch from the default to the spark namespace
# $ k config set-context $(kubectl config current-context) --namespace spark
$ k delete po -l spark-role=driver -n spark
# Or you can delete the whole namespace
$ k delete ns spark