Configure PySpark to connect to a Standalone Spark Cluster
In one of my previous articles, I talked about running a Standalone Spark Cluster inside Docker containers using docker-spark. I was using it with the R Sparklyr framework.
However, if you want to use it from a Python environment in interactive mode (for example in a Jupyter notebook, where the driver runs on the local machine while the workers run in the cluster), there are several steps to follow:
- You need to run the same Python version on the driver and on the workers.
- You need to configure PySpark to use the same Spark version; see how to use findspark in my article (a minimal reminder is sketched right after this list).
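As a quick reminder, a findspark setup could look like the sketch below. The Spark home path is only an example; point it at a local distribution that matches the Spark version running in the cluster.

```python
import findspark

# Point the driver at a local Spark distribution whose version matches
# the one running in the docker-spark cluster (example path, adjust it).
findspark.init("/opt/spark-2.4.0-bin-hadoop2.7")

import pyspark  # importable now that findspark has added it to sys.path
```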
Use the same Python version
The docker-spark containers run Python 3.5. If you want to interact with them from an external Jupyter notebook running on your machine, you have to run a kernel with the same Python version: for example, create a dedicated Python 3.5 environment (with conda or virtualenv) and register it as a Jupyter kernel with ipykernel.
You can now choose your “Python 3.5” Kernel to run PySpark.
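A quick sanity check from inside the notebook shows which interpreter the kernel is actually running:

```python
import sys

# The driver (the notebook kernel) must run the same minor version as the workers.
print(sys.executable)        # e.g. /Users/xxxx/anaconda3/envs/py35/bin/python
print(sys.version_info[:2])  # should print (3, 5) to match the docker-spark workers
```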
If you use another version, say 3.7, you will get an explicit error telling you that Python in the worker has a different version than in the driver and that PySpark cannot run with different minor versions.
There is another step to follow.
You have to specify the path of the Python executable to use on the workers. Without this setting, they will try to use the same path as the driver, something like /Users/xxxx/anaconda3/envs/py35/bin/python, which does not exist inside the containers, and you will see errors about this missing executable in the worker logs.
In the Dockerfile of docker-spark, we can see that the Python environment is available under /usr/bin/python.
So we have to specify it through the PYSPARK_PYTHON environment variable. See the Spark documentation for all the available environment variables.
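From the notebook, this can be done with os.environ, as long as it happens before the SparkContext is created:

```python
import os

# Must be set before the SparkContext / SparkSession is created, otherwise
# the workers fall back to the driver's interpreter path, which does not
# exist inside the containers.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"
```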
Putting it all together
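Below is a minimal sketch of what a notebook cell could look like once everything is wired up. The Spark home path and the master URL (spark://localhost:7077) are assumptions that depend on where you installed Spark locally and how you mapped the docker-spark master port.

```python
import os
import findspark

# Worker-side Python, as found in the docker-spark image.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"

# Local Spark distribution matching the cluster version (example path).
findspark.init("/opt/spark-2.4.0-bin-hadoop2.7")

from pyspark.sql import SparkSession

# The master URL assumes the docker-spark master port 7077 is mapped to localhost.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("pyspark-on-docker-spark")
    .getOrCreate()
)

# Quick smoke test: distribute a small computation to the workers.
print(spark.sparkContext.parallelize(range(10)).sum())  # 45
```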
That’s it!