This Git repository features use cases of good and bad practices when using Spark-based tools to process and analyze data.
Note that the `SPARK_REMOTE` environment variable should not be set at this stage, otherwise the start script will try to connect to a supposedly already running Spark Connect server and will therefore not start one:
```bash
$ sparkconnectstart
$ export SPARK_REMOTE="sc://localhost:15002"; pyspark
...
[C 2023-06-27 21:54:04.720 ServerApp]
    To access the server, open this file in a browser:
        file://$HOME/Library/Jupyter/runtime/jpserver-21219-open.html
    Or copy and paste one of these URLs:
        http://localhost:8889/lab?token=dd69151c26a3b91fabda4b2b7e9724d13b49561f2c00908d
        http://127.0.0.1:8889/lab?token=dd69151c26a3b91fabda4b2b7e9724d13b49561f2c00908d
...
$ open ~/Library/Jupyter/runtime/jpserver-*-open.html
```
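Once the server is up, the connection may also be established from a plain Python script or REPL rather than through the `pyspark` shell. The following is a minimal sketch, assuming the server listens on the default port 15002 as above:

```python
# Minimal sketch: connect to the Spark Connect server started above.
# Assumes PySpark >= 3.4 installed with the connect extra
# (pip install "pyspark[connect]") and a server on the default port 15002.
from pyspark.sql import SparkSession

# Either rely on the SPARK_REMOTE environment variable, or pass the
# remote URL explicitly, as below.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

print(spark.version)   # version of the remote Spark server
spark.range(5).show()  # trivial query executed through Spark Connect
```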
In JupyterLab, open and run the `ipython-notebooks/simple-connect.ipynb` notebook; it should display a Spark DataFrame like the following:
```
+-------+--------+-------+-------+
|User ID|Username|Browser|     OS|
+-------+--------+-------+-------+
|   1580|   Barry|FireFox|Windows|
|   5820|     Sam|MS Edge|  Linux|
|   2340|   Harry|Vivaldi|Windows|
|   7860|  Albert| Chrome|Windows|
|   1123|     May| Safari|  macOS|
+-------+--------+-------+-------+
```
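For reference, the core of that notebook boils down to something like the following sketch; the column names and values are taken from the output above, and the actual notebook code may differ:

```python
# Sketch: create a small DataFrame on the Spark Connect session and display it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

columns = ["User ID", "Username", "Browser", "OS"]
data = [
    (1580, "Barry", "FireFox", "Windows"),
    (5820, "Sam", "MS Edge", "Linux"),
    (2340, "Harry", "Vivaldi", "Windows"),
    (7860, "Albert", "Chrome", "Windows"),
    (1123, "May", "Safari", "macOS"),
]

df = spark.createDataFrame(data, schema=columns)
df.show()
```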
If the notebook fails to connect to the Spark Connect server, it is usually because the `SPARK_REMOTE` environment variable has not been set properly.
There is a try-catch clause in the first cell, as once the Spark session has been started through Spark Connect, it cannot be stopped that way; the first cell may thus be re-executed at will with no further side effect on the Spark session.
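Such a guard may look like the following minimal sketch; the actual notebook code may differ, and the `sc://localhost:15002` URL is simply the default used above:

```python
# Sketch of such a guard: try to stop any previously created session before
# (re-)creating one through Spark Connect; if stopping is not possible
# (e.g. the session was created through Spark Connect), just carry on.
from pyspark.sql import SparkSession

try:
    spark.stop()
except Exception:
    # `spark` may not exist yet (first run), or may not be stoppable this way
    pass

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```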
As per the official Apache Spark documentation, PyPI-installed PySpark (`pip install "pyspark[connect]"`) comes with Spark Connect from Spark version 3.4 onwards. However, up to Spark version 3.4.1, the PySpark installation lacks the two administration scripts used to start and stop the Spark Connect server. For convenience, these two scripts have therefore been copied into this Git repository, in the `tools/` directory. They may simply be copied into the PySpark `sbin/` directory once PySpark has been installed with `pip`, as detailed in the following steps.
* Install PySpark, together with JupyterLab and a few other Python packages, with `pip`:
```bash
$ pip install -U "pyspark[connect,sql,pandas_on_spark]" plotly pyvis jupyterlab
```
* Add the following environment variables to the Shell init script (e.g., `~/.bashrc`):
```bash
$ cat >> ~/.bashrc << _EOF
export PY_LIBDIR="$(python -mpip show pyspark | grep "^Location:" | cut -d' ' -f2,2)"
export SPARK_VERSION="$(python -mpip show pyspark | grep "^Version:" | cut -d' ' -f2,2)"
export SPARK_HOME="\$PY_LIBDIR/pyspark"
export PATH="\$SPARK_HOME/sbin:\$PATH"
export PYSPARK_PYTHON="$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'
_EOF
```
* Re-read the Shell init scripts:
```bash
$ exec bash
```
* Copy the two Spark Connect administration scripts into the PySpark installation:
```bash
$ cp tools/st*-connect*.sh $SPARK_HOME/sbin/
$ ls -lFh $SPARK_HOME/sbin/*connect*.sh
-rwxr-xr-x 1 user staff 1.5K Jun 28 16:54 $PY_LIBDIR/pyspark/sbin/start-connect-server.sh*
-rwxr-xr-x 1 user staff 1.0K Jun 28 16:54 $PY_LIBDIR/pyspark/sbin/stop-connect-server.sh*
```
* Add the following Shell aliases (e.g., to `~/.bash_aliases`), to start and stop the Spark Connect server with Delta Lake support, and to launch PySpark with Delta Lake support:
```bash
$ cat >> ~/.bash_aliases << _EOF
# Spark Connect
alias sparkconnectstart='unset SPARK_REMOTE; start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION,io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'
alias sparkconnectstop='stop-connect-server.sh'
# PySpark with Delta Lake support
alias pysparkdelta='pyspark --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'
_EOF
```
* Re-read the Shell aliases:
```bash
. ~/.bash_aliases
```
The section below is kept for reference only; it is normally not needed, as installing PySpark with `pip` (as above) is enough. To install a full Spark distribution manually:
```bash
$ export SPARK_VERSION="3.4.1"
$ wget https://dlcdn.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
$ tar zxf spark-$SPARK_VERSION-bin-hadoop3.tgz && \
  mv spark-$SPARK_VERSION-bin-hadoop3 ~/ && \
  rm -f spark-$SPARK_VERSION-bin-hadoop3.tgz
```
* Add the following environment variables to the Shell init script (e.g., `~/.bashrc`) and re-read it:
```bash
$ cat >> ~/.bashrc << _EOF
export SPARK_VERSION="${SPARK_VERSION}"
export SPARK_HOME="\$HOME/spark-\$SPARK_VERSION-bin-hadoop3"
export PATH="\$SPARK_HOME/bin:\$SPARK_HOME/sbin:\$PATH"
export PYTHONPATH=\$(ZIPS=("\$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "\${ZIPS[*]}"):\$PYTHONPATH
export PYSPARK_PYTHON="$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'
_EOF
$ exec bash
```
* Add the following Shell aliases to start and stop the Spark Connect server:
```bash
$ cat >> ~/.bash_aliases << _EOF
# Spark Connect
alias sparkconnectstart='start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:${SPARK_VERSION}'
alias sparkconnectstop='stop-connect-server.sh'
_EOF
. ~/.bash_aliases
```