# databricks-examples

Examples of code with Databricks

## Table of Contents (ToC)

Created by gh-md-toc

## Overview

This Git repository features use cases illustrating good and bad practices when using Spark-based tools to process and analyze data.

## References

* Spark
* Spark Connect
* Python
* Jupyter

## Quick start

## Use cases

## Initial setup

### PySpark and Jupyter

* Add the following environment variables to the Shell init script (e.g., `~/.bashrc`):
```bash
$ cat >> ~/.bashrc << '_EOF'

# Spark
PY_LIBDIR="$(python -mpip show pyspark|grep "^Location:"|cut -d' ' -f2,2)"
export SPARK_VERSION="$(python -mpip show pyspark|grep "^Version:"|cut -d' ' -f2,2)"
export SPARK_HOME="$PY_LIBDIR/pyspark"
export PATH="$SPARK_HOME/sbin:$PATH"
export PYSPARK_PYTHON="$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'

_EOF
```


* Re-read the Shell init scripts:
```bash
$ exec bash
```
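Once the init scripts have been re-read, launching `pyspark` starts the driver through JupyterLab on port 8889, as configured by `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` above. As a quick sanity check, a first notebook cell along the following lines should work. This is only a sketch: it assumes the notebook kernel picks up the pip-installed PySpark, and the application name is purely illustrative.

```python
# Minimal sanity check for the PySpark + JupyterLab setup (sketch).
# Assumes the notebook kernel sees the pip-installed pyspark package.
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session; the `spark` variable may not be
# pre-defined in the notebook kernel, so build it explicitly.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

print(spark.version)    # should match the SPARK_VERSION environment variable
spark.range(5).show()   # trivial DataFrame to confirm the session works

spark.stop()
```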


* Copy the two Spark Connect administrative scripts into the PySpark
  installation:
```bash
$ cp tools/st*-connect*.sh $SPARK_HOME/sbin/
```

* Add the following Shell aliases, which start and stop the Spark Connect
  server with Delta Lake support and launch PySpark with Delta Lake:
```bash
$ cat >> ~/.bash_aliases << _EOF

# Spark Connect
alias sparkconnectstart='unset SPARK_REMOTE; start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION,io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'
alias sparkconnectstop='stop-connect-server.sh'

# PySpark
alias pysparkdelta='pyspark --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'

_EOF
```


* Re-read the Shell aliases:
```bash
. ~/.bash_aliases
```
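With the aliases sourced, `sparkconnectstart` launches a local Spark Connect server with the Delta Lake packages, and `sparkconnectstop` shuts it down. The sketch below shows how a Python client might then connect to that server and exercise Delta Lake through it; it assumes the server listens on the default Spark Connect port (15002), that the client dependencies are installed (e.g., `pip install "pyspark[connect]"`), and that `/tmp/delta-demo` is just an illustrative path.

```python
# Sketch: talk to the local Spark Connect server started by `sparkconnectstart`.
# Assumptions: default Connect port 15002, pyspark[connect] client dependencies
# installed, and /tmp/delta-demo used purely as an example location.
from pyspark.sql import SparkSession

# Connect to the Spark Connect server instead of starting a local JVM driver
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Write a small Delta table; the Delta packages are loaded on the server side
spark.range(10).withColumnRenamed("id", "value") \
    .write.format("delta").mode("overwrite").save("/tmp/delta-demo")

# Read it back through the same remote session
spark.read.format("delta").load("/tmp/delta-demo").show()
```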

### Install native Spark manually

* Add the following environment variables to the Shell init script (e.g.,
  `~/.bashrc`), adjusting `SPARK_VERSION` to the downloaded Spark release,
  and re-read it:
```bash
$ cat >> ~/.bashrc << '_EOF'

# Spark
export SPARK_VERSION="${SPARK_VERSION}"
export SPARK_HOME="$HOME/spark-$SPARK_VERSION-bin-hadoop3"
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:${PATH}"
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
export PYSPARK_PYTHON="$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'

_EOF
$ exec bash
```
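The `PYTHONPATH` line above puts the `pyspark` and `py4j` zip archives shipped under `$SPARK_HOME/python/lib` on the Python path, so the native distribution's PySpark is used rather than a pip-installed one. A quick check of which copy the interpreter actually picks up might look like the following sketch:

```python
# Sketch: confirm that the PySpark picked up by Python comes from the native
# Spark distribution (i.e., from the zip archives under $SPARK_HOME/python/lib).
import os
import pyspark

print(pyspark.__version__)             # expected to match $SPARK_VERSION
print(pyspark.__file__)                # expected to point under $SPARK_HOME/python/lib
print(os.environ.get("SPARK_HOME"))    # the native Spark installation directory
```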


* Add the following Shell aliases to start and stop the Spark Connect server:
```bash
$ cat >> ~/.bash_aliases << _EOF

# Spark Connect
alias sparkconnectstart='start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:${SPARK_VERSION}'
alias sparkconnectstop='stop-connect-server.sh'

_EOF
. ~/.bash_aliases
```
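Here `sparkconnectstart` runs the native distribution's Spark Connect server without the Delta Lake packages, so a client session is limited to plain Spark. A minimal sanity check from Python might look like the following sketch, under the same assumptions as before (default port 15002 and the `pyspark[connect]` client dependencies installed):

```python
# Sketch: basic sanity check against the native Spark Connect server
# (no Delta Lake packages loaded in this configuration).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Run a trivial aggregation remotely to confirm the server answers
df = spark.range(100).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().orderBy("bucket").show()
```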