
Setup in Different Environments

This page walks you through setting up Jupyter and Spark on macOS, and through configuring your Spark application in an Amazon EMR notebook. Follow the instructions for a seamless installation and configuration process.

Set up Jupyter + Spark on macOS

1. Install Homebrew

Homebrew is a package manager for macOS that simplifies software installation. To install it, run the following command in your terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

After the installation, add Homebrew to your PATH:

echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"

2. Install Java

Apache Spark requires Java. You can install it using Homebrew:

brew install openjdk@8

For Macs with the Apple M1 chip, Azul's Zulu build of OpenJDK 8 is recommended:

wget "https://cdn.azul.com/zulu/bin/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz"
tar -xvf zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz
export JAVA_HOME=$PWD/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64
echo "export JAVA_HOME=$JAVA_HOME" >> ~/.zprofile
export PATH=$PATH:$JAVA_HOME/bin
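To confirm the Java setup took effect, a small sanity check can help. The sketch below uses a hypothetical helper (`java_home_ok` is not part of the setup steps) that returns True only if JAVA_HOME points at a directory containing bin/java:

```python
# Sanity-check sketch: verify that JAVA_HOME points at a directory
# that actually contains a bin/java executable.
import os

def java_home_ok(env=os.environ):
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False
    return os.path.isfile(os.path.join(java_home, "bin", "java"))
```

Run `java_home_ok()` in a fresh terminal after the steps above; if it returns False, re-check the lines added to ~/.zprofile.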

3. Install Python and Scala

Install Python and Scala using Homebrew:

brew install python
brew install scala

4. Install PySpark

You have multiple options for installing PySpark:

  • Install with pip:

    pip install pyspark
    # Or pin a specific version
    pip install pyspark==3.1.2

  • Install with Conda:

    conda install pyspark
    # Or pin a specific version
    conda install pyspark=3.1.2 -y

  • Download the binaries:

    wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
    tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
    cd spark-3.1.2-bin-hadoop3.2
    export SPARK_HOME=`pwd`
    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
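The PYTHONPATH one-liner in the last option collects every .zip under $SPARK_HOME/python/lib and joins it, colon-separated, ahead of the existing PYTHONPATH. A minimal Python sketch of the same logic (`spark_pythonpath` is a hypothetical name used only for illustration):

```python
# Equivalent of the shell one-liner above: glob every .zip under
# $SPARK_HOME/python/lib and join it ahead of the existing PYTHONPATH.
import glob
import os

def spark_pythonpath(spark_home, existing=""):
    zips = sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))
    parts = zips + ([existing] if existing else [])
    return os.pathsep.join(parts)
```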

5. Install JupyterLab

Install JupyterLab using one of the following methods:

  • Install with pip:

    pip install jupyterlab

  • Install with Conda:

    conda install jupyterlab

  • Install with Homebrew:

    brew install jupyterlab

6. Start JupyterLab

Start JupyterLab by running:

jupyter lab

Access it through the URLs printed in the terminal.

How to Configure Your Spark Application in an Amazon EMR Notebook

Apache Spark: A Comprehensive Distributed Computing Framework

Apache Spark is a distributed computing framework for large-scale data processing. To create a basic SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Read datasets and transform them:

df = spark.read.json("examples/src/main/resources/people.json")
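Note that spark.read.json expects JSON Lines input by default: one self-contained JSON object per line, not a single JSON array. A minimal sketch that writes such a file with the standard library (`write_json_lines` is a hypothetical helper, and the records are illustrative sample data):

```python
# Write a JSON Lines file of the shape spark.read.json expects:
# one JSON object per line.
import json

def write_json_lines(path, records):
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

people = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]
```

After `write_json_lines("people.json", people)`, the file can be read back with `spark.read.json("people.json")`.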

Spark Configuration: Tailoring Your Spark Environment

Configure the environment for each Spark session. Below are some basic configurations:

| Parameter | Description | Values |
| --- | --- | --- |
| jars | Jars to be used in the session | List of string |
| pyFiles | Python files to be used in the session | List of string |
| files | Files to be used in the session | List of string |
| driverMemory | Memory for the driver process | string |
| driverCores | Cores for the driver process | int |
| executorMemory | Memory for the executor process | string |
| executorCores | Cores for the executor process | int |
| numExecutors | Number of executors | int |
| archives | Archives to be used in the session | List of string |
| queue | YARN queue name | string |
| name | Session name (must be lowercase) | string |
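As a sketch of how these parameters are typed, the following hypothetical validator (the names `EXPECTED_TYPES` and `invalid_params` are illustrative, not part of any API) checks a session-config dict against the table:

```python
# Map each session parameter from the table above to its expected
# Python type, and flag any value of the wrong type.
EXPECTED_TYPES = {
    "jars": list, "pyFiles": list, "files": list, "archives": list,
    "driverMemory": str, "executorMemory": str, "queue": str, "name": str,
    "driverCores": int, "executorCores": int, "numExecutors": int,
}

def invalid_params(config):
    return [key for key, value in config.items()
            if key in EXPECTED_TYPES
            and not isinstance(value, EXPECTED_TYPES[key])]
```

For example, `invalid_params({"executorMemory": 4})` flags `executorMemory`, since memory sizes are strings like "4G", not integers.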

Amazon EMR: A Robust Solution for Big Data Processing

Amazon EMR is ideal for big data processing, interactive analytics, and machine learning. An EMR cluster can run Apache Spark alongside Apache Hive and Presto, providing elasticity and cost efficiency.

Jupyter Notebook with Spark Settings on EMR

Amazon EMR notebooks use the Sparkmagic kernel. To change the Spark session configuration, use the %%configure cell magic:

%%configure -f
{
    "executorMemory": "4G"
}

For more specific configurations:

%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "false",
        "spark.jars.packages": "io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0",
        "spark.sql.extensions": "io.qbeast.spark.internal.QbeastSparkSessionExtension"
    }
}
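If you generate such configuration cells programmatically, the body is plain JSON and can be built with the standard json module. A minimal sketch reproducing the payload above (the package coordinates are taken verbatim from the example):

```python
# Build the %%configure body above as a Python dict and serialize it.
import json

conf = {
    "conf": {
        "spark.dynamicAllocation.enabled": "false",
        "spark.jars.packages": (
            "io.qbeast:qbeast-spark_2.12:0.2.0,"
            "io.delta:delta-core_2.12:1.0.0"
        ),
        "spark.sql.extensions":
            "io.qbeast.spark.internal.QbeastSparkSessionExtension",
    }
}
payload = json.dumps(conf, indent=4)
```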

Verify the configuration with:

%%info

Check server logs on the EMR cluster at /var/log/livy/livy-livy-server.out.

With these steps complete, you have a working Jupyter + Spark setup on macOS and a properly configured Spark application in an Amazon EMR notebook.