
Setup in Different Environments

This page walks you through setting up Jupyter and Spark on macOS, and through configuring your Spark application in an Amazon EMR notebook. Follow the instructions for a seamless installation and configuration process.

Set up Jupyter + Spark on macOS

1. Install Homebrew

Homebrew is a package manager for macOS that simplifies software installation. To install it, run the following command in your terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

After the installation, add Homebrew to your PATH:

echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"

2. Install Java

Apache Spark requires Java. You can install it using Homebrew:

brew install openjdk@8

For Macs with the Apple M1 chip, Azul's Zulu build of OpenJDK 8 is recommended:

wget "https://cdn.azul.com/zulu/bin/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz"
tar -xvf zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64.tar.gz
export JAVA_HOME=$PWD/zulu8.64.0.19-ca-jdk8.0.345-macosx_aarch64
echo "export JAVA_HOME=$JAVA_HOME" >> ~/.zprofile
export PATH=$PATH:$JAVA_HOME/bin
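To confirm the Java setup took effect, a small sanity check can help. The sketch below uses a hypothetical helper (`java_home_ok` is not part of the setup steps) that returns True only if JAVA_HOME points at a directory containing bin/java:

```python
# Sanity-check sketch: verify that JAVA_HOME points at a directory
# that actually contains a bin/java executable.
import os

def java_home_ok(env=os.environ):
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False
    return os.path.isfile(os.path.join(java_home, "bin", "java"))
```

Run `java_home_ok()` in a fresh terminal after the steps above; if it returns False, re-check the lines added to ~/.zprofile.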

3. Install Python and Scala

Install Python and Scala using Homebrew:

brew install python
brew install scala

4. Install PySpark

You have multiple options for installing PySpark:

  • Install with pip:

    pip install pyspark
    # Or pin a specific version
    pip install pyspark==3.1.2

  • Install with Conda:

    conda install pyspark
    # Or pin a specific version
    conda install pyspark=3.1.2 -y

  • Download the binaries:

    wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
    tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
    cd spark-3.1.2-bin-hadoop3.2
    export SPARK_HOME=`pwd`
    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
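The PYTHONPATH one-liner in the last option collects every .zip under $SPARK_HOME/python/lib and joins it, colon-separated, ahead of the existing PYTHONPATH. A minimal Python sketch of the same logic (`spark_pythonpath` is a hypothetical name used only for illustration):

```python
# Equivalent of the shell one-liner above: glob every .zip under
# $SPARK_HOME/python/lib and join it ahead of the existing PYTHONPATH.
import glob
import os

def spark_pythonpath(spark_home, existing=""):
    zips = sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))
    parts = zips + ([existing] if existing else [])
    return os.pathsep.join(parts)
```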

5. Install JupyterLab

Install JupyterLab using one of the following methods:

  • Install with pip:

    pip install jupyterlab

  • Install with Conda:

    conda install jupyterlab

  • Install with Homebrew:

    brew install jupyterlab

6. Start JupyterLab

Start JupyterLab by running:

jupyter lab

Access it through the URLs printed in the terminal.

How to Configure Your Spark Application in an Amazon EMR Notebook

Apache Spark: A Comprehensive Distributed Computing Framework

Apache Spark is a distributed computing framework for large-scale data processing. To create a basic SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Read datasets and transform them:

df = spark.read.json("examples/src/main/resources/people.json")
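Note that spark.read.json expects JSON Lines input by default: one self-contained JSON object per line, not a single JSON array. A minimal sketch that writes such a file with the standard library (`write_json_lines` is a hypothetical helper, and the records are illustrative sample data):

```python
# Write a JSON Lines file of the shape spark.read.json expects:
# one JSON object per line.
import json

def write_json_lines(path, records):
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

people = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]
```

After `write_json_lines("people.json", people)`, the file can be read back with `spark.read.json("people.json")`.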

Spark Configuration: Tailoring Your Spark Environment

Configure the environment for each Spark session. Below are some basic configurations:

| Parameter | Description | Values |
| --- | --- | --- |
| jars | Jars to be used in the session | List of string |
| pyFiles | Python files to be used in the session | List of string |
| files | Files to be used in the session | List of string |
| driverMemory | Memory for the driver process | string |
| driverCores | Cores for the driver process | int |
| executorMemory | Memory for the executor process | string |
| executorCores | Cores for the executor process | int |
| numExecutors | Number of executors | int |
| archives | Archives to be used in the session | List of string |
| queue | YARN queue name | string |
| name | Session name (must be lowercase) | string |
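As a sketch of how these parameters are typed, the following hypothetical validator (the names `EXPECTED_TYPES` and `invalid_params` are illustrative, not part of any API) checks a session-config dict against the table:

```python
# Map each session parameter from the table above to its expected
# Python type, and flag any value of the wrong type.
EXPECTED_TYPES = {
    "jars": list, "pyFiles": list, "files": list, "archives": list,
    "driverMemory": str, "executorMemory": str, "queue": str, "name": str,
    "driverCores": int, "executorCores": int, "numExecutors": int,
}

def invalid_params(config):
    return [key for key, value in config.items()
            if key in EXPECTED_TYPES
            and not isinstance(value, EXPECTED_TYPES[key])]
```

For example, `invalid_params({"executorMemory": 4})` flags `executorMemory`, since memory sizes are strings like "4G", not integers.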

Amazon EMR: A Robust Solution for Big Data Processing

Amazon EMR is ideal for big data processing, interactive analytics, and machine learning. An EMR cluster can run Apache Spark alongside Apache Hive and Presto, providing elasticity and cost efficiency.

Jupyter Notebook with Spark Settings on EMR

Amazon EMR notebooks use the Sparkmagic kernel. To change the Spark session configuration, use the %%configure cell magic:

%%configure -f
{
    "executorMemory": "4G"
}

For more specific configurations:

%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "false",
        "spark.jars.packages": "io.qbeast:qbeast-spark_2.12:0.2.0,io.delta:delta-core_2.12:1.0.0",
        "spark.sql.extensions": "io.qbeast.spark.internal.QbeastSparkSessionExtension"
    }
}
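If you generate such configuration cells programmatically, the body is plain JSON and can be built with the standard json module. A minimal sketch reproducing the payload above (the package coordinates are taken verbatim from the example):

```python
# Build the %%configure body above as a Python dict and serialize it.
import json

conf = {
    "conf": {
        "spark.dynamicAllocation.enabled": "false",
        "spark.jars.packages": (
            "io.qbeast:qbeast-spark_2.12:0.2.0,"
            "io.delta:delta-core_2.12:1.0.0"
        ),
        "spark.sql.extensions":
            "io.qbeast.spark.internal.QbeastSparkSessionExtension",
    }
}
payload = json.dumps(conf, indent=4)
```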

Verify the configuration with:

%%info

Check server logs on the EMR cluster at /var/log/livy/livy-livy-server.out.

With these steps complete, you have a working Jupyter + Spark setup on macOS and a properly configured Spark application in an Amazon EMR notebook.