Spark - BKJackson/BKJackson_Wiki GitHub Wiki

Visualization of Spark Architecture (from Spark API)
What is Spark?
Spark is a general-purpose distributed data processing engine.
RDD = Resilient Distributed Dataset. RDDs hide the complexity of transforming and distributing your data: when you run on a cluster, a scheduler automatically partitions the data and the work across multiple nodes.
PySpark
In a Python context, think of PySpark as a way to handle parallel processing without the need for the threading or multiprocessing modules. All of the complicated communication and synchronization between threads, processes, and even different CPUs is handled by Spark. PySpark Intro
SparkContext for a local machine
The entry-point of any PySpark program is a SparkContext object. This object allows you to connect to a Spark cluster and create RDDs. The local[*] string is a special string denoting that you’re using a local cluster, which is another way of saying you’re running in single-machine mode. The * tells Spark to create as many worker threads as logical cores on your machine.
import pyspark
sc = pyspark.SparkContext('local[*]')
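The `*` resolves to the number of logical cores the OS reports, so the worker-thread count matches your machine. You can check that number yourself (a plain-Python illustration, not part of the Spark API):

```python
import os

# Number of logical cores on this machine; this is how many
# worker threads 'local[*]' asks Spark to create.
print(os.cpu_count())
```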
SparkContext for a cluster
Creating a SparkContext can be more involved when you’re using a cluster. To connect to a Spark cluster, you might need to handle authentication and a few other pieces of information specific to your cluster. You can set up those details similarly to the following:
conf = pyspark.SparkConf()
conf.setMaster('spark://head_node:56887')
conf.set('spark.authenticate', True)
conf.set('spark.authenticate.secret', 'secret-key')
sc = pyspark.SparkContext(conf=conf)
filter(), map(), and reduce()
filter() - filters items out of an iterable based on a condition, typically expressed as a lambda function. filter() takes an iterable, calls the lambda function on each item, and returns the items in an iterable where the lambda returned True.
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(filter(lambda arg: len(arg) < 8, x)))
['Python', 'is']
map() is similar to filter() in that it applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items. The new iterable that map() returns will always have the same number of elements as the original iterable, which was not the case with filter(). map() automatically calls the lambda function on all the items, effectively replacing a for loop.
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(map(lambda arg: arg.upper(), x)))
['PYTHON', 'PROGRAMMING', 'IS', 'AWESOME!']
reduce() also applies a function to elements in an iterable, but it doesn't return a new iterable. Instead, reduce() uses the supplied function to collapse the iterable to a single value, so there is no call to list() here. Note: Python 3.x moved the built-in reduce() function into the functools package.
>>> from functools import reduce
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(reduce(lambda val1, val2: val1 + val2, x))
Pythonprogrammingisawesome!
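Because each of these returns a plain iterable or value, the three calls compose naturally. The examples above can be chained into one pipeline (a pure-Python sketch, no Spark involved):

```python
from functools import reduce

x = ['Python', 'programming', 'is', 'awesome!']

# Keep the short words, upper-case them, then concatenate.
short = filter(lambda arg: len(arg) < 8, x)
upper = map(lambda arg: arg.upper(), short)
result = reduce(lambda val1, val2: val1 + val2, upper)
print(result)  # PYTHONIS
```

This filter/map/reduce pattern is exactly the shape of computation Spark parallelizes across an RDD's partitions.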
Spark SQL
Reading CSV files into a Spark dataframe
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data_path = '/MyPath/data'
file_path = os.path.join(data_path, 'location_temp.csv')
df1 = spark.read.format("csv").option("header", "true").load(file_path)
Reading JSON files
json_df2_path = data_path + "/utilization.json"
df2 = spark.read.format("json").load(json_df2_path)
Display dataframe data
df2.head(3)
Output:
[Row(cpu_utilization=0.57, event_datetime='03/05/2019 08:06:14', free_memory=0.51, server_id=100, session_count=47),
Row(cpu_utilization=0.47, event_datetime='03/05/2019 08:11:14', free_memory=0.62, server_id=100, session_count=43),
Row(cpu_utilization=0.56, event_datetime='03/05/2019 08:16:14', free_memory=0.57, server_id=100, session_count=62)]
Display dataframe data as a table
df2.show(3)
Output:
+---------------+-------------------+-----------+---------+-------------+
|cpu_utilization| event_datetime|free_memory|server_id|session_count|
+---------------+-------------------+-----------+---------+-------------+
| 0.57|03/05/2019 08:06:14| 0.51| 100| 47|
| 0.47|03/05/2019 08:11:14| 0.62| 100| 43|
| 0.56|03/05/2019 08:16:14| 0.57| 100| 62|
+---------------+-------------------+-----------+---------+-------------+
only showing top 3 rows
Print schema
df2.printSchema()
Output:
root
|-- cpu_utilization: double (nullable = true)
|-- event_datetime: string (nullable = true)
|-- free_memory: double (nullable = true)
|-- server_id: long (nullable = true)
|-- session_count: long (nullable = true)
Create dataframe manually
from pyspark.sql import Row
df_dup = sc.parallelize([Row(server_name='101 Server', cpu_utilization=85, session_count=80),
Row(server_name='101 Server', cpu_utilization=80, session_count=90),
Row(server_name='102 Server', cpu_utilization=85, session_count=80),
Row(server_name='102 Server', cpu_utilization=85, session_count=80)]).toDF()
Sort dataframe by column
df2_sort = df2.sort('event_datetime')
Show summary statistics
df2.describe().show()
Output:
+-------+-------------------+-------------------+-------------------+------------------+------------------+
|summary| cpu_utilization| event_datetime| free_memory| server_id| session_count|
+-------+-------------------+-------------------+-------------------+------------------+------------------+
| count| 500000| 500000| 500000| 500000| 500000|
| mean| 0.6205177400000123| null| 0.3791280999999977| 124.5| 69.59616|
| stddev|0.15875173872912837| null|0.15830931278376217|14.430884120553255|14.850676696352865|
| min| 0.22|03/05/2019 08:06:14| 0.0| 100| 32|
| max| 1.0|04/09/2019 01:22:46| 0.78| 149| 105|
+-------+-------------------+-------------------+-------------------+------------------+------------------+
Calculate correlation between columns
df_util.stat.corr('free_memory', 'cpu_utilization')
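stat.corr computes the Pearson correlation coefficient between the two columns. Sketched in plain Python with made-up values (the column pairing here is illustrative, not taken from the dataset):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly anti-correlated toy data (as free memory falls, CPU rises):
print(pearson([0.1, 0.2, 0.3], [0.9, 0.8, 0.7]))  # ~ -1.0
```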
Calculate item frequency
df_util.stat.freqItems(('server_id','session_count')).show()
Get number of rows
df2.count()
Group by one column and calculate mean of another column
df1.groupby('location_id').agg({'temp_celcius': 'mean'}).show()
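The groupby/agg call computes one mean per group. In plain-Python terms, with hypothetical location/temperature rows, the same logic looks like this:

```python
from collections import defaultdict

rows = [('loc1', 29.0), ('loc1', 31.0), ('loc2', 25.0)]

# Accumulate sum and count per location_id, then divide -- the same
# computation as groupby('location_id').agg({'temp_celcius': 'mean'}).
totals = defaultdict(lambda: [0.0, 0])
for loc, temp in rows:
    totals[loc][0] += temp
    totals[loc][1] += 1
means = {loc: s / c for loc, (s, c) in totals.items()}
print(means)  # {'loc1': 30.0, 'loc2': 25.0}
```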
Save dataframe to csv
df1.write.csv('df1.csv')
Save dataframe to json
df1.write.json('df1.json')
Working with null values
Add a string-typed column containing only nulls
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
df_na = df.withColumn('na_col', lit(None).cast(StringType()))
Drop rows with null values
df2.na.drop()
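na.drop() discards every row that contains at least one null. The equivalent plain-Python filter, on toy dict rows:

```python
rows = [
    {'server_id': 100, 'na_col': None},
    {'server_id': 101, 'na_col': 'ok'},
]

# Keep only rows where no field is None -- what df.na.drop() does.
clean = [r for r in rows if all(v is not None for v in r.values())]
print(len(clean))  # 1
```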
Run SQL queries on dataframe
Start by giving the table a name "utilization":
df.createOrReplaceTempView("utilization")
# Select all columns
df_sql = spark.sql("SELECT * FROM utilization LIMIT 10")
# Select two columns
df_sql = spark.sql("SELECT server_id, session_count FROM utilization LIMIT 10")
# Select two columns and give them alias names
df_sql = spark.sql("SELECT server_id as sid, session_count as sc FROM utilization")
# Select all where server_id = 120
df_sql = spark.sql("SELECT * FROM utilization WHERE server_id = 120")
# Writing longer SQL statements
df_sql = spark.sql("SELECT server_id, session_count \
FROM utilization \
WHERE session_count > 70 AND server_id = 120 \
ORDER BY session_count DESC")
# Select unique server_id values with DISTINCT
df_count = spark.sql("SELECT DISTINCT server_id FROM utilization ORDER BY server_id")
# Get count of number of rows in the table
df_count = spark.sql("SELECT count(*) FROM utilization")
# Get count of rows where session_count > 70 and group by server_id and order by descending count
df_sql = spark.sql("SELECT server_id, count(*) \
FROM utilization \
WHERE session_count > 70 \
GROUP BY server_id \
ORDER BY count(*) DESC")
# Create some summary statistics and round the output to 2 decimal places
df_sql = spark.sql("SELECT server_id, min(session_count), round(avg(session_count),2), max(session_count) \
FROM utilization \
WHERE session_count > 70 \
GROUP BY server_id \
ORDER BY count(*) DESC")
# Join two tables by server_id
df_join = spark.sql("SELECT u.server_id, sn.server_name, u.session_count \
FROM utilization u \
INNER JOIN server_name sn \
ON sn.server_id = u.server_id")
df_sql.show()
Alternative way to query and view data
sql_window2 = "SELECT event_datetime, server_id, cpu_utilization, \
avg(cpu_utilization) OVER (PARTITION BY server_id) avg_server_util, \
cpu_utilization - avg(cpu_utilization) OVER (PARTITION BY server_id) delta_server_util \
FROM utilization"
spark.sql(sql_window2).show()
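Unlike GROUP BY, the OVER (PARTITION BY server_id) window leaves the row count unchanged: every row gets its partition's average attached, plus the row's delta from that average. Sketched in plain Python with made-up rows:

```python
from collections import defaultdict

rows = [
    {'server_id': 100, 'cpu_utilization': 0.57},
    {'server_id': 100, 'cpu_utilization': 0.47},
    {'server_id': 101, 'cpu_utilization': 0.90},
]

# Compute the per-partition average once...
groups = defaultdict(list)
for r in rows:
    groups[r['server_id']].append(r['cpu_utilization'])
avgs = {sid: sum(v) / len(v) for sid, v in groups.items()}

# ...then attach it (and the delta) to every original row.
for r in rows:
    r['avg_server_util'] = avgs[r['server_id']]
    r['delta_server_util'] = r['cpu_utilization'] - r['avg_server_util']
print(rows[0]['avg_server_util'])  # ~0.52
```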
Spark for Machine Learning and AI
Driver host parameter setting for pyspark.ml
spark = SparkSession.builder \
    .config("spark.driver.host", "127.0.0.1") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .getOrCreate()
Articles
A Neanderthal’s Guide to Apache Spark in Python - Tutorial on Getting Started with PySpark for Complete Beginners, June 14, 2019
First Steps With PySpark and Big Data Processing - Real Python, July 31 2019