Python Spark Example

To save PySpark code as a Python script and run it, you do not need to bundle or configure Spark inside the script itself: a PySpark script assumes the necessary Spark runtime will be available when it is executed with spark-submit or within a PySpark environment.

Here’s a complete example script, with the proper imports and setup, that you can save and run as a Python file (e.g., example.py):

example.py

from pyspark import SparkConf, SparkContext

# Initialize Spark context
conf = SparkConf().setAppName("ExampleApp")
sc = SparkContext(conf=conf)

# Sample data: a list of numbers
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Parallelize the data
distData = sc.parallelize(data)

# Filter the data to keep only even numbers
evenNumbers = distData.filter(lambda x: x % 2 == 0)

# Sum the filtered even numbers
sumEvenNumbers = evenNumbers.reduce(lambda x, y: x + y)

# Collect the filtered even numbers
collectedEvenNumbers = evenNumbers.collect()

# Print the results
print("Even Numbers: ", collectedEvenNumbers)
print("Sum of Even Numbers: ", sumEvenNumbers)

# Stop the Spark context
sc.stop()

Running the Script

To run this script, use the spark-submit command, which launches it in an environment where the necessary Spark runtime is available:

spark-submit example.py
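
Depending on your environment, you may also pass standard spark-submit options on the same command line; for example, to run locally using all available cores:

spark-submit --master local[*] example.py

When the job finishes, the script's print statements produce output along these lines (Spark's own log messages, which appear alongside, are omitted):

Even Numbers: [2, 4, 6, 8, 10]
Sum of Even Numbers: 30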

Explanation

  • Imports: Import SparkConf and SparkContext from pyspark.
  • SparkConf: Create a Spark configuration object and set the application name.
  • SparkContext: Initialize the Spark context with the configuration.
  • Stop the Spark Context: Always stop the Spark context at the end of the script to free up resources (see the try/finally sketch after this list).
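
As a minimal sketch of that last point, wrapping the job in try/finally guarantees the context is stopped even when a transformation or action raises an error (same job as above, shortened for illustration):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ExampleApp")
sc = SparkContext(conf=conf)

try:
    # Any failure in the distributed job still falls through to finally
    evenNumbers = sc.parallelize(range(1, 11)).filter(lambda x: x % 2 == 0)
    print("Even Numbers: ", evenNumbers.collect())
finally:
    # Always release cluster resources, even on error
    sc.stop()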

This script initializes the Spark context, performs the transformations and actions, and then prints the results. Running it with spark-submit ensures that Spark is properly initialized and that the script executes in the correct environment.
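
For reference, newer PySpark code often obtains the context through a SparkSession instead of building SparkConf and SparkContext by hand. A minimal sketch of the same job using that API (an alternative, not a required change) looks like this:

from pyspark.sql import SparkSession

# getOrCreate() returns an existing session or builds a new one
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
sc = spark.sparkContext  # underlying SparkContext for RDD operations

evenNumbers = sc.parallelize(range(1, 11)).filter(lambda x: x % 2 == 0)
print("Even Numbers: ", evenNumbers.collect())
print("Sum of Even Numbers: ", evenNumbers.reduce(lambda x, y: x + y))

spark.stop()

This variant runs with the same spark-submit example.py command.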