PySpark Made Easy - mdjibran/PySparkGuide GitHub Wiki
Index
Transformations
- map()
- filter()
- flatMap()
- intersection()
- distinct()
- groupByKey()
- reduceByKey()
- aggregateByKey()
- sortByKey()
- join()
- cogroup()
- cartesian()
- pipe()
- coalesce()
- repartition()
- repartitionAndSortWithinPartitions()
- mapPartitions()
- mapPartitionsWithIndex()
- sample()
- union()
Actions
- reduce()
- count()
- first()
- take()
- takeSample()
- takeOrdered()
- saveAsTextFile()
- saveAsSequenceFile()
- saveAsObjectFile()
- countByKey()
- foreach()
- collect()
Content
Transformations
1. map()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
text = sc.textFile('/data/sample.csv')
text.first()
2. filter()
Purpose: To obtain a subset of records from a RDD
Syntax:
RDD.filter(lambda i: condition)
RDD.filter(func)
Input/Output: Condition or function/RDD
USE when: A large dataset is to be filtered into smaller chunks based on certain criteria
NOT to USE when:
Example:
# 1
nums = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
RDD = nums.filter(lambda x: x % 2 == 0)
RDD.take(2)
# 2
def Func(x):
  if x%2 == 0:
    return true
  else
    return false
nums = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
RDD = nums.filter(Func)
RDD.take(2)
reduceByKey()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
orderItems = sc.textFile("/data/data-master/retail_db/order_items")
orderItemsMap = orderItems.map(lambda i: (int(i.split(',')[1]), float(i.split(',')[4])))
orderItemsSubTotal = orderItemsMap.reduceByKey(lambda x, y: x + y)
Get minimum value
orderItemsSubTotal = orderItemsMap.reduceByKey(lambda x, y: x if(x < y) else y)
aggregateByKey()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
orderItems = sc.textFile("/data/data-master/retail_db/order_items")
orderItemsMap = orderItems.map(lambda x: (int(x.split(',')[1]), float(x.split(',')[4]) ))
orderGroupedCount = orderItemsMap.aggregateByKey((0.0, 0),
lambda x,y: (x[0]+y, x[1]+1),
lambda x,y: (x[0]+y[0], x[1]+y[1])
)
sortByKey()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
products = sc.textFile("/data/data-master/retail_db/products")
productsMap = products.map(lambda x: (float(x.split(',')[4]) if x.split(',')[4] !='' else float(x.split(',')[5] ), x.split(',')[2]))
sortedProducts = productsMap.sortByKey()
)
# Sort data by product category  and then by product price descending - sortByKey
products = sc.textFile("/data/data-master/retail_db/products")
productsMap = products\
.filter(lambda x: x.split(',')[4] != '')\
.map(lambda x: ((int(x.split(',')[1]), float(x.split(',')[4] )), x.split(',')[2] ))\
.sortByKey()
# To get orderby key (x,y) where x is ascending and y is descending 
products = sc.textFile("/data/data-master/retail_db/products")
productsMap = products\
.filter(lambda x: x.split(',')[4] != '')\
.map(lambda x: ((int(x.split(',')[1]), -float(x.split(',')[4] )), x.split(',')[2] ))\
.sortByKey()
Actions
1. reduce()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
text = sc.textFile('/data/sample.csv')
text.first()
2. count()
Purpose: To get the total number of elements in RDD
Syntax: RDD.count()
Input/Output: -/int
USE when: Need to get total elements
NOT to USE when:
Example:
text = sc.textFile('/data/sample.csv')
text.count()
3. first()
Purpose: To get first element of the RDD
Syntax: RDD.first()
Input/Output: -/RDD[0]
USE when: Get only one element from RDD
NOT to USE when:
Example:
text = sc.textFile('/data/sample.csv')
text.first()
4. take()
Purpose: Returns n records from RDD
Syntax: RDD.take(n),  where n is the number of records
Input/Output: int/int
USE when:
- Dataset is large
- Only a fraction of data is required
NOT to USE when:
Example:
text = sc.textFile('/data/sample.csv')
for i in text.take(10): print i
5. top()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
# Get Top records - top, takeOrdered
filteredProducts = products.filter(lambda x: x.split(',')[4] != '' )
topProducts = filteredProducts.top(5, key=lambda x: float(x.split(',')[4] ))
# above is similar as writing below statement with takeOrdered
takeOrderedProducts = filteredProducts.takeOrdered(5, key=lambda x: -float(x.split(',')[4] ))
5. takeOrdered()
Purpose:
Syntax:
Input/Output:
USE when:
NOT to USE when:
Example:
# Get Top records - top, takeOrdered
filteredProducts = products.filter(lambda x: x.split(',')[4] != '' )
topProducts = filteredProducts.top(5, key=lambda x: float(x.split(',')[4] ))
# above is similar as writing below statement with takeOrdered
takeOrderedProducts = filteredProducts.takeOrdered(5, key=lambda x: -float(x.split(',')[4] ))