Spark Practice from pyspark book - rohith-nallam/Bigdata GitHub Wiki
Functions learned
- map - applies the given function to each element of the RDD and returns a new RDD of the results
- sortBy - sorts the elements according to the given keyfunc
- takeOrdered - takes two parameters: num, the number of elements to return, and key, the function used to order them; it returns the num smallest elements by that key
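The three operations above have close plain-Python analogues, which can make their semantics easier to see. This is only an illustrative sketch on a made-up in-memory list, not Spark code:

```python
import heapq

data = [4, 1, 3, 2]

# map: apply a function to every element
squared = list(map(lambda x: x * x, data))

# sortBy(keyfunc): sort elements by the value keyfunc returns
# (negating the key sorts in descending order)
descending = sorted(data, key=lambda x: -x)

# takeOrdered(num, key): the num smallest elements by key
smallest_two = heapq.nsmallest(2, data, key=lambda x: x)

print(squared, descending, smallest_two)
```

On an RDD the same calls are distributed across partitions, but the results match these local versions.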
# studentMarksData is assumed to be defined earlier as a list of
# [name, year, mark1, mark2] records
studentMarksDataRDD = sc.parallelize(studentMarksData, 4)  # split into 4 partitions
studentMarksDataRDD.take(2)  # peek at the first two records
# replace the two marks with their mean
studentMarksMean = studentMarksDataRDD.map(lambda x: [x[0], x[1], (x[2] + x[3]) / 2])
studentMarksMean.collect()
# keep only the records whose year field is "year2"
secondYearMarks = studentMarksMean.filter(lambda x: "year2" in x)
print(secondYearMarks.toDebugString())  # method call: prints the RDD lineage
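One subtlety in the filter: each record here is a list, so `"year2" in x` tests for an exact element, not a substring. A quick sketch with a hypothetical record shows the difference:

```python
# Hypothetical record shape: [name, year, mean]
record = ["alice", "year2", 78.5]

# membership on a list matches whole elements only
has_year2 = "year2" in record   # exact element present
has_year = "year" in record     # no element equal to "year"

# membership on a string is a substring test instead
substring = "year" in "year2"

print(has_year2, has_year, substring)
```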
# negate the mean so the sort runs highest-first
sortedMarksData = secondYearMarks.sortBy(keyfunc=lambda x: -x[2])
# the three students with the lowest mean
bottomThreeStudents = secondYearMarks.takeOrdered(num=3, key=lambda x: x[2])
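The whole pipeline can be sketched end to end with plain Python lists. The sample records below are invented for illustration, assuming the [name, year, mark1, mark2] shape that the RDD code indexes into:

```python
import heapq

# hypothetical sample data matching the assumed record shape
studentMarksData = [
    ["amy",  "year1", 62, 74],
    ["bob",  "year2", 55, 65],
    ["carl", "year2", 81, 79],
    ["dina", "year2", 70, 90],
    ["eve",  "year1", 88, 92],
]

# map: replace the two marks with their mean
studentMarksMean = [[x[0], x[1], (x[2] + x[3]) / 2] for x in studentMarksData]

# filter: keep only second-year records
secondYearMarks = [x for x in studentMarksMean if "year2" in x]

# sortBy with a negated key: highest mean first
sortedMarksData = sorted(secondYearMarks, key=lambda x: -x[2])

# takeOrdered: the three lowest means
bottomThreeStudents = heapq.nsmallest(3, secondYearMarks, key=lambda x: x[2])

print(sortedMarksData)
print(bottomThreeStudents)
```

Swapping the list comprehensions, `sorted`, and `heapq.nsmallest` for `map`, `filter`, `sortBy`, and `takeOrdered` on an RDD gives the distributed version with the same results.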