# Spark Practice from pyspark book

## Functions learned

* `map` - applies the provided function to each element of the RDD
* `sortBy` - sorts the elements according to the given `keyfunc`
* `takeOrdered` - takes two parameters: `num`, the number of elements to return, and a `key` function that defines the ordering (see the sketch after this list)

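A minimal sketch of all three on a toy RDD, assuming a live `SparkContext` named `sc` (the PySpark shell and most notebooks create one for you):

```python
# Toy RDD; `sc` is the SparkContext provided by the PySpark shell
nums = sc.parallelize([5, 1, 4, 2, 3])

# map: apply the function to every element
nums.map(lambda x: x * x).collect()          # [25, 1, 16, 4, 9]

# sortBy: order elements by keyfunc (negate for descending order)
nums.sortBy(keyfunc=lambda x: -x).collect()  # [5, 4, 3, 2, 1]

# takeOrdered: the `num` smallest elements under `key`
nums.takeOrdered(num=2, key=lambda x: x)     # [1, 2]
```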
```python
# studentMarksData: records of the form [name, year, mark1, mark2],
# assumed to be defined earlier in the session
studentMarksDataRDD = sc.parallelize(studentMarksData, 4)  # distribute over 4 partitions
studentMarksDataRDD.take(2)                                # peek at the first two records

# Replace the two marks with their mean: [name, year, mean]
studentMarksMean = studentMarksDataRDD.map(lambda x: [x[0], x[1], (x[2] + x[3]) / 2])
studentMarksMean.collect()

# Keep only the second-year records
secondYearMarks = studentMarksMean.filter(lambda x: "year2" in x)
print(secondYearMarks.toDebugString())                     # show the RDD's lineage

# Sort by mean mark, highest first
sortedMarksData = secondYearMarks.sortBy(keyfunc=lambda x: -x[2])

# The three students with the lowest mean mark
bottomThreeStudents = secondYearMarks.takeOrdered(num=3, key=lambda x: x[2])
```
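For context, here is a hypothetical `studentMarksData` (not from the original notes) in the shape the lambdas above expect, along with the results the pipeline would produce on it:

```python
# Hypothetical sample data: [name, year, mark1, mark2]
studentMarksData = [
    ["Asha",   "year1", 70, 80],
    ["Bala",   "year2", 60, 90],
    ["Chitra", "year2", 85, 95],
    ["Dinesh", "year2", 40, 50],
    ["Esha",   "year1", 90, 70],
]

# With this data, the pipeline above would yield:
# secondYearMarks.collect() -> [['Bala', 'year2', 75.0],
#                               ['Chitra', 'year2', 90.0],
#                               ['Dinesh', 'year2', 45.0]]
# bottomThreeStudents       -> [['Dinesh', 'year2', 45.0],
#                               ['Bala', 'year2', 75.0],
#                               ['Chitra', 'year2', 90.0]]
```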