Lab 1 - PoojaShekhar/CS5543-Real-Time-Big-Data-Analytics--Lab-assignments GitHub Wiki

Question

Spark Programming: Write a program (Java or Scala) using IntelliJ for counting sentences and displaying them in sorted order.

Configuration & Implementation

  1. Spark was run in local mode with 2 worker threads (`local[2]`, "minimal parallelism") on a quad-core machine with 16 GB of RAM.
  2. A text file was read through the SparkContext as an `RDD[String]`.
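  3. Sentences were extracted with the `flatMap()` transformation, splitting on a period followed by a newline (`.\n`) as the delimiter: `flatMap(line => line.split("\\.\n"))`. The period is escaped because `split()` takes a regular expression, in which a bare `.` matches any character.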
  4. In the next stage, each sentence was mapped to a (sentence, 1) pair with `map()`: `map(word => (word, 1))`
  5. Identical sentences were grouped and their counts summed in the reduce phase, using the `reduceByKey()` transformation: `reduceByKey(_ + _)`
  6. The output was sorted lexicographically by sentence (the key), the partitions were coalesced into one, and the resulting RDD was saved as text under the `output` directory. The following chain of operations was used: `sortBy(_._1).coalesce(1).saveAsTextFile("output")`
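
The steps above can be put together roughly as follows. This is a minimal sketch rather than the exact lab code: the object name `SentenceCount`, the helper `splitSentences`, and the input path `input.txt` are assumptions made for illustration. Because `textFile` yields one element per line with the line terminator removed, the sketch splits on an escaped period alone and assumes each sentence sits on a single line.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SentenceCount {
  // Hypothetical helper: split one line into trimmed, non-empty sentences.
  // String.split takes a regular expression, so the period is escaped;
  // an unescaped "." would match any character.
  def splitSentences(line: String): Seq[String] =
    line.split("\\.").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // local[2]: local mode with two worker threads
    val conf = new SparkConf().setAppName("SentenceCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.textFile("input.txt")                    // assumed path; RDD[String], one element per line
        .flatMap(line => splitSentences(line))    // one element per sentence
        .map(sentence => (sentence, 1))           // pair each sentence with a count of 1
        .reduceByKey(_ + _)                       // sum the counts of identical sentences
        .sortBy(_._1)                             // order by the sentence text (the key)
        .coalesce(1)                              // merge all partitions into one
        .saveAsTextFile("output")                 // fails if the "output" directory already exists
    } finally {
      sc.stop()
    }
  }
}
```

`saveAsTextFile("output")` creates a directory named `output` rather than a single file; with one partition the results land in one part file inside it, each line written as `(sentence,count)`.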

Screenshots

## Input file

[Screenshot: Input file]

## Code & Console Output

[Screenshot: Code]

## Output File

[Screenshot: Output]