Lab 1 - PoojaShekhar/CS5543-Real-Time-Big-Data-Analytics--Lab-assignments GitHub Wiki
Question
Spark Programming: Write a program (Java or Scala) using IntelliJ that counts sentences and displays them in sorted order.
Configuration & Implementation
- Spark was run in local mode with 2 local threads ("minimal parallelism") on a quad-core processor with 16 GB of RAM.
- A text file was read as input through the SparkContext as an RDD[String].
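- Sentences were extracted using the split function, with a period followed by a line break (.\n) as the delimiter.
The flatMap() transformation was used for this. Note that split interprets its argument as a regular expression, so the unescaped . in this pattern matches any character, not only a literal period.
flatMap(line => line.split(".\n"))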
- Each sentence was paired with a count of one in the next stage.
map() was used to perform this.
map(sentence => (sentence, 1))
- Identical sentences were grouped together and their counts summed in the reduce phase.
The reduceByKey() transformation was used for this.
reduceByKey(_ + _)
- Output sentences were sorted by key (the sentence text, in lexicographic order), the outputs of all partitions were coalesced into a single partition, and the resulting RDD was saved as a text file. The following sequence of operations was used: sortBy(_._1).coalesce(1).saveAsTextFile("output")
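The steps above can be put together roughly as follows. This is a minimal sketch, not the code from the screenshots: the object name `SentenceCount`, the helper `splitSentences`, the input path `input.txt`, and the master setting `local[2]` (inferred from the "2 local threads" description) are all assumptions made for illustration. It also splits on an escaped literal period rather than on `.\n`, so the delimiter is not read as the regex wildcard, and trims and drops empty fragments.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SentenceCount {
  // Hypothetical helper: split text into trimmed, non-empty sentences.
  // The period is escaped because String.split takes a regular expression.
  def splitSentences(text: String): Seq[String] =
    text.split("\\.").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 worker threads
    val conf = new SparkConf().setAppName("SentenceCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.textFile("input.txt")                    // assumed input path
        .flatMap(line => splitSentences(line))    // one element per sentence
        .map(sentence => (sentence, 1))           // pair each sentence with 1
        .reduceByKey(_ + _)                       // sum counts per sentence
        .sortBy(_._1)                             // sort by sentence text
        .coalesce(1)                              // write a single part file
        .saveAsTextFile("output")                 // fails if "output" exists
    } finally {
      sc.stop()
    }
  }
}
```

Because `textFile` yields one element per line, a sentence that wraps across two lines would be counted as two fragments in this sketch; handling that case would need a different way of reading the input.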
Screenshots
## Input file
## Code & Console Output
## Output File