Apache Spark Assignment 1 (Submission) - piyushknoldus/Gargantua GitHub Wiki
**Solutions:**
**Q1. Create an RDD (Resilient Distributed Dataset) named `pagecounts` from the input file.**

Solution:

```scala
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@edbc2bf

scala> val pagecounts = sc.textFile("/home/knoldus/Downloads/pagecounts-20151201-220000")
pagecounts: org.apache.spark.rdd.RDD[String] = /home/knoldus/Downloads/pagecounts-20151201-220000 MapPartitionsRDD[3] at textFile at :27
```
**Q2. Get the first 10 records from the data and write down what is printed/displayed.**

Solution:

```scala
scala> val getTenLines = pagecounts.take(10).toList
getTenLines: List[String] = List(aa 112_f.Kr 1 4606, aa 2,4-Dinitrophenylhydrazine/en/Brady%27s_test 1 4680, aa 439_f.Kr 1 4605, aa Celestino_Marchant 1 4611, aa Eduard_O%E2%80%99Rourke/de/Eduard_O%27Rourke 1 4688, aa File:Douris_Man_with_wax_tablet.jpg 1 7848, aa File:UpdatedPlanets2006.jpg 1 8839, aa File:Wikipedia_h2g2.jpg 1 8838, aa Islam_in_Tunisia/en/Shi%27a_Islam_in_Tunisia 1 4675, aa Kk/Special:Imagelist 1 4613)

scala> getTenLines foreach (line => println(line))
aa 112_f.Kr 1 4606
aa 2,4-Dinitrophenylhydrazine/en/Brady%27s_test 1 4680
aa 439_f.Kr 1 4605
aa Celestino_Marchant 1 4611
aa Eduard_O%E2%80%99Rourke/de/Eduard_O%27Rourke 1 4688
aa File:Douris_Man_with_wax_tablet.jpg 1 7848
aa File:UpdatedPlanets2006.jpg 1 8839
aa File:Wikipedia_h2g2.jpg 1 8838
aa Islam_in_Tunisia/en/Shi%27a_Islam_in_Tunisia 1 4675
aa Kk/Special:Imagelist 1 4613
```
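As the output above suggests, each record is a single space-separated line with four fields: project code, page title, request count, and bytes transferred. A minimal local sketch of parsing one such record into a typed value (no Spark needed; the `PageCount` case class and `parse` helper are illustrative, not part of the assignment):

```scala
// Hypothetical helper: one pagecounts record split into its four fields.
case class PageCount(project: String, title: String, requests: Long, bytes: Long)

def parse(line: String): PageCount = {
  val fields = line.split(" ")
  // fields(0) = project code, fields(1) = page title,
  // fields(2) = request count, fields(3) = bytes transferred
  PageCount(fields(0), fields(1), fields(2).toLong, fields(3).toLong)
}

// First record from the output above:
val sample = parse("aa 112_f.Kr 1 4606")
```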
**Q3. How many records in total are in the given data set?**

Solution:

```scala
scala> val totalRecords = pagecounts.count()
totalRecords: Long = 7598006

scala> totalRecords
res2: Long = 7598006
```
**Q4. The first field is the “project code” and contains information about the language of the pages. For example, the project code “en” indicates an English page. Derive an RDD containing only English pages from pagecounts.**

Solution:

```scala
scala> val getEnglishRDD = pagecounts.filter { line =>
     |   line.split(" ")(0) == "en"
     | }
getEnglishRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at :29
```
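The same filter logic can be sketched on a plain Scala `List` to see what it keeps and drops (the sample lines here are made up for illustration; they are not from the real data set):

```scala
// Made-up sample records in the pagecounts format:
// "<project code> <page title> <requests> <bytes>"
val lines = List(
  "aa 112_f.Kr 1 4606",
  "en Main_Page 120 45000",
  "de Hauptseite 80 30000",
  "en Apache_Spark 15 9000"
)

// Keep only lines whose first field (the project code) is exactly "en".
val englishLines = lines.filter(line => line.split(" ")(0) == "en")
```

`RDD.filter` behaves the same way, just lazily and partition by partition across the cluster.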
**Q5. How many records are there for English pages?**

Solution:

```scala
scala> val getEnglishRecords = getEnglishRDD.count()
getEnglishRecords: Long = 2278417

scala> getEnglishRecords
res3: Long = 2278417
```
**Q6. Find the pages that were requested more than 200,000 times in total.**

Solution:

```scala
scala> val findRequestPages = pagecounts.map { line =>
     |   val columns = line.split(" ")
     |   (columns(1), columns(2).toLong)
     | }
findRequestPages: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[5] at map at :29

scala> val findCounts = findRequestPages.reduceByKey((x, y) => x + y)
findCounts: org.apache.spark.rdd.RDD[(String, Long)] = ShuffledRDD[6] at reduceByKey at :31

scala> val totalCount = findCounts.filter(line => line._2 > 200000L).count()
totalCount: Long = 11
```
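The `reduceByKey` step merges all counts that share the same page title. On plain Scala collections, the equivalent is a `groupBy` followed by a per-key sum, which this local sketch illustrates (the page titles and counts below are made up for illustration):

```scala
// Made-up (title, requests) pairs; the same page can appear in several records.
val pageHits = List(
  ("Main_Page", 150000L),
  ("Main_Page", 90000L),
  ("Apache_Spark", 50000L),
  ("Scala", 250000L)
)

// reduceByKey((x, y) => x + y) is equivalent to grouping by key and summing:
val totals: Map[String, Long] = pageHits.groupBy(_._1).map { case (page, hits) =>
  (page, hits.map(_._2).sum)
}

// Pages requested more than 200,000 times in total
// (Main_Page qualifies only after its counts are merged: 150000 + 90000).
val popular = totals.filter { case (_, total) => total > 200000L }
```

On an RDD, `reduceByKey` is preferred over `groupByKey`-then-sum because it combines values on each partition before shuffling, moving far less data across the cluster.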