Apache Spark Assignment 1 (Submission) - piyushknoldus/Gargantua GitHub Wiki
**Solutions:**
**Q1. Create an RDD (Resilient Distributed Dataset) named `pagecounts` from the input file.**

Solution:

```scala
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@edbc2bf

scala> val pagecounts = sc.textFile("/home/knoldus/Downloads/pagecounts-20151201-220000")
pagecounts: org.apache.spark.rdd.RDD[String] = /home/knoldus/Downloads/pagecounts-20151201-220000 MapPartitionsRDD[3] at textFile at :27
```
**Q2. Get the first 10 records from the data and write down what is printed/displayed.**

Solution:

```scala
scala> val getTenLines = pagecounts.take(10).toList
getTenLines: List[String] = List(aa 112_f.Kr 1 4606, aa 2,4-Dinitrophenylhydrazine/en/Brady%27s_test 1 4680, aa 439_f.Kr 1 4605, aa Celestino_Marchant 1 4611, aa Eduard_O%E2%80%99Rourke/de/Eduard_O%27Rourke 1 4688, aa File:Douris_Man_with_wax_tablet.jpg 1 7848, aa File:UpdatedPlanets2006.jpg 1 8839, aa File:Wikipedia_h2g2.jpg 1 8838, aa Islam_in_Tunisia/en/Shi%27a_Islam_in_Tunisia 1 4675, aa Kk/Special:Imagelist 1 4613)

scala> getTenLines foreach (line => println(line))
aa 112_f.Kr 1 4606
aa 2,4-Dinitrophenylhydrazine/en/Brady%27s_test 1 4680
aa 439_f.Kr 1 4605
aa Celestino_Marchant 1 4611
aa Eduard_O%E2%80%99Rourke/de/Eduard_O%27Rourke 1 4688
aa File:Douris_Man_with_wax_tablet.jpg 1 7848
aa File:UpdatedPlanets2006.jpg 1 8839
aa File:Wikipedia_h2g2.jpg 1 8838
aa Islam_in_Tunisia/en/Shi%27a_Islam_in_Tunisia 1 4675
aa Kk/Special:Imagelist 1 4613
```
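As the output above suggests, each record is a single space-separated line with four fields: project code, page title, request count, and bytes transferred. A minimal local sketch of parsing one such record into a typed value (no Spark needed; the `PageCount` case class and `parse` helper are illustrative, not part of the assignment):

```scala
// Hypothetical helper: one pagecounts record split into its four fields.
case class PageCount(project: String, title: String, requests: Long, bytes: Long)

def parse(line: String): PageCount = {
  val fields = line.split(" ")
  // fields(0) = project code, fields(1) = page title,
  // fields(2) = request count, fields(3) = bytes transferred
  PageCount(fields(0), fields(1), fields(2).toLong, fields(3).toLong)
}

// First record from the output above:
val sample = parse("aa 112_f.Kr 1 4606")
```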
**Q3. How many records in total are in the given data set?**

Solution:

```scala
scala> val totalRecords = pagecounts.count()
totalRecords: Long = 7598006

scala> totalRecords
res2: Long = 7598006
```
**Q4. The first field is the “project code” and contains information about the language of the pages. For example, the project code “en” indicates an English page. Derive an RDD containing only English pages from pagecounts.**

Solution:

```scala
scala> val getEnglishRDD = pagecounts.filter { line =>
     |   line.split(" ")(0) == "en"
     | }
getEnglishRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at :29
```
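The same filter logic can be sketched on a plain Scala `List` to see what it keeps and drops (the sample lines here are made up for illustration; they are not from the real data set):

```scala
// Made-up sample records in the pagecounts format:
// "<project code> <page title> <requests> <bytes>"
val lines = List(
  "aa 112_f.Kr 1 4606",
  "en Main_Page 120 45000",
  "de Hauptseite 80 30000",
  "en Apache_Spark 15 9000"
)

// Keep only lines whose first field (the project code) is exactly "en".
val englishLines = lines.filter(line => line.split(" ")(0) == "en")
```

`RDD.filter` behaves the same way, just lazily and partition by partition across the cluster.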
**Q5. How many records are there for English pages?**

Solution:

```scala
scala> val getEnglishRecords = getEnglishRDD.count()
getEnglishRecords: Long = 2278417

scala> getEnglishRecords
res3: Long = 2278417
```
**Q6. Find the pages that were requested more than 200,000 times in total.**

Solution:

```scala
scala> val findRequestPages = pagecounts.map { line =>
     |   val columns = line.split(" ")
     |   (columns(1), columns(2).toLong)
     | }
findRequestPages: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[5] at map at :29

scala> val findCounts = findRequestPages.reduceByKey((x, y) => x + y)
findCounts: org.apache.spark.rdd.RDD[(String, Long)] = ShuffledRDD[6] at reduceByKey at :31

scala> val totalCount = findCounts.filter(line => line._2 > 200000L).count()
totalCount: Long = 11
```
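The `reduceByKey` step merges all counts that share the same page title. On plain Scala collections, the equivalent is a `groupBy` followed by a per-key sum, which this local sketch illustrates (the page titles and counts below are made up for illustration):

```scala
// Made-up (title, requests) pairs; the same page can appear in several records.
val pageHits = List(
  ("Main_Page", 150000L),
  ("Main_Page", 90000L),
  ("Apache_Spark", 50000L),
  ("Scala", 250000L)
)

// reduceByKey((x, y) => x + y) is equivalent to grouping by key and summing:
val totals: Map[String, Long] = pageHits.groupBy(_._1).map { case (page, hits) =>
  (page, hits.map(_._2).sum)
}

// Pages requested more than 200,000 times in total
// (Main_Page qualifies only after its counts are merged: 150000 + 90000).
val popular = totals.filter { case (_, total) => total > 200000L }
```

On an RDD, `reduceByKey` is preferred over `groupByKey`-then-sum because it combines values on each partition before shuffling, moving far less data across the cluster.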