Spark Libraries - salmanbaig8/imp GitHub Wiki
Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing) {Apache Spark}
Spark SQL: relational queries expressed in SQL, HiveQL, or Scala
SchemaRDD: -Row objects -schema -created from: existing RDD, Parquet file, JSON dataset, HiveQL against Apache Hive
SQLContext: -created from a SparkContext (Java = JavaSparkContext / JavaSQLContext)
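A minimal sketch of creating an SQLContext from a SparkContext (Spark 1.x API; the app name and `local[2]` master are assumptions for a local run):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Build a SparkContext, then wrap it in an SQLContext for relational queries
val conf = new SparkConf().setAppName("SparkSQLExample").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
```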
Two ways to create a SchemaRDD from a data source:
- Reflection: infers the schema from an existing RDD of case-class objects
- Programmatic interface: builds a schema and applies it to an existing RDD; gives more control when the schema is not known in advance
Using reflection: -Create an RDD of Person objects: val people = sc.textFile("peoples.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) -Register the RDD as a table: people.registerTempTable("people") -Run SQL statements using the sql method provided by the SQLContext: val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
- The results of the queries are SchemaRDDs; normal RDD operations also work on them: teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
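The reflection steps above, assembled into one sketch (Spark 1.0–1.2 API; assumes `sc` and `sqlContext` already exist and a comma-separated "peoples.txt" file of name,age rows):

```scala
// Case class defines the schema via reflection
case class Person(name: String, age: Int)

val people = sc.textFile("peoples.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Implicit conversion from RDD[Person] to SchemaRDD
import sqlContext.createSchemaRDD
people.registerTempTable("people")

// Query the table, then use normal RDD operations on the result
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```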
Programmatic interface: -Create an RDD: val people = sc.textFile("peoples.txt") -3 steps to create a SchemaRDD: 1. Create an RDD of Rows from the original RDD 2. Create the schema, represented by a StructType matching the structure of the Rows from step 1: val schemaString = "name age"; val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) 3. Apply the schema to the RDD of Rows using the applySchema method: val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)); val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema) -Register the SchemaRDD as a table: peopleSchemaRDD.registerTempTable("people") -Run SQL statements using the sql method provided by the SQLContext: val results = sqlContext.sql("SELECT name FROM people"); results.map(t => "Name: " + t(0)).collect().foreach(println)
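The programmatic steps assembled into one sketch (Spark 1.1/1.2 API, where StructType/StructField/StringType/Row live in org.apache.spark.sql; assumes `sc` and `sqlContext` already exist):

```scala
import org.apache.spark.sql.{Row, StructField, StructType, StringType}

val people = sc.textFile("peoples.txt")

// Step 2: build the schema from a string of field names
val schemaString = "name age"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Steps 1 and 3: convert the text RDD into Rows, then apply the schema
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)

peopleSchemaRDD.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
```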
Spark Streaming: .scalable, high-throughput, fault-tolerant stream processing of live data streams .Receives live input data and divides it into small batches, which are processed and returned as batches .DStream - a sequence of RDDs .Supports Scala and Java .Receives data from Kafka, Flume, HDFS/S3, Kinesis, Twitter .Pushes data to HDFS/databases/dashboards
How it works? .the input stream (DStream) goes into Spark Streaming .is broken up into batches .each batch is fed into the Spark engine for processing .the final results are generated as streams of batches
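A minimal streaming word-count sketch showing the flow above (assumes `sc` exists; the 1-second batch interval and the localhost:9999 socket source are illustrative assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Divide the input stream into 1-second batches
val ssc = new StreamingContext(sc, Seconds(1))

// DStream from a text socket (hypothetical source for the example)
val lines = ssc.socketTextStream("localhost", 9999)

// Each batch is processed by the Spark engine like a normal RDD
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()
```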
MLlib (in development) contains common algorithms and utilities: .classification, regression, clustering, collaborative filtering, dimensionality reduction
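As one clustering example, a k-means sketch with MLlib (assumes `sc` exists; the input file "kmeans_data.txt" of space-separated numbers, k = 2, and 20 iterations are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector
val data = sc.textFile("kmeans_data.txt")
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Train a k-means model with 2 clusters
val model = KMeans.train(parsed, 2, 20)
model.clusterCenters.foreach(println)
```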
GraphX: -graphs and graph-parallel computation -use cases: social networks and language modelling
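A small sketch of building a graph with GraphX, e.g. a follower graph as in a social network (assumes `sc` exists; the vertex names and edge labels are made up for illustration):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (id, name); Edges: (srcId, dstId, relationship)
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

// Build the property graph and run a graph-parallel operation
val graph = Graph(vertices, edges)
graph.inDegrees.collect().foreach(println)
```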