Apache Spark SQL programming guide notes 1
- A Dataset is a distributed collection of data. It is a new interface, added in Spark 1.6, that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
- A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database.
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
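All of the examples in these notes assume an existing SparkSession called spark (the spark-shell creates one automatically). A minimal sketch of building one in a standalone application; the application name and master URL here are placeholders:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("spark-sql-notes")   // placeholder application name
  .master("local[*]")           // run locally; point at a cluster in real use
  .getOrCreate()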
As an example, the following creates a DataFrame based on the content of a JSON file:
val df = spark.read.json("/home/osboxes/Sparkdatafile/employee.json")
df.show()
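A DataFrame can likewise be created from an existing RDD. A minimal sketch, using a hypothetical Employee case class and a few in-line rows instead of a real data source:
// Hypothetical case class used only for this example
case class Employee(name: String, age: Long)
import spark.implicits._   // brings the RDD-to-DataFrame conversions into scope
val rddDF = spark.sparkContext
  .parallelize(Seq(Employee("Michael", 29), Employee("Andy", 30)))
  .toDF()
rddDF.show()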
DataFrame Operations
// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// Select people older than 21
df.filter($"age" > 21).show()
// Count people by age
df.groupBy("age").count().show()
- Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or for transmission over the network.
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
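Encoders can also be obtained explicitly from org.apache.spark.sql.Encoders, for example to inspect the schema an encoder derives, or to fall back to Kryo for classes that have no built-in encoder. A small sketch reusing the Person case class above; passing the Kryo encoder explicitly is shown only as an illustration:
import org.apache.spark.sql.{Encoder, Encoders}
// Product encoder for a case class; its schema mirrors the class fields
val personEncoder: Encoder[Person] = Encoders.product[Person]
personEncoder.schema.printTreeString()
// Generic (but opaque, binary) Kryo encoder for the same class
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]
val kryoDS = spark.createDataset(Seq(Person("Justin", 19)))(kryoEncoder)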
- Spark SQL introduces a tabular data abstraction called Dataset (previously known as DataFrame; since Spark 2.0, DataFrame is simply an alias for Dataset[Row] in Scala).
Spark SQL defines three types of functions, all of which are illustrated in the sketch after this list:
- Built-in functions or user-defined functions (UDFs), which take values from a single row as input and generate a single return value for every input row.
- Aggregate functions, which operate on a group of rows and calculate a single return value per group.
- Windowed aggregates (window functions), which operate on a group of rows and calculate a single return value for each row in the group.
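A minimal sketch showing one function of each kind, reusing the df loaded earlier; the names upperName, upper_name, people and rank_by_age are arbitrary choices for the example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// 1. A user-defined function, applied row by row (assumes non-null names)
val upperName = udf((name: String) => name.toUpperCase)
df.select($"name", upperName($"name").as("upper_name")).show()
// 2. An aggregate function: one value per group
df.groupBy($"age").agg(count($"name").as("people")).show()
// 3. A window function: one value per row, computed over a group of rows
// (a real query would normally also partition the window)
val byAge = Window.orderBy($"age")
df.withColumn("rank_by_age", rank().over(byAge)).show()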