# Spark Dataset
- The Dataset API provides a type-safe, object-oriented programming interface for structured data.
- The Dataset API offers compile-time type safety (the typed API).
- The DataFrame API is the untyped API.
- `DataFrame` is an alias for `Dataset<Row>`.
- Behind both APIs, the Catalyst optimizer (query planner) and the Tungsten execution engine (in-memory binary encoding) optimize applications.
- A Dataset is an immutable, distributed collection of strongly typed objects.
- An Encoder converts between JVM objects and Spark's internal Tungsten binary format, allowing operations to run on serialized data and improving memory and CPU efficiency.
- The Dataset API is richer than the RDD API.
- Encoders are highly optimized and use runtime code generation to build custom bytecode for serialization and deserialization. As a result, they can operate significantly faster than Java or Kryo serialization.
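To make the notes above concrete, here is a minimal, self-contained sketch. The `Person` bean, its fields, and the sample rows are illustrative, not from this wiki: it builds a typed `Dataset<Person>` with a bean encoder, then drops to the untyped `Dataset<Row>` view, showing that a DataFrame is just `Dataset<Row>`.

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EncoderExample {
  // Illustrative Java bean; Encoders.bean needs a public class with
  // getters/setters and a no-arg constructor.
  public static class Person implements Serializable {
    private String name;
    private int age;
    public Person() {}
    public Person(String name, int age) { this.name = name; this.age = age; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("EncoderExample").master("local[*]").getOrCreate();

    // Bean encoder: maps Person fields to Tungsten's binary format.
    Encoder<Person> personEncoder = Encoders.bean(Person.class);
    Dataset<Person> people = spark.createDataset(
        Arrays.asList(new Person("Ada", 36), new Person("Linus", 52)),
        personEncoder);

    // DataFrame is just Dataset<Row>: toDF() drops to the untyped view.
    Dataset<Row> peopleDf = people.toDF();
    peopleDf.printSchema();

    spark.stop();
  }
}
```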
- Create a Dataset

  ```java
  Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));
  ```
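As a quick usage check (the Parquet path below is hypothetical, standing in for the elided one above; `spark` and `Person` are assumed from the sketch earlier):

```java
// Hypothetical path; the wiki elides the real one.
Dataset<Person> people = spark.read()
    .parquet("/data/people.parquet")
    .as(Encoders.bean(Person.class));

people.printSchema();                 // schema comes from the Parquet footer
System.out.println(people.count());  // action that triggers the actual scan
```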
- Transform a Dataset

  ```java
  // The cast to MapFunction (org.apache.spark.api.java.function.MapFunction)
  // disambiguates from the Scala map overload in Java source.
  Dataset<String> names = people.map(
      (MapFunction<Person, String>) p -> p.getName(), Encoders.STRING());
  ```
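Other typed transformations follow the same pattern. Continuing from the illustrative `Person` Dataset above, a sketch of a typed filter and a typed grouped count:

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Typed filter: the predicate receives a real Person object.
Dataset<Person> adults = people.filter(
    (FilterFunction<Person>) p -> p.getAge() >= 18);

// Typed aggregation: group by a key extracted from each object.
people.groupByKey(
        (MapFunction<Person, String>) Person::getName, Encoders.STRING())
    .count()
    .show();
```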
- Untyped operations using Column and functions

  ```java
  Column ageColPlus10 = people.col("age").plus(10);
  ```
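A `Column` on its own is just an unevaluated expression; it does work inside an untyped operation such as `select` or an aggregation. A sketch using the same illustrative schema (`org.apache.spark.sql.functions` is Spark's built-in function library):

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Use the expression in a projection; the result is untyped (Dataset<Row>).
Dataset<Row> projected = people.select(col("name"), ageColPlus10.as("agePlus10"));

// Untyped aggregation with a built-in function.
Dataset<Row> avgAge = people.groupBy(col("name")).agg(avg(col("age")));
```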