# Spark Dataset
- The Dataset API provides a type-safe, object-oriented programming interface for structured data.
- The Dataset API offers compile-time type safety (the typed API).
- The DataFrame API is the untyped API.
- `DataFrame` is an alias for `Dataset<Row>`.
- Behind both APIs, the Catalyst optimizer (query planner) and the Tungsten execution engine (in-memory binary encoding) optimize applications.
- A Dataset is an immutable, distributed collection of strongly typed objects.
- An Encoder converts between JVM objects and Spark's internal Tungsten binary format, allowing operations to run on serialized data and improving memory and CPU efficiency.
- The Dataset API is richer than the RDD API.
- Encoders are highly optimized and use runtime code generation to build custom bytecode for serialization and deserialization. As a result, they can operate significantly faster than Java or Kryo serialization.
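To make the notes above concrete, here is a minimal, self-contained sketch. The `Person` bean, its fields, and the sample rows are illustrative, not from this wiki: it builds a typed `Dataset<Person>` with a bean encoder, then drops to the untyped `Dataset<Row>` view, showing that a DataFrame is just `Dataset<Row>`.

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EncoderExample {
  // Illustrative Java bean; Encoders.bean needs a public class with
  // getters/setters and a no-arg constructor.
  public static class Person implements Serializable {
    private String name;
    private int age;
    public Person() {}
    public Person(String name, int age) { this.name = name; this.age = age; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("EncoderExample").master("local[*]").getOrCreate();

    // Bean encoder: maps Person fields to Tungsten's binary format.
    Encoder<Person> personEncoder = Encoders.bean(Person.class);
    Dataset<Person> people = spark.createDataset(
        Arrays.asList(new Person("Ada", 36), new Person("Linus", 52)),
        personEncoder);

    // DataFrame is just Dataset<Row>: toDF() drops to the untyped view.
    Dataset<Row> peopleDf = people.toDF();
    peopleDf.printSchema();

    spark.stop();
  }
}
```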
- Create a Dataset

  ```java
  Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));
  ```
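As a quick usage check (the Parquet path below is hypothetical, standing in for the elided one above; `spark` and `Person` are assumed from the sketch earlier):

```java
// Hypothetical path; the wiki elides the real one.
Dataset<Person> people = spark.read()
    .parquet("/data/people.parquet")
    .as(Encoders.bean(Person.class));

people.printSchema();                 // schema comes from the Parquet footer
System.out.println(people.count());  // action that triggers the actual scan
```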
- Transform a Dataset

  ```java
  // The cast to MapFunction (org.apache.spark.api.java.function.MapFunction)
  // disambiguates from the Scala map overload in Java source.
  Dataset<String> names = people.map(
      (MapFunction<Person, String>) p -> p.getName(), Encoders.STRING());
  ```
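Other typed transformations follow the same pattern. Continuing from the illustrative `Person` Dataset above, a sketch of a typed filter and a typed grouped count:

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Typed filter: the predicate receives a real Person object.
Dataset<Person> adults = people.filter(
    (FilterFunction<Person>) p -> p.getAge() >= 18);

// Typed aggregation: group by a key extracted from each object.
people.groupByKey(
        (MapFunction<Person, String>) Person::getName, Encoders.STRING())
    .count()
    .show();
```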
- Untyped operations using Column and functions

  ```java
  Column ageColPlus10 = people.col("age").plus(10);
  ```
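A `Column` on its own is just an unevaluated expression; it does work inside an untyped operation such as `select` or an aggregation. A sketch using the same illustrative schema (`org.apache.spark.sql.functions` is Spark's built-in function library):

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Use the expression in a projection; the result is untyped (Dataset<Row>).
Dataset<Row> projected = people.select(col("name"), ageColPlus10.as("agePlus10"));

// Untyped aggregation with a built-in function.
Dataset<Row> avgAge = people.groupBy(col("name")).agg(avg(col("age")));
```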