Spark Dataset - keshavbaweja-git/guides GitHub Wiki

  • The Dataset API provides a type-safe, object-oriented programming interface for structured data
  • The Dataset API offers compile-time type safety (the typed API)
  • The DataFrame API is the untyped API
  • DataFrame is an alias for Dataset<Row>
  • Behind both APIs, the Catalyst optimizer (query planner) and the Tungsten execution engine (in-memory binary encoding) optimize applications
  • A Dataset is an immutable, distributed collection of strongly typed objects
  • An Encoder converts between JVM objects and Spark's internal Tungsten binary format, allowing many operations to work directly on serialized data and improving memory and CPU utilization
  • The Dataset API is richer than the RDD API
  • Encoders are highly optimized and use runtime code generation to build custom bytecode for serialization and deserialization; as a result, they can be significantly faster than Java or Kryo serialization
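The typed/untyped distinction above can be sketched in Java. This is a minimal illustration, not from this page: the `Person` bean, the `people.parquet` path, and the local master URL are all assumptions. Note that `Encoders.bean` requires a JavaBean with a no-arg constructor and getters/setters.

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedVsUntyped {

    // Hypothetical JavaBean used with Encoders.bean.
    public static class Person implements java.io.Serializable {
        private String name;
        private int age;
        public Person() {}
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")       // assumed local session for the sketch
                .appName("typed-vs-untyped")
                .getOrCreate();

        // Typed API: the filter predicate is checked at compile time;
        // a typo like p.getAgee() would fail to compile.
        Dataset<Person> people = spark.read().parquet("people.parquet")
                .as(Encoders.bean(Person.class));
        Dataset<Person> adults =
                people.filter((FilterFunction<Person>) p -> p.getAge() >= 18);

        // Untyped API: column names are strings, so a typo like "agee"
        // only surfaces at runtime as an AnalysisException.
        Dataset<Row> df = spark.read().parquet("people.parquet");
        Dataset<Row> adultsDf = df.filter(df.col("age").geq(18));

        adults.show();
        adultsDf.show();
        spark.stop();
    }
}
```

Both branches produce the same query plan after Catalyst optimization; the difference is only where errors are caught.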

API

  • Create a Dataset
    Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));

  • Transform a Dataset
    Dataset<String> names = people.map((MapFunction<Person, String>) p -> p.getName(), Encoders.STRING());

  • Untyped operations using Column and functions
    Column ageColPlus10 = people.col("age").plus(10);
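A `Column` expression like the one above is used inside untyped operations such as `select` and `groupBy`/`agg`. A brief sketch, reusing the `people` Dataset from the creation example (column names assume a `name`/`age` schema, which is an assumption, not stated on this page):

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Untyped projection: the result of column-based operations is Dataset<Row>,
// i.e. a DataFrame, even when the input was a typed Dataset<Person>.
Dataset<Row> projected =
        people.select(col("name"), col("age").plus(10).as("agePlus10"));

// Aggregation with built-in functions from org.apache.spark.sql.functions.
Dataset<Row> avgByName =
        people.groupBy(col("name")).agg(avg(col("age")).as("avgAge"));
```

To get back to a typed Dataset after such operations, apply `.as(...)` with a suitable Encoder again.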
