Chapter 3. A Tour of Spark’s Toolset - karbigdata/BigDnotes GitHub Wiki
Datasets: Type-Safe Structured APIs The first API we’ll describe is a type-safe version of Spark’s structured API called Datasets,
The Dataset class is parameterized with the type of object contained inside: Dataset in Java and Dataset[T] in Scala. For example, a Dataset[Person] will be guaranteed to contain objects of class Person.
// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String,
count: BigInt)
val flightsDF = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
One final advantage is that when you call collect or take on a Dataset, it will collect objects of
the proper type in your Dataset, not DataFrame Rows. This makes it easy to get type safety and
securely perform manipulation in a distributed and a local manner without code changes:
// in Scala
flights
.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
.map(flight_row => flight_row)
.take(5)
flights
.take(5)
.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
.map(fr => Flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME, fr.count + 5))