Dataframe Schema - ignacio-alorre/Spark GitHub Wiki
The schema information, and the optimizations it enables, is one of the core differences between Spark SQL and core Spark. Inspecting the schema is especially useful for DataFrames, since you don't have the templated type you do with RDDs or Datasets.
Schemas are normally handled automatically by Spark SQL: they are either inferred when loading the data or computed from the parent DataFrames and the transformation being applied.
DataFrames expose the schema in both human-readable and programmatic formats. printSchema() prints the schema of a DataFrame to the console. For programmatic use, you can get the schema by calling schema.
Consider a DataFrame loaded from the following JSON record:
{"name":"mission","pandas":[{"id":1,"zip":"94110","pt":"giant", "happy":true,
"attributes":[0.4,0.5]}]}
Its schema would look like:
df.printSchema()
root
|-- name: string (nullable = true)
|-- pandas: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- zip: string (nullable = true)
| | |-- pt: string (nullable = true)
| | |-- happy: boolean (nullable = false)
| | |-- attributes: array (nullable = true)
| | | |-- element: double (containsNull = false)
df.schema
org.apache.spark.sql.types.StructType = StructType(
StructField(name,StringType,true),
StructField(pandas,
ArrayType(
StructType(StructField(id,LongType,false),
StructField(zip,StringType,true),
StructField(pt,StringType,true),
StructField(happy,BooleanType,false),
StructField(attributes,ArrayType(DoubleType,false),true)),
true),true))
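Going in the other direction, the same schema can be constructed by hand with StructType and StructField and supplied when loading data, which skips inference entirely. The following is a minimal sketch; the file path "pandas.json" is a hypothetical example, not from the original page:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// A hand-built StructType equivalent to the inferred schema shown above.
val pandasSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("pandas", ArrayType(
    StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("zip", StringType, nullable = true),
      StructField("pt", StringType, nullable = true),
      StructField("happy", BooleanType, nullable = false),
      StructField("attributes",
        ArrayType(DoubleType, containsNull = false), nullable = true)
    )),
    containsNull = true
  ), nullable = true)
))

val spark = SparkSession.builder().getOrCreate()

// Providing the schema up front avoids an extra pass over the data
// that schema inference would otherwise require.
val df = spark.read.schema(pandasSchema).json("pandas.json")
df.printSchema()
```

Note that each StructField carries a nullability flag and ArrayType carries a containsNull flag, matching the (nullable = ...) and (containsNull = ...) annotations in the printSchema() output.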