Spark demo - Nick-Korn/Data-Analysis GitHub Wiki

Spark demo planning and explanation

Goals of the demo

  • Analyzing historical weather data with Spark
  • Learning how to use Spark

About data

The data used in the Spark demo:

  • The data covers one year's worth of weather observations
  • The data was acquired from the Finnish Meteorological Institute (FMI)
  • The data was first acquired in XML format, then parsed and converted to JSON with this script
  • The explanations of the data terms come from a blog post by Joona Lehtomäki
  • Five types of data were found useful, listed here with their units:
    • rrday = amount of precipitation per day (mm)
    • tday = average temperature of the day (degC)
    • tmin = lowest temperature reading of the day (degC)
    • tmax = highest temperature reading of the day (degC)
    • snow = the depth of snow that day (cm)
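The XML-to-JSON step described above can be sketched roughly as follows. This is not the script used for the demo: the XML layout, the element and attribute names, and the sample values are all made up for illustration, and the real FMI response is more deeply nested. The sketch only shows the general idea of grouping per-parameter readings into one JSON object per day.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML with made-up values (not the real FMI layout)
SAMPLE_XML = """
<observations>
  <obs time="2015-01-01" param="tday" value="-3.2"/>
  <obs time="2015-01-01" param="snow" value="12"/>
  <obs time="2015-01-02" param="tday" value="NaN"/>
  <obs time="2015-01-02" param="rrday" value="1.4"/>
</observations>
"""

def xml_to_records(xml_text):
    """Group per-parameter observations into one dict per day."""
    days = {}
    for obs in ET.fromstring(xml_text).iter("obs"):
        value = float(obs.get("value"))
        record = days.setdefault(obs.get("time"), {"date": obs.get("time")})
        # NaN is not valid JSON, so missing readings are stored as null
        record[obs.get("param")] = None if value != value else value
    return [days[day] for day in sorted(days)]

records = xml_to_records(SAMPLE_XML)
# One JSON object per line, the layout Spark's JSON reader expects by default
json_lines = "\n".join(json.dumps(record) for record in records)
print(json_lines)
```

Writing one object per line keeps the file readable by Spark without extra reader options, and storing missing readings as null avoids emitting a bare NaN, which is not valid JSON.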

Drafts for the demo (analysis with Spark)

What could be analyzed:

  • Number, types, days and time frame of the observations
  • A few different time frames
  • Value-specific things
    • highest and lowest temperatures
    • average temperatures of each month and of the whole year
    • days when precipitation happened
    • days when there was snow on the ground
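The value-specific ideas above can be tried out on a handful of rows before involving Spark. The following is a plain-Python sketch with made-up sample values; only the field names (tday, tmin, tmax) come from the data description, and the `date` key is an assumed name. In Spark the same grouping would typically be a `groupBy` on the month followed by an `avg` aggregate.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows; the "date" key is an assumed field name
rows = [
    {"date": "2015-01-10", "tday": -8.0, "tmin": -12.5, "tmax": -4.0},
    {"date": "2015-01-20", "tday": -4.0, "tmin": -9.0, "tmax": -1.0},
    {"date": "2015-07-05", "tday": 18.0, "tmin": 12.0, "tmax": 24.5},
    {"date": "2015-07-15", "tday": 20.0, "tmin": 14.0, "tmax": 26.0},
]

# Collect the daily averages per "YYYY-MM" month key
by_month = defaultdict(list)
for row in rows:
    by_month[row["date"][:7]].append(row["tday"])

monthly_avg = {month: mean(values) for month, values in sorted(by_month.items())}
# Average over all days, not the average of the monthly averages
yearly_avg = mean(row["tday"] for row in rows)

# Days with the highest tmax and the lowest tmin
warmest = max(rows, key=lambda r: r["tmax"])
coldest = min(rows, key=lambda r: r["tmin"])

print(monthly_avg, yearly_avg, warmest["date"], coldest["date"])
```

The yearly figure is computed over the days directly, so months with more observations weigh more than months with gaps.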

Demo implementation process

The following code was used to analyze the data. Zeppelin was used to visualize it.

The following graphs were produced from the data with SQL queries in Zeppelin.
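As a rough illustration of the kind of queries behind the four graphs, the sketch below runs plain SELECT statements against an in-memory SQLite table, used here only as a stand-in for a Spark table. These are not the queries from the demo: the table name `weather`, the column `obs_date`, and all sample values are assumptions. It also assumes that negative rrday and snow values mark "no precipitation" and "no snow", which is why the filters use `> 0`.

```python
import sqlite3

# Made-up rows: (obs_date, rrday, tday, tmin, tmax, snow)
rows = [
    ("2015-01-10", -1.0, -22.0, -25.5, -18.0, 30.0),
    ("2015-01-11", 2.5, -6.0, -9.0, -3.0, 34.0),
    ("2015-07-05", 0.0, 19.0, 13.0, 24.5, -1.0),
    ("2015-07-06", 4.2, 16.0, 12.0, 19.0, -1.0),
]

# SQLite stands in for the Spark table; "weather" is a hypothetical name
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather "
    "(obs_date TEXT, rrday REAL, tday REAL, tmin REAL, tmax REAL, snow REAL)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", rows)

# One query per graph: snow depth, warm days, cold days, precipitation
QUERIES = {
    "snow_on_ground": "SELECT obs_date, snow FROM weather WHERE snow > 0 ORDER BY obs_date",
    "warm_days": "SELECT obs_date, tmax FROM weather WHERE tmax > 20 ORDER BY obs_date",
    "cold_days": "SELECT obs_date, tmin FROM weather WHERE tmin < -20 ORDER BY obs_date",
    "precipitation": "SELECT obs_date, rrday FROM weather WHERE rrday > 0 ORDER BY obs_date",
}

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, found in results.items():
    print(name, found)
```

The SELECT statements use only basic SQL, so they should carry over to a Zeppelin `%sql` paragraph once the JSON data has been registered as a table under the same name.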

Days when there was snow on the ground (cm)

SQL query:

Resulting diagram:

Days with a maximum temperature over 20 degrees Celsius

SQL query:

Resulting diagram:

Days with a minimum temperature under -20 degrees Celsius

SQL query:

Resulting diagram:

Days when precipitation occurred, i.e. when it rained or snowed (mm)

SQL query:

Resulting diagram: