Spark demo - Nick-Korn/Data-Analysis GitHub Wiki
Spark demo planning and explanation
Goals of the demo
- Analyzing historical weather data with Spark
- Learning how to use Spark
About data
The data that will be used in the Spark demo:
- Data includes a year's worth of weather data
- Data was acquired from the Finnish Meteorological Institute (FMI)
- The data was first acquired in XML format, then parsed and reformatted to .json with this script
- Explanations of the data terms come from a blog post by Joona Lehtomäki
- Five types of data were found useful, listed here with their units:
  - rrday = amount of precipitation per day (mm)
  - tday = the day's average temperature (degC)
  - tmin = lowest temperature reading of the day (degC)
  - tmax = highest temperature reading of the day (degC)
  - snow = the depth of snow that day (cm)
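To make the five parameters concrete, here is a minimal sketch of what one day's record could look like and how it can be read in Python. The JSON layout, the `date` key, and the values are assumptions made up for illustration; the file produced by the conversion script may be structured differently.

```python
import json

# Hypothetical layout: one JSON object per day, keyed by the five parameter names.
# The values are invented for illustration and are not real FMI observations.
sample = """
[
  {"date": "2015-01-01", "rrday": 0.4, "tday": -3.2, "tmin": -6.1, "tmax": -0.5, "snow": 12},
  {"date": "2015-07-01", "rrday": 0.0, "tday": 17.8, "tmin": 11.3, "tmax": 23.9, "snow": 0}
]
"""

# Units as listed above.
UNITS = {"rrday": "mm", "tday": "degC", "tmin": "degC", "tmax": "degC", "snow": "cm"}

records = json.loads(sample)


def describe(record):
    """Format one daily record with the unit of each parameter."""
    parts = [f"{name}={record[name]} {unit}" for name, unit in UNITS.items()]
    return f"{record['date']}: " + ", ".join(parts)


for record in records:
    print(describe(record))
```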
Drafts for the demo (analysis with Spark)
What things could be analyzed:
- Amount, types, days and time frame of observations
- A few different time frames
- Value-specific things
  - highest and lowest temps
  - average temps of months and the year
  - days when precipitation happened
  - days when there was snow on the ground
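The ideas listed above can be sketched in a few lines. The example uses plain Python rather than Spark so that it runs anywhere, and the rows, the `date` key, and the variable names are assumptions invented for illustration, not real FMI observations.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows; the real data covers a full year of daily records.
rows = [
    {"date": "2015-01-10", "tday": -8.0, "tmin": -12.0, "tmax": -4.0, "rrday": 1.5, "snow": 20},
    {"date": "2015-01-20", "tday": -2.0, "tmin": -5.0, "tmax": 1.0, "rrday": 0.0, "snow": 18},
    {"date": "2015-07-05", "tday": 18.0, "tmin": 12.0, "tmax": 24.0, "rrday": 3.0, "snow": 0},
    {"date": "2015-07-15", "tday": 20.0, "tmin": 14.0, "tmax": 26.0, "rrday": 0.0, "snow": 0},
]

# Amount and time frame of observations (ISO dates sort correctly as strings).
observation_count = len(rows)
first_day = min(r["date"] for r in rows)
last_day = max(r["date"] for r in rows)

# Highest and lowest temperatures of the period.
hottest = max(rows, key=lambda r: r["tmax"])
coldest = min(rows, key=lambda r: r["tmin"])

# Average temperature per month (key "YYYY-MM") and over the whole period.
by_month = defaultdict(list)
for r in rows:
    by_month[r["date"][:7]].append(r["tday"])
monthly_avg = {month: mean(values) for month, values in sorted(by_month.items())}
yearly_avg = mean(r["tday"] for r in rows)

# Days with precipitation and days with snow on the ground.
rainy_days = [r["date"] for r in rows if r["rrday"] > 0]
snowy_days = [r["date"] for r in rows if r["snow"] > 0]

print(observation_count, first_day, last_day)
print(hottest["date"], coldest["date"])
print(monthly_avg, yearly_avg)
print(rainy_days, snowy_days)
```

In Spark the same questions would typically be answered with filters and grouped aggregations over a DataFrame or with SQL queries.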
Demo implementation process
The following code was used to analyze the data. Zeppelin was used to visualize the data.
The following graphs were produced from the data with SQL queries in Zeppelin.
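Each graph below comes from filtering the daily records by a threshold. A minimal sketch of four such filters follows. These are not the queries used on this page: the sketch runs against an in-memory SQLite database from Python's standard library instead of Spark, so it can be tried without a cluster, and the table name `weather`, the column `obs_date`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the weather table; SQLite is used here instead of Spark.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather "
    "(obs_date TEXT, rrday REAL, tday REAL, tmin REAL, tmax REAL, snow REAL)"
)
# Invented sample rows, not real FMI observations.
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-01-10", 1.5, -18.0, -24.0, -12.0, 20.0),
        ("2015-03-01", 0.0, -1.0, -4.0, 2.0, 5.0),
        ("2015-07-05", 3.0, 18.0, 12.0, 24.0, 0.0),
        ("2015-10-12", 0.0, 6.0, 2.0, 9.0, 0.0),
    ],
)

# One hypothetical query per graph title.
QUERIES = {
    "snow on the ground (cm)":
        "SELECT obs_date, snow FROM weather WHERE snow > 0 ORDER BY obs_date",
    "daily max over 20 degC":
        "SELECT obs_date, tmax FROM weather WHERE tmax > 20 ORDER BY obs_date",
    "daily min under -20 degC":
        "SELECT obs_date, tmin FROM weather WHERE tmin < -20 ORDER BY obs_date",
    "precipitation (mm)":
        "SELECT obs_date, rrday FROM weather WHERE rrday > 0 ORDER BY obs_date",
}

results = {title: conn.execute(sql).fetchall() for title, sql in QUERIES.items()}
for title, matching_rows in results.items():
    print(title, matching_rows)
```

In Zeppelin, comparable `SELECT` statements would presumably go in a `%sql` paragraph against a table registered from the Spark data, with the built-in chart options rendering the result.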
Days when there was snow on the ground (cm)
SQL-query:
Resulting diagram:
Daily max temp of over 20 degrees Celsius
SQL-query:
Resulting diagram:
Daily min temp of under -20 degrees Celsius
SQL-query:
Resulting diagram:
Days when precipitation happened, i.e. when it rained or snowed (mm)
SQL-query:
Resulting diagram: