Week 01 (W46 Nov16) Global Climate Dataset

Summary

This week we downloaded and explored our datasets. We mainly focused on our primary dataset, the Global Climate Daily Dataset. We examined the features and their types, and worked out efficient ways to collect the data, organize them in data structures, pre-process them, and visualize them. In particular, we examined a major feature of our dataset (maximum temperature), identified problems, and devised transformations that solve some of them, verifying this by visualizing the feature before and after the transformations. Additionally, we filtered the data to try to fit an appropriate distribution.

Dataset Stats

Global Climate Data (GCD) : Main Dataset

  • Number of files: 100,791
  • Format: .dly files (fixed-width ASCII text)
  • Size: 26.5 GB
  • Features: 46
  • Source Date: 1763 - 2015

World Bank (WB) : Complementary Dataset

  • Number of files: 1
  • Format: .csv
  • Size: ~15 MB
  • Features: 82
  • Source Date: 1960 - 2015

Feature Types

Core Features

GCD Dataset

  • ID : Nominal (station identification code)

  • YEAR : DateTime (year of the record)

  • MONTH : DateTime (month of the record)

  • PRCP : ratio-scaled (Precipitation (tenths of mm))

  • SNOW : ratio-scaled (Snowfall (mm))

  • SNWD : ratio-scaled (Snow depth (mm))

  • TMAX : interval-scaled (Maximum temperature (tenths of degrees C))

  • TMIN : interval-scaled (Minimum temperature (tenths of degrees C))

  • MFLAG : Nominal, Integer (measurement flag)

  • QFLAG : Nominal, Integer (quality flag)

  • SFLAG : Nominal, Integer (source flag)
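
Several of these elements are stored in tenths of their physical unit. A minimal sketch of a unit-conversion helper (the mapping ELEMENT_SCALE and the function name to_physical_units are our own, not part of the dataset):

```python
# Sketch: scale raw GHCN-Daily integers to physical units.
# Scale factors follow the unit notes in the feature list above.
ELEMENT_SCALE = {
    "PRCP": 0.1,  # tenths of mm    -> mm
    "TMAX": 0.1,  # tenths of deg C -> deg C
    "TMIN": 0.1,  # tenths of deg C -> deg C
    "SNOW": 1.0,  # already mm
    "SNWD": 1.0,  # already mm
}

def to_physical_units(element, value):
    """Return the reading in physical units, or None for the -9999 missing marker."""
    if value == -9999:
        return None
    return value * ELEMENT_SCALE[element]
```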

WB Dataset

  • Agricultural land (sq. km) : float
  • Forest area (sq. km) : float
  • Electricity production from oil sources (% of total) : ratio-scaled
  • Renewable electricity output (% of total electricity output) : ratio-scaled
  • Electricity production from renewable sources, excluding hydroelectric (kWh) : float
  • Electricity production from renewable sources, excluding hydroelectric (% of total) : ratio-scaled
  • Renewable energy consumption (% of total final energy consumption) : ratio-scaled
  • Energy use (kg of oil equivalent) per $1,000 GDP (constant 2011 PPP) : float
  • Electric power consumption (kWh per capita) : float
  • Energy use (kg of oil equivalent per capita) : float
  • CO2 emissions (kt) : float
  • Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent) : float
  • Annual freshwater withdrawals, total (billion cubic meters) : float
  • Terrestrial protected areas (% of total land area) : ratio-scaled
  • Population growth (annual %) : ratio-scaled

The rest of the features also belong to the Integer, Nominal, interval-scaled, and ratio-scaled types. We do not list them here to limit the size of the report, but we will draw on them in future analysis.

Data Structure and Collection

Our dataset consists of many .dly files. Each .dly file contains data for one station. The name of the file corresponds to a station's identification code. For example, "USC00026481.dly" contains the data for the station with the identification code USC00026481.

In order to make use of our dataset we decided to merge the .dly files into one using cat * > merged_dataset.dly. Using data frames, every entry then contains all the measurements of a station for all the available dates. Additionally, we wrote a script to download the .dly file of any desired station directly from the source via FTP, as sketched below.
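
A minimal sketch of that download script, assuming the GHCN-Daily FTP layout ftp.ncdc.noaa.gov:/pub/data/ghcn/daily/all/<ID>.dly (the server path and the function name download_station are our assumptions for illustration):

```python
from ftplib import FTP

def download_station(station_id, dest_dir="."):
    """Fetch the .dly file for one station and store it locally."""
    filename = station_id + ".dly"
    ftp = FTP("ftp.ncdc.noaa.gov")
    ftp.login()  # anonymous login
    ftp.cwd("/pub/data/ghcn/daily/all")
    with open(dest_dir + "/" + filename, "wb") as f:
        ftp.retrbinary("RETR " + filename, f.write)
    ftp.quit()

download_station("USC00026481")  # the example station from above
```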

The format of a typical entry looks like the following: 'X1': 'CA002303986198503PRCP 90 C 0 C 0 C 0 C 4 C 0 C 2 C 8 C 0 C 80 C 0 C 0 C 0 C 0 C 0 C 12 C 122 C 186 C 0 C 0 C 0T C 0 C 0 C 57 C 13 C 0 C 0 C 0 C 0 C 26 C 0 C'

where CA002303986198503PRCP is the record header. The first two characters declare the country (here "CA", Canada), the next nine characters are the station-specific part of the ID (so the full station ID spans 11 characters), the next four give the year of the record, the next two the month, and the last four the element measured. These headers repeat for all the available features, and each record carries 31 day slots whether (!) measurements were taken on those days or not. The rest of the columns contain the values and flags: each day has the fixed-width format VVVVVF1F2F3, a five-character value followed by the three one-character flags (which may be blank).
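
A sketch of slicing the header out of such a record, following the fixed-width layout just described (the function name parse_header is our own):

```python
def parse_header(line):
    """Split the first 21 characters of a .dly record into its fields."""
    return {
        "station_id": line[0:11],   # first 2 chars are the country code, e.g. 'CA'
        "year":       int(line[11:15]),
        "month":      int(line[15:17]),
        "element":    line[17:21],  # e.g. 'PRCP', 'TMAX'
    }

# parse_header("CA002303986198503PRCP ...") ->
# {'station_id': 'CA002303986', 'year': 1985, 'month': 3, 'element': 'PRCP'}
```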

Instance Related Issues

  1. There are missing values for many days, and there are missing values for whole meteorological elements. Missing values are marked with -9999 and therefore show up as obvious outliers in our data.
  2. Flags are not important for our purposes, as they contain information about the conditions and quality of the measurements, so they need to be filtered out. They can be digits, letters, or blank if absent. An issue arises when a flag is a digit (e.g. 0, 6, 7): it can be confused with the actual element values and lead to faulty or noisy data. The fixed-width parsing sketched after this list avoids that conflict.
  3. Each record contains values for every day of the month. However, there are always 31 slots, and depending on the number of days the month actually has, the trailing slots are missing. This creates an issue in data structures and corrupts the actual date representation in the visualization.
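
Because the format is fixed-width (each day occupies 8 characters: a 5-character value plus 3 one-character flags), slicing by position separates values from flags even when a flag happens to be a digit, which a whitespace- or regex-based split cannot guarantee. A minimal sketch, assuming full-length record lines:

```python
def parse_days(line):
    """Return the 31 daily values of one .dly record; flags are dropped."""
    values = []
    for day in range(31):
        start = 21 + day * 8  # day fields begin right after the 21-character header
        values.append(int(line[start:start + 5]))  # VVVVV; missing days stay -9999
        # the flags would be line[start + 5:start + 8]; we ignore them here
    return values
```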

Processing

For the first week, to obtain a better view of our dataset, we decided to analyze and visualize one of the main features, the maximum temperature (TMAX). In particular, we examine this feature for all the measurements of the station with ID AGM00060425 across all years, months, and days. That station is located in Algeria, according to the first two letters of the ID ("AG"). We are using Python for all coding and processing this week. The format of one of the records in the data structure is, as mentioned above, the following: 'AGM00060425194302TAVG-9999 148H S 96H S-9999 -9999 -9999 -9999 -9999 93H S-9999 -9999 -9999 -9999 -9999 126H S-9999 -9999 -9999 -9999 -9999 133H S 79H S-9999 -9999 -9999 -9999 112H S 106H S-9999 -9999 -9999 \n'

Our goal is to isolate the date of each measurement and the measurement values, create a new data structure and visualize it. We executed the following steps:

  1. Extracted the years into an array: "[1957, 1957, 1957, 1967...]"

  2. Extracted the months into an array: "[1, 2, 2, 3...]"

  3. Estimated the number of days in each month of any year with Python's "calendar" library, and kept the day-of-month sequence in another array: "[1..31]"

  4. Concatenated the three arrays above and converted the datatype from string to datetime

  5. Extracted the maximum temperature values per day

  6. Filtered out the flags with the use of regex (regular expressions)

  7. Visualized the data [Figure 1]

  8. We observed that, due to the -9999 missing-value markers, the plot is not clear and provides no actual information. Thus, we replaced those markers with NaN (not a number) and re-visualized [Figure 2]. A condensed code sketch of steps 1-8 follows this list.

  9. We now observe several missing and scattered values, as well as a clustering of Tmax values between 100 and 450 (tenths of degrees C), which translates to between 10 and 45 degrees Celsius. This makes sense physically, as the station is located in Algeria, which is known for its high temperatures.

  10. Finally, we created a histogram of our data to estimate a possible distribution [Figure 3]. It suggests a cosine-like distribution over the months, which makes sense due to the seasonal cycle of temperature, with the exception of some high maxima.

  11. A diagram for a different station (AR000877500) over a two-year interval shows the cosine-like shape of the feature more clearly [Figure 6].
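
A condensed sketch of steps 1-8, reusing parse_header and parse_days from the sketches above; here records is assumed to hold the full TMAX record lines of station AGM00060425:

```python
import calendar
from datetime import datetime
import matplotlib.pyplot as plt

dates, tmax = [], []
for line in records:
    h = parse_header(line)
    # keep only as many day slots as the month really has (step 3)
    n_days = calendar.monthrange(h["year"], h["month"])[1]
    for day, value in enumerate(parse_days(line)[:n_days], start=1):
        dates.append(datetime(h["year"], h["month"], day))
        # replace the -9999 markers with NaN (step 8) and convert tenths to deg C
        tmax.append(float("nan") if value == -9999 else value / 10.0)

plt.plot(dates, tmax, ".", markersize=2)
plt.ylabel("Tmax (deg C)")
plt.show()
```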

Analysis of Our Approach

We had some issues at first with the large amount of data (~29 GB), particularly with the .dly format and how to process it. We decided to merge the files into one .dly file for future use. At the same time, we created a function to connect to the dataset server through FTP, download single .dly files directly, and process them.

The files contain rows of data in a fixed-width, column-oriented format rather than a delimited one. Thus we had to process each entry character-by-character along the columns, split out the important features, and place them into an array. Each such array of features contains the data of a whole month (always 31 values, independent of how many days the month actually has). Eventually we had an array of those arrays, containing all element values for all days/months/years; this array describes all the measurements of a particular station. At the same time we removed the alphabetic flags included in the measurements; the flags consisting of digits, however, we neglected, which could well cause faulty values in our data and diagrams. We converted the dates into Python's datetime datatype in order to process them better for our plots, and we used the Python calendar library to find the number of days in each month and delete the empty cells (the difference from 31) when necessary. This made our dataset more realistic and made plotting against datetime possible.

This week we realized that we will face challenges in visualizing this large amount of data in a way that actual information can be derived. Filtering, removing parts of the dataset, and selective data mining might be necessary to obtain clearer observations and results; otherwise, visualizing many thousands of data points yields little insight.

Datasets Merging

Our complementary World Bank (WB) dataset contains a combination of Integer, Nominal, interval-scaled, and ratio-scaled feature types. It contains useful information about the environment, pollution and emission statistics, population statistics, etc., which can be combined with our climate data to derive fruitful results and correlate effects. For example, the percentage of population growth can be correlated with increased CO2 emissions and, further, with the rise of average temperature. The WB dataset contains data per year and per country from 1960 to 2015; the GCD dataset contains data per day/month/year and per station/city/country from 1763 to 2016. Our steps for merging the two datasets are the following (a code sketch follows the list):

  1. Extract data and organize it country-wise for the years 1960-2015 from GCD
  2. Compute yearly average values of the GCD data
  3. Convert .dly to .csv
  4. Normalize missing values: GCD marks them as -9999, while WB leaves them blank
  5. Merge into one .csv file
  6. Filter out countries and years that contain limited data; further processing
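
A rough pandas sketch of these steps; the file and column names (gcd_daily.csv, world_bank.csv, country, year) are assumptions about our intermediate outputs, not fixed names:

```python
import numpy as np
import pandas as pd

# GCD, already converted from .dly to CSV (step 3); normalize -9999 to NaN (step 4)
gcd = pd.read_csv("gcd_daily.csv").replace(-9999, np.nan)
gcd_yearly = (gcd.groupby(["country", "year"], as_index=False)
                 .mean(numeric_only=True))  # step 2: yearly averages per country

wb = pd.read_csv("world_bank.csv")  # WB blanks are read as NaN automatically

merged = gcd_yearly.merge(wb, on=["country", "year"], how="inner")  # step 5
merged = merged.dropna(thresh=len(merged.columns) // 2)  # step 6: drop sparse rows
merged.to_csv("merged.csv", index=False)
```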

We visualized the Tmax feature of a station from Argentina (AR000877500), as well as the electricity production feature from the World Bank dataset for the same country, to make some comparisons. Below are the two histograms. It is clear that we need to convert the daily data to yearly data for an actual comparison, and also to align the missing values and time frames.

[Figure: Tmax histogram for station AR000877500]

[Figure: Electricity production histogram for Argentina (WB)]

Next Week Goals

  • Concatenate data structures for the GCD dataset
  • Create functions to manipulate the dataset and extract data easily
  • Solve open issues; figure out better structures to organize and condense the large amount of data
  • Cut down the large number of missing values; remove data that corrupt the image/distribution of our data
  • Explore and visualize more features of the datasets; attempt to fit distributions to the bulk of the data
  • Explore the option of filling in missing meteorological elements for locations and dates
  • Visualize the WB dataset
  • Attempt the merging of our datasets

Presentation Link

https://drive.google.com/open?id=1Jq2txTMBjMQblsXBrESbbv682Zdhsj8M0yRUB3EuySg

References

  1. Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
  2. Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ