Week 02 (W47 Nov16) Global Climate Dataset - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

1 - Summary

This week we dug deeper into our datasets. First, we created functions to extract and manipulate our data quickly and efficiently, which gave us better control over the data. We then explored more features and used various visualization methods to better understand the data and to find patterns, outliers and possibly faulty values. Additionally, we quantified the missing values in our datasets and worked out ways to fill them. We applied transformations to convert the data into a format that can be correlated with features from other datasets, and we merged smaller datasets into one to gather collective information. In particular, besides the features examined in the previous week (minimum and maximum temperature), we explored precipitation. We also decided to focus our analysis on India, as it is a country with large climatic variation where the effect of climate change is apparent. We picked datasets from the 7 biggest Indian cities and examined them both separately and merged into one. At the same time, we extended our research to the World Bank dataset: we analysed it and extracted the information related to India to prepare the ground for feature correlation and prediction. Finally, we applied PCA to 3 core features of Delhi (Tmin, Tmax and PRCP).

2 - Dataset Stats (Reminder)

Global Climate Data (GCD) : Main Dataset

Number of files: 100,791
Format: .dly files (fixed-width text records of GHCN-Daily)
Size: 26.5 GB
Features: 46
Source Date: 1763 - 2015

World Bank (WB) : Complementary Dataset

Number of files: 1
Format: .csv
Size: ~15 MB
Features: 82
Source Date: 1960 - 2015

3 - Goals Achieved This Week

Created functions to process and parse data efficiently
Improved visualizations about minimum and maximum Temperatures (Tmin, Tmax)
Explored Precipitation feature of GCD
Quantified missing values in both datasets (GCD and WB)
Filled missing values
Extracted datasets related to India from GCD and WB and transformed them so that they can be correlated
Explored and visualized features of WB
Applied data analysis methods to 3 core features of GCD

4 - Pre-Processing Methodology

Our dataset takes the form of multiple .dly files. Even though these files can be read as text documents, their format is unstructured, so we process them character by character according to column position. The format of each entry is explained in the previous week's Wiki. Here we briefly explain the methodology we follow to turn this unstructured format into a structured one. The steps we follow in Python:

Function to download dataset and then read content
Iterate over each dataset entry and process it based on the following character positions (0-based, inclusive):

    Variable   Columns   Type
    ID         0-10      Character
    YEAR       11-14     Integer
    MONTH      15-16     Integer
    ELEMENT    17-20     Character
    VALUE1     21-25     Integer
    MFLAG1     26-26     Character
    QFLAG1     27-27     Character
    SFLAG1     28-28     Character
    VALUE2     29-33     Integer
    MFLAG2     34-34     Character
    QFLAG2     35-35     Character
    SFLAG2     36-36     Character
    ...        ...       ...
    VALUE31    261-265   Integer
    MFLAG31    266-266   Character
    QFLAG31    267-267   Character
    SFLAG31    268-268   Character

We split each row based on spaces into an array, remove flags, replace missing values (-9999) with NaN
We create a DataFrame that contains the features indexed with Datetimes (Year-Month-Day)
Finally we have functions that extract specific features and create a new DataFrame with them yearwise or monthwise
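
The parsing steps above can be sketched as follows. This is a minimal sketch, not our exact code: the function names `parse_dly_line` and `dly_to_frame` are hypothetical, the download step is left out, and the sketch slices the fixed character positions from the table instead of splitting on spaces, because flag characters may be blank.

```python
import numpy as np
import pandas as pd

MISSING = -9999  # sentinel used in the .dly files for a missing value


def parse_dly_line(line):
    """Parse one fixed-width .dly record (hypothetical helper)."""
    station = line[0:11]
    year = int(line[11:15])
    month = int(line[15:17])
    element = line[17:21]
    values = []
    for day in range(31):
        start = 21 + 8 * day  # 5 value characters followed by 3 flag characters
        raw = int(line[start:start + 5])
        values.append(np.nan if raw == MISSING else float(raw))
    return station, year, month, element, values


def dly_to_frame(lines):
    """Build a date-indexed DataFrame with one column per element."""
    records = {}
    for line in lines:
        _, year, month, element, values = parse_dly_line(line)
        for day, value in enumerate(values, start=1):
            try:
                date = pd.Timestamp(year=year, month=month, day=day)
            except ValueError:
                continue  # day does not exist in this month, e.g. 31 February
            records.setdefault(date, {})[element] = value
    return pd.DataFrame.from_dict(records, orient="index").sort_index()
```

Flags are simply skipped here, which matches the step of removing them; a stricter version could drop values whose quality flag is set.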

5 - GCD Findings

Minimum and Maximum Temperature (Tmin, Tmax) for Delhi, India

This week we re-visualized Tmin and Tmax year-wise to verify the trend of rising temperatures. This time the results come from weather stations in Delhi, India. The first graph shows Tmin and the second Tmax, both year-wise. We observe a slight rise in both minimum and maximum temperature over the years. However, the steep peak in the early years is not fully reliable because of the large number of missing values in that period. To estimate the year-wise temperatures, we computed the average per month from the daily values that are available and then converted the monthly averages to a yearly average. For months or years where most of the values are missing, the error of our estimates grows.

Figure: Minimum temperature year-wise for Delhi (tenths of degrees Celsius)

Figure: Maximum temperature year-wise for Delhi (tenths of degrees Celsius)
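
The two-stage averaging used for these yearly values (days to months, then months to years) could look roughly like this. It is a sketch under the assumption that the daily values are held in a pandas Series indexed by date; the function name `yearly_mean_from_daily` is hypothetical.

```python
import pandas as pd


def yearly_mean_from_daily(daily):
    """Average the available days per month, then average the months per year."""
    # mean() ignores NaN, so each monthly value uses only the existing days.
    monthly = daily.groupby([daily.index.year, daily.index.month]).mean()
    monthly.index.names = ["year", "month"]
    # Months without any data stay NaN and are skipped in the yearly mean.
    return monthly.groupby(level="year").mean()
```

Averaging months first gives every month the same weight, so a month with a single surviving measurement counts as much as a complete one. This is one reason the error grows in years with many gaps.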

Further, we split the Tmax graph into four separate graphs, each representing the maximum temperatures of one season. Missing values are represented by NaN and appear in the graph as gaps where the line is discontinuous. The graph does not show a clear trend towards shrinking seasons, but the spring, summer and autumn temperatures seem to approach each other over the years.

Figure: Maximum temperature season-wise for Delhi (tenths of degrees Celsius)
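
A season-wise split of this kind can be sketched as below. The month-to-season mapping is an assumption (meteorological seasons of the northern hemisphere), not necessarily the one used for the figure, and `seasonal_means` is a hypothetical name.

```python
import pandas as pd

# Assumed season definition (meteorological seasons, northern hemisphere).
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_means(daily):
    """Return a year x season table of mean values from a date-indexed Series."""
    frame = pd.DataFrame({
        "value": daily.to_numpy(),
        "year": daily.index.year,
        "season": [SEASON_OF_MONTH[m] for m in daily.index.month],
    })
    # mean() skips NaN; a season with no data at all stays NaN in the table.
    # Note: December is grouped with the winter of its own calendar year.
    return frame.groupby(["year", "season"])["value"].mean().unstack("season")
```

Plotting each column of the returned table gives one line per season, with gaps wherever a season has no data.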

Precipitation (PRCP) for Delhi, India

Based on the following figures, there is an apparent trend of increasing precipitation. At first sight there could be a correlation between precipitation and temperatures.

Figure: Precipitation year-wise for Delhi (tenths of mm)

Figure: Precipitation season-wise for Delhi (tenths of mm)

The rain cycle looks regular in the early years, but after 1980 it becomes distorted, as rainfall has also increased in seasons other than the monsoon.

Merging Datasets from Cities of India

Our climate data come from various weather stations. Their ID, longitude and latitude let us pinpoint the exact location of each station. To represent India we picked weather stations from the 7 biggest cities, located in different parts of the country. We collected their data and merged them. The final dataset contains data for these 7 cities from 1901 to 2016. The chosen cities are marked with red circles on the map below. The cities and their populations are the following:

Mumbai (Bombay) - 16,368,000
Kolkata (Calcutta) - 13,217,000
Delhi - 12,791,000
Chennai - 6,425,000
Bangalore - 5,687,000
Hyderabad - 5,534,000
Ahmadabad - 4,519,000

Figure: Map of India. The chosen cities are marked with red circles

Missing values for India of GCD

One of our goals this week was to quantify the missing values in our dataset. For the merged dataset of the 7 cities, missing values account for 43.87% of all values. The feature matrix showing the missing values is given below.

Figure: Missing values in matrix of features for Indian cities in GCD
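
Quantifying missing values of this kind takes only a few lines once the data sit in a DataFrame with NaN for missing cells. The following is a minimal sketch; the function name `missing_share` is hypothetical.

```python
import pandas as pd


def missing_share(frame):
    """Return the percentage of NaN cells overall and per column."""
    is_missing = frame.isna()
    total = 100.0 * is_missing.to_numpy().mean()  # share over all cells
    per_column = 100.0 * is_missing.mean()        # share per feature
    return total, per_column
```

The per-column result shows which features drive the overall percentage, which helps when deciding which features are worth filling.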

6 - WB Dataset Findings

The WB dataset contains important features about pollution, emissions and other factors that can be correlated with our climate data. We converted our GCD data to yearly values, since the WB data are only available yearly for each country. This week we also explored this dataset and tried to understand it better through visualizations. The dataset contains 4560 values in total, of which 49.2% are missing and stored as blank cells. Below is the feature matrix showing the missing values; dark parts represent existing values.

Figure: Missing values in matrix of features for WB dataset

Next, we picked some specific features to visualize: the CO2 emissions and the electricity production sources in India. The graph below shows that India's electricity production depends heavily on coal, which has severe effects on the environment. We also attempted to fill in the missing values for electricity production. Our interpolation attempts failed, so for now we only applied forward filling, which repeats the previous available value into the next missing one, followed by backward filling for the values that remained missing. We plan to apply more accurate methods, such as spline and quadratic interpolation, next week.

Figure: Electricity production sources for India yearly
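
Forward filling followed by backward filling can be sketched with pandas as below. The function name and the sample values are illustrative only, not taken from the WB data.

```python
import numpy as np
import pandas as pd


def fill_forward_then_backward(series):
    """Repeat the last known value forward, then fill leading gaps backward."""
    return series.ffill().bfill()


# Illustrative values: a leading gap, an inner gap and a trailing gap.
example = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
filled = fill_forward_then_backward(example)
print(filled.tolist())  # [1.0, 1.0, 1.0, 3.0, 3.0]
```

The backward pass only affects values at the start of the series, because the forward pass has already filled every gap that has an earlier value.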

Further, we visualized the CO2 emissions of India and compared them with the global emissions. The two graphs follow the same trend: emissions increased almost exponentially over the years and have tended to stabilize in recent years, both globally and in India.

Figure: CO2 emissions globally and in India yearly

7 - Data Analysis

We selected 3 core features of the GCD dataset (Tmin, Tmax and PRCP) for Delhi and applied some data analysis methods to them. We present our results in this section. First we visualized our feature matrix to examine its sparsity. The feature matrix contains daily values of the three features (Tmin, Tmax and PRCP) for all available years. The figure below is the sparse representation of our matrix. It shows that 50205 values are present out of 25568 × 3 = 76704 in total, which translates to 65.45% of available data.

Figure: Sparsity matrix for daily data of Delhi (3 features - Tmin, Tmax, PRCP)

Because of the large amount of data, and for usability reasons, we converted the daily data into yearly averages for each feature. We are aware that this decreases the accuracy of our measurements because of the large number of missing values. However, yearly values condense the data and are easier to correlate with other datasets. Next we present a box plot showing how the values of precipitation, high temperature and low temperature are distributed around the mean over the years.

Figure: Box plot of 3 features with yearly data (PRCP, Tmax/HT, Tmin/LT) for Delhi

By applying PCA we created the following graph. This kind of graph is useful when the first two principal components do not explain enough of the variance in the data, so it includes all 3 principal components. Each of the three variables is represented in this triplot by a vector, and the direction and length of the vector indicate how the variable contributes to the principal components in the plot. For example, the first principal component, on the horizontal axis, has positive coefficients for all three variables. The largest coefficients in the first principal component correspond to the variables low temperature and precipitation.

Figure: 3D PCA for yearly data in Delhi

To examine the explained variance we created a Pareto chart. This scree plot shows only the first 2 (of the total 3) components, which explain 95% of the total variance. The only clear break in the amount of variance accounted for by each component is between the first and second components. However, the first component by itself explains less than 40% of the variance, so more components might be needed. The first principal components explain roughly two-thirds of the total variability in the standardized data, so keeping them might be a reasonable way to reduce the dimensions. Below the Pareto chart, a figure of the scores of the two components is also shown.

Figure: Pareto chart of two components for Delhi

Figure: Scores of two components for Delhi
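
For reference, the quantities discussed in this section (component coefficients, scores and explained variance) can be computed with a few lines of NumPy. This is a generic sketch of PCA on standardized features via the singular value decomposition, not the exact tool chain behind the figures; the function name `pca` is hypothetical, and rows with missing values are assumed to have been dropped or filled beforehand.

```python
import numpy as np


def pca(X):
    """PCA of X (rows: years, columns: features) on standardized columns."""
    # Standardize so that features with different units are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt.T                  # column j holds the coefficients of PC j
    scores = U * S                     # coordinates of each row in PC space
    explained = S**2 / np.sum(S**2)    # share of variance per component
    return components, scores, explained
```

The cumulative sum of `explained` is what a Pareto chart displays, and the first two columns of `scores` are what a two-component score plot shows.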

8 - Next Week Goals

Correlate data from the two datasets
Apply interpolation to missing values
Extend PCA, spectrum analysis

9 - Presentation Link

https://docs.google.com/presentation/d/1FTVLqrrU1XgHuw-flmaY73AEf2renMzCwAMmN15Va3A/edit#slide=id.p

References

Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ
http://data.worldbank.org
https://www.co2.earth/global-co2-emissions