Week 09 (W54 Jan25) Global Climate Dataset

1- Summary

This week we continued work on our prediction model and fixed some mistakes from the previous week. We followed three approaches: i) Predict emissions over the years using linear regression over a rolling window, to deal with concept drift. ii) Use PLSR (partial least squares regression) to predict the average temperature rise from emission rates. iii) Use an autoregression model for time series to predict temperature over the years. We split our data into training, test, and validation sets. For i) we used 3-fold validation, and for ii) 10-fold validation. The results are satisfactory, with small mean squared errors. The next step is to combine the three models to derive more accurate predictions and conclusions for specific locations.

2 - Dataset Stats

Global Climate Data (GCD) : Main Dataset

Number of files: 100,791
Format: .dly files (GHCN-Daily fixed-width text, one file per station)
Size: 26.5 GB
Features: 46
Source Date: 1763 - 2015

World Bank (WB) : Complementary Dataset

Number of files: 1
Format: .csv
Size: ~15 MB
Features: 82
Source Date: 1960 - 2015

3 - Prediction of emissions over the years

According to our analysis in previous weeks, linear regression seemed a good fit for modeling emissions over the years in most cases. Last week, however, our approach gave poor results because we trained on a wide span of years and tried to predict a long range as well. More specifically, we trained on 1960 to 2006 and tried to predict 2007-2016 with linear regression. The result was a very large mean squared error, because we did not take concept drift into account. This week we implemented linear regression (scikit-learn Python library) with a rolling window of 3 years. We also tried a 5-year window, but 3 years gave more accurate results. We applied 3-fold validation and predicted our test data point for 2016. The 2016 prediction of CO2 emissions for Brazil was very accurate: a relative error of 0.04% and a mean squared error of 67182.82.
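The rolling-window idea can be sketched as follows. This is a minimal illustration, not our actual pipeline: it uses a closed-form least-squares line in plain Python instead of scikit-learn, the function names (`fit_line`, `rolling_backtest`, `forecast_next`) are hypothetical, and the emissions series is made up rather than taken from the WB dataset.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rolling_backtest(years, values, window=3):
    """Predict every year from a line fitted to the `window` years before it."""
    predictions = {}
    for i in range(window, len(years)):
        intercept, slope = fit_line(years[i - window:i], values[i - window:i])
        predictions[years[i]] = intercept + slope * years[i]
    return predictions


def forecast_next(years, values, window=3):
    """Extrapolate one year past the end of the series."""
    intercept, slope = fit_line(years[-window:], values[-window:])
    return intercept + slope * (years[-1] + 1)


# Made-up emissions series in arbitrary units, NOT values from the WB dataset.
years = list(range(2008, 2016))
emissions = [100.0, 104.0, 109.0, 115.0, 118.0, 124.0, 131.0, 135.0]

for window in (3, 5):
    backtest = rolling_backtest(years, emissions, window)
    mse = sum((backtest[y] - v) ** 2
              for y, v in zip(years[window:], emissions[window:])) / len(backtest)
    print(f"window={window}: backtest MSE={mse:.2f}, "
          f"forecast for {years[-1] + 1}: {forecast_next(years, emissions, window):.1f}")
```

Because each fit only sees the most recent years, a change in the trend stops influencing the forecast after `window` years, which is how the rolling window limits the effect of concept drift.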

4 - Prediction of average temperature rise based on emissions

We employed PLSR (partial least squares regression) to predict the average temperature rise based on the distinct and total emissions. PLSR is a method for relating two data matrices, X and Y, by a linear multivariate model, but goes beyond traditional regression in that it also models the structure of X and Y. PLSR derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y [www.libpls.net]. We took the mean of the aggregated emissions over all years for a country and correlated it with the average temperature rise for that country, to quantify the effect of climate warming. We will also do the same for the emissions of all countries in a specific year and correlate them with a local temperature rise. We used 10-fold validation for this purpose. Our results for predicting the average temperature rise for two countries (Azerbaijan and Panama) are shown in the following graph. The fitted value is the PLSR prediction, while the observed value is the actual value from our data. PLSR gives us the total error based on the weights we get. The estimated mean squared prediction error of our PLSR model is 0.035.

5 - Prediction of country temperatures over the years

Finally, we attempted to predict the average local temperature directly, based on the temperature trend of previous years and independent of emissions. To predict the temperature of a country over the years we employed an autoregression model for time series. We trained on data from 1960 to 2015 and tried to validate with data from 2016. The results were satisfactory, with a mean squared error of 0.255. This model could serve as an extra validation point for our previous two models. More specifically, we attempted to predict the average temperature of India. We cleaned and structured the average temperature data for India up to 2015 and tried to predict the temperature for 2016 using an autoregression model for time series forecasting (AR from the statsmodels library).

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems. A regression model, such as linear regression, models an output value based on a linear combination of input values. For example:

yh = b0 + b1*X
Where yh is the prediction, b0 and b1 are coefficients found by optimizing the model on training data, and X is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables.

For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).
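To make the lag-variable idea concrete, here is a small sketch that fits an autoregression by ordinary least squares with NumPy. It is not the statsmodels AR model we actually used. The helper names (`fit_ar`, `predict_next`) are hypothetical, the demo series is generated from invented coefficients, and the sketch indexes the model as x(t) predicted from x(t-1) and x(t-2), which is an assumption about the indexing.

```python
import numpy as np


def fit_ar(series, lags=2):
    """Least-squares fit of x(t) = b0 + b1*x(t-1) + ... + bp*x(t-p)."""
    x = np.asarray(series, dtype=float)
    # Each row holds [1, x(t-1), ..., x(t-lags)] for one target x(t).
    rows = [np.concatenate(([1.0], x[t - lags:t][::-1]))
            for t in range(lags, len(x))]
    design = np.array(rows)
    targets = x[lags:]
    coefs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coefs


def predict_next(series, coefs):
    """One-step-ahead forecast from the most recent observations."""
    lags = len(coefs) - 1
    recent = np.asarray(series[-lags:], dtype=float)[::-1]
    return float(coefs[0] + coefs[1:] @ recent)


# Demo series generated from invented coefficients b0=0.5, b1=0.6, b2=-0.3.
demo = [1.0, 2.0]
for _ in range(18):
    demo.append(0.5 + 0.6 * demo[-1] - 0.3 * demo[-2])

coefs = fit_ar(demo, lags=2)
print("estimated b0, b1, b2:", np.round(coefs, 3))
print("next value forecast:", round(predict_next(demo, coefs), 3))
```

Since the demo series is noise-free, the fit recovers the generating coefficients; on real temperature data the estimates would only approximate the underlying dynamics.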

A line plot of the dataset, with example data:

To check whether there is autocorrelation in our time series, we plotted the temperature in year t on the x-axis against the temperature in the previous year (t-1) on the y-axis:

The observations cluster along the diagonal of the plot, which indicates a relationship between the temperatures of consecutive years.
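The visual check can be complemented with a number: the Pearson correlation between the series and a lagged copy of itself. The sketch below is illustrative only; `lag_correlation` is a hypothetical helper and the temperatures are made up, not the India data.

```python
import numpy as np


def lag_correlation(series, lag=1):
    """Pearson correlation between x(t) and x(t - lag)."""
    x = np.asarray(series, dtype=float)
    return float(np.corrcoef(x[lag:], x[:-lag])[0, 1])


# Made-up yearly average temperatures with a mild upward trend.
temps = [24.1, 24.3, 24.2, 24.5, 24.6, 24.8, 24.7, 25.0]
print(f"lag-1 correlation: {lag_correlation(temps):.2f}")
```

A value near 1 corresponds to points lying along the diagonal of the lag plot, while a value near 0 would mean the previous year carries little information about the next one.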

Next, we divided our data into training and test sets: the test set contains the last 7 years (2009 to 2015) and the training set contains all other data.

The autoregression model gives a mean squared error of 0.255 on the test years:

year 2009 - predicted=24.519, expected=24.650
year 2010 - predicted=24.544, expected=24.406
year 2011 - predicted=24.481, expected=25.147
year 2012 - predicted=24.441, expected=25.051
year 2013 - predicted=24.467, expected=24.416
year 2014 - predicted=24.507, expected=24.641
year 2015 - predicted=24.459, expected=25.413
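The reported error can be reproduced from the seven predicted/expected pairs listed above. The short check below only restates those numbers and averages the squared differences.

```python
# Predicted and expected values for 2009-2015, copied from the list above.
predicted = [24.519, 24.544, 24.481, 24.441, 24.467, 24.507, 24.459]
expected = [24.650, 24.406, 25.147, 25.051, 24.416, 24.641, 25.413]

squared_errors = [(p - e) ** 2 for p, e in zip(predicted, expected)]
mse = sum(squared_errors) / len(squared_errors)
print(f"MSE = {mse:.3f}")  # prints MSE = 0.255
```

Most of the error comes from 2011, 2012 and 2015, where the observed temperature was well above the nearly flat prediction.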

The resulting graph:

According to this model the predicted average temperature for India in 2016 is 24.457 degrees Celsius.

6 - Overview

Our models gave satisfactory results. However, we made several assumptions about how emissions contribute to temperature rise, in particular about the geographical distance between an emissions source and the locations it affects. For example, we assume that China's emissions affect all countries uniformly. We also assume that the rising trend in emissions and temperature will continue as in previous years. These assumptions are necessary to build our models, although they may not be fully valid physically. We handle concept drift with a rolling-window regression, which gives lower errors in our tests than last week's approach without it. Next, we need to adapt the models to predict the impact (emissions, temperature rise) on specific locations, taking a distance parameter into account. Finally, we need to combine our models: the model predicting emissions over the years feeds the PLSR model, which predicts the average temperature rise from emission rates, and that prediction is then compared with the output of the model predicting temperature over the years. In addition, our final model will show which countries contribute most to a specific country's temperature rise (based on their emissions and their distance to that country).

7 - Next Week Goals

Divide into best and worst case scenario
Combine models and finalize prediction models

8 - Presentation Link

https://docs.google.com/presentation/d/1CG5PhNvzAhNSg_LmW62u7ZPe5yFdyQU20gcYnohhBxc/edit#slide=id.g20b4ecb7a4_0_0

References

Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ
WB Dataset - http://data.worldbank.org
Correlation Analysis - http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Multivariable/BS704_Multivariable5.html
Climate change impacts on Austrian ski areas, Robert Steiger & Bruno Abegg
HFCs? Curbing Them Is Key to Climate-Change Strategy (Op-Ed), Hallie Kennan, Energy Innovation: Policy and Technology
How do we know more CO2 is causing warming?
Does CO2 always correlate with temperature (and if not, why not?)
Earth itself is telling us there’s nothing to worry about in doubled, or even quadrupled, atmospheric CO2
China Exports Pollution to U.S., Study Finds
