[GWA 010] Experimental Design - Neelam91/Global-Warming-Analysis GitHub Wiki
To conduct the analysis, I collected and stitched together datasets from various open-source websites. I then undertook extensive data cleanup and massaging to make the data suitable for statistical analysis. Once the data was in an acceptable form, I chose three analysis methods for this project.
Data collection:
Endogenous Variables: I used three endogenous variables in my study to evaluate the manifestation of global warming. The variables are as follows:
- Global Temperature dataset (measured in degrees Celsius), source: https://datahub.io/core/global-temp#resource-annual
- Global Sea Level rise dataset (measured in inches), source: https://datahub.io/core/sea-level-rise#data
- Arctic Sea Ice Melting dataset (measured as area in million square km), source: https://climate.nasa.gov/vital-signs/arctic-sea-ice/
Exogenous Variables:
- Agriculture Land data (measured in % of land area)
- Livestock Production dataset (reported as an index)
- Fossil Fuel consumption (measured in terawatt hours)
- CO2 emission (measured in parts per million, ppm)
- Greenhouse Gas emission (measured in emissions per capita)
- Global Renewable Energy share (measured in % share)
- Deforestation (measured in million cubic meters), source: compiled by Earth Policy Institute from U.N. Food and Agriculture Organization, FAOSTAT, electronic database, at faostat.fao.org, updated 23 July 2012
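The stitching step described above amounts to aligning each dataset on a shared year column. The following is a minimal sketch of that idea, not the code used in this project: the column names (`Year`, `temp_anomaly_c`, `sea_level_in`), the helper names, and the numeric values are all made-up placeholders, and only the standard library is used so the example stays self-contained.

```python
import csv
import io

# Placeholder CSV snippets standing in for two downloaded datasets.
# The numbers are invented for illustration and are not real measurements.
TEMPERATURE_CSV = """Year,temp_anomaly_c
2000,0.40
2001,0.54
2002,0.63
"""

SEA_LEVEL_CSV = """Year,sea_level_in
2001,8.0
2002,8.1
2003,8.3
"""


def read_by_year(text, value_column):
    """Parse CSV text into a {year: value} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["Year"]): float(row[value_column]) for row in reader}


def inner_join_on_year(**series):
    """Keep only the years present in every series and combine them row by row."""
    common_years = sorted(set.intersection(*(set(s) for s in series.values())))
    return [
        {"Year": year, **{name: values[year] for name, values in series.items()}}
        for year in common_years
    ]


merged = inner_join_on_year(
    temperature=read_by_year(TEMPERATURE_CSV, "temp_anomaly_c"),
    sea_level=read_by_year(SEA_LEVEL_CSV, "sea_level_in"),
)
for row in merged:
    print(row)
```

An inner join keeps only the years covered by every dataset, which avoids missing values at the cost of a shorter time range. With pandas, `DataFrame.merge` on the year column would express the same operation.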
Research Methods:
A. Hypothesis Testing: I assessed the statistical significance of my research hypothesis using a two-sample, two-tailed t-test and plotted a time series graph to understand the trend visually.
Null Hypothesis: There is no global warming.
Alternative Hypothesis: Global warming is happening.
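To make the test concrete, here is a minimal sketch of a two-sample, two-tailed t-test. It rests on several assumptions that are not stated on this page: that the two samples are temperature values from an earlier and a later period, that Welch's unequal-variance form is acceptable, and that the sample values (which are invented) resemble the real data. The p-value is obtained by numerically integrating the t density so that the example needs only the standard library; in practice a library routine such as `scipy.stats.ttest_ind` would normally be used.

```python
import math
import statistics


def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_norm = (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
    )
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def two_tailed_p(t_stat, dof, steps=2000):
    """Two-tailed p-value using Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_half = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_half)


def welch_t_test(sample_a, sample_b):
    """Return (t statistic, degrees of freedom, two-tailed p-value)."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / math.sqrt(
        var_a + var_b
    )
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof, two_tailed_p(t_stat, dof)


# Invented temperature anomalies (degrees Celsius) for two hypothetical periods.
early_period = [-0.20, -0.10, -0.15, -0.30, -0.05, -0.25]
late_period = [0.60, 0.70, 0.65, 0.80, 0.90, 0.75]

t_stat, dof, p_value = welch_t_test(late_period, early_period)
reject_null = p_value < 0.05
print(f"t = {t_stat:.2f}, dof = {dof:.1f}, reject null: {reject_null}")
```

A two-tailed test asks only whether the two period means differ. The sign of the t statistic then shows the direction of the difference.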
B. Multivariate Regression Analysis: I used a multivariate linear regression model to identify the significant factors driving global warming. The manifestation of global warming was assessed using the endogenous variables described above.
The analysis involved the following steps:
a) Feature Selection: A group of 7 variables was selected after an in-depth review of research on global warming. The variables were subsequently subjected to Recursive Feature Elimination to find the statistically significant factors. The statistical significance of each variable was also assessed by checking for p-values < 0.05.
b) Model Performance: Model performance was evaluated across two areas:
Under-fitting - Model under-fitting was checked by looking at the adjusted R-squared value of the regression model. The adjusted R-squared was 0.76, which indicates a good model fit.
Overfitting - Model overfitting was checked by comparing the model performance on the training set versus an independent test set. The root mean square error of the model was 0.13, which I took as an indication that the model is not overfitting.
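The two steps above can be sketched end to end as follows. This is an illustrative reconstruction under stated assumptions, not the project's actual code: the data are synthetic, the feature names are borrowed from the variable list only as labels, and the elimination rule (drop the feature with the smallest coefficient magnitude scaled by its standard deviation) is one common way to rank features. scikit-learn's `sklearn.feature_selection.RFE` provides a ready-made version of the elimination loop. The p-value check is not shown.

```python
import math
import random
import statistics


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    design = [[1.0] * len(y)] + columns
    k = len(design)
    xtx = [[sum(p * q for p, q in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(design[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta, columns):
    """Apply fitted coefficients to feature columns, row by row."""
    return [beta[0] + sum(b * col[i] for b, col in zip(beta[1:], columns))
            for i in range(len(columns[0]))]


def recursive_feature_elimination(features, y, n_keep):
    """Refit repeatedly, dropping the weakest feature until n_keep remain."""
    kept = list(features)
    while len(kept) > n_keep:
        beta = fit_ols([features[name] for name in kept], y)
        importance = {name: abs(b) * statistics.pstdev(features[name])
                      for name, b in zip(kept, beta[1:])}
        kept.remove(min(importance, key=importance.get))
    return kept


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def adjusted_r_squared(y_true, y_pred, n_features):
    """1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = n_features."""
    n = len(y_true)
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - n_features - 1)


# Synthetic data: the target depends on two of the three features only.
rng = random.Random(0)
n_rows = 60
features = {name: [rng.gauss(0, 1) for _ in range(n_rows)]
            for name in ("co2", "fossil_fuel", "livestock")}
target = [3 * c + 2 * f + rng.gauss(0, 0.1)
          for c, f in zip(features["co2"], features["fossil_fuel"])]

split = 45  # first 45 rows for training, last 15 held out for testing
train = {name: col[:split] for name, col in features.items()}
test = {name: col[split:] for name, col in features.items()}
y_train, y_test = target[:split], target[split:]

selected = recursive_feature_elimination(train, y_train, n_keep=2)
beta = fit_ols([train[name] for name in selected], y_train)
train_pred = predict(beta, [train[name] for name in selected])
test_pred = predict(beta, [test[name] for name in selected])

train_adj_r2 = adjusted_r_squared(y_train, train_pred, len(selected))
train_rmse = rmse(y_train, train_pred)
test_rmse = rmse(y_test, test_pred)
print(selected, round(train_adj_r2, 3), round(train_rmse, 3), round(test_rmse, 3))
```

A low adjusted R-squared on the training rows points to under-fitting, while a test RMSE much larger than the training RMSE points to overfitting. Comparing the two RMSE values is more informative than reading either one alone.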
C. Clustering model: I used a clustering model to identify the countries that are most responsible for global warming.
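This page does not say which clustering algorithm or which country-level features were used, so the following is only a sketch under assumptions: it uses k-means (Lloyd's algorithm) with two clusters, two hypothetical per-country indicators, placeholder country names, and invented values. The initial centroids are simply the first `k` points, which keeps the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm; returns (cluster label per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # deterministic initialisation
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical indicators per country: (emissions per capita, fossil fuel use).
# Names and values are placeholders, not real data.
countries = {
    "country_a": (1.0, 1.2),
    "country_b": (9.0, 8.5),
    "country_c": (1.1, 0.9),
    "country_d": (8.8, 9.1),
    "country_e": (0.9, 1.0),
    "country_f": (9.2, 8.9),
}

labels, centroids = k_means(list(countries.values()), k=2)
clusters = dict(zip(countries, labels))
print(clusters)
```

With real data the indicators would usually be scaled to comparable ranges first, since k-means relies on Euclidean distance and is dominated by whichever feature has the largest numeric spread.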
Technology Used: Python and Excel.