Week 07 (W02 Jan11) Terrorism Database - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Summary & Index:
This week we started the predictive phase of our work. The main conclusions are:
- A preliminary experiment in predicting the number of victims achieved good results.
- In the future, it is also important to know how to predict the terrorist group.
Wiki Status - Concluded
Index
### 1 - Descriptive Phase Conclusions
### 2 - Predictive Tasks
### 3 - Weekly Presentation
### 4 - Perceived Feedback
Weekly work
1 - Descriptive Phase Conclusions
During the descriptive analysis of the data, we reached some conclusions that we should bear in mind during the next phases of our work. The main conclusions are:
- Terrorism has been increasing in recent years; however, its characteristics vary across space and time.
- Despite the high number of terrorist groups present in the dataset, a subset of them is responsible for the large majority of the attacks, so only this subset should be considered in future analysis.
- Some attacks have particular characteristics, for instance when kidnapping is involved.
- The majority of the text fields in the dataset are redundant with the numeric fields, so a text mining analysis is not worthwhile.
- Finally, the final analysis does not use the whole dataset. We excluded the attacks for which there is doubt about whether they were terrorism. We also excluded the variables with a high number of missing values and the variables with unstructured text fields.
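The exclusion rules in the last point can be sketched as a small filter. This is only an illustration: the flag name `doubtterr`, the 50% missing-value threshold, and the helper `filter_dataset` are assumptions, not values taken from our actual pipeline.

```python
def filter_dataset(rows, doubt_field="doubtterr", max_missing=0.5):
    """Drop doubtful attacks, then drop columns that are mostly missing.

    `doubt_field` and `max_missing` are hypothetical defaults.
    """
    # Keep only attacks with no recorded doubt about being terrorism.
    kept = [r for r in rows if r.get(doubt_field) == 0]
    if not kept:
        return []
    columns = sorted({c for r in kept for c in r})
    # Fraction of missing (None) values per column, over the kept rows.
    missing = {
        c: sum(r.get(c) is None for r in kept) / len(kept) for c in columns
    }
    keep_cols = [c for c in columns if missing[c] <= max_missing]
    return [{c: r.get(c) for c in keep_cols} for r in kept]


# Toy rows, invented for illustration only.
rows = [
    {"doubtterr": 0, "nkill": 2, "nperps": None, "attacktype": 3},
    {"doubtterr": 1, "nkill": 0, "nperps": 4, "attacktype": 2},
    {"doubtterr": 0, "nkill": 1, "nperps": None, "attacktype": 3},
]
clean = filter_dataset(rows)
```

In this toy input the second row is dropped because of the doubt flag, and `nperps` is dropped because it is missing in every remaining row.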
2 - Predictive Tasks
After considering the descriptive analysis, the group decided that the most promising predictive tasks are:
- Predict the terrorist group based on the attack characteristics. This is a very useful predictive task because sometimes no group claims a terrorist attack (e.g. a high number of terrorist attacks have no known author). It can also happen that a terrorist group falsely claims an attack, and in this case we can estimate the probability that the claim is true.
- Predict the number of victims of an attack, based on the type of attack, the target of the attack, and the weapons used. This can help civil protection allocate resources better when a terrorist attack occurs.
- Predict the cities where terrorist attacks can occur. In this case, the group needs to join the current dataset with another one containing demographic data. Based on the demographic characteristics of previous attacks, it is possible to predict new ones. This task is also useful for governments that want to take preventive measures.
2.1 Prediction of Number of Victims
We used a Linear Regression model to predict the number of victims of an attack. Training and prediction are based on the variables that can be used to define an attack: 'iyear', 'extended', 'suicide', 'nperps', 'ishostkid', 'nhostkid', 'weaptype', 'attacktype', 'targtype'.
We tried to predict nkill, nwound, and nkill+nwound.
Total entries: 132544
2.1.1 Methodology
To get a first assessment of how well the number of killed/wounded can be predicted, we fitted a linear model (Linear Regression) with several categorical and numeric variables (e.g. year of the attack, success of the attack, whether it was a suicide attack, weapon type, target type, attack type). Throughout these experiments, we used 10-fold cross-validation.
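As a rough illustration of the evaluation scheme, the sketch below runs 10-fold cross-validation around a least-squares fit. It is deliberately simplified and is not the code we ran: it uses a single numeric predictor instead of the full set of categorical and numeric variables, and the helper names (`fit_line`, `cross_val_predict`) and the toy data are invented for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def cross_val_predict(xs, ys, k=10, seed=0):
    """Return one out-of-fold prediction per sample using k-fold CV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into k roughly equal folds.
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit on the other k-1 folds, predict on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = a + b * xs[i]
    return preds


# Toy data where y depends exactly linearly on x.
xs = [float(i) for i in range(50)]
ys = [2.0 + 3.0 * x for x in xs]
preds = cross_val_predict(xs, ys)
```

Every sample is predicted by a model that never saw it during fitting, so the errors computed from `preds` estimate performance on unseen attacks.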
2.1.2 Predicting nkill
### Frequency of number of kills
As fig. 1 shows, the majority of the attacks have at most 1 kill.
Figure 1 - Frequencies of the number of kills
### Classification of errors
We computed the following error metric:
error = predicted_value - actual_value
In fig. 2 we can see the distribution of the error.
Figure 2 - Frequencies of the error in the number of kills prediction task.
From these two charts, we concluded that we can predict nkill to within ±5 for 91.66% of the attacks (the median error was 1.27). Despite this good result, we should take into account that the majority of the attacks have a low number of victims, so low predicted values are to be expected.
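The error metric and the "within ±5" share can be computed as in the sketch below. The function name `error_summary` and the toy numbers are illustrative. The text above does not say whether the reported median is taken over signed or absolute errors; this sketch assumes absolute errors.

```python
from statistics import median


def error_summary(predicted, actual, tolerance=5):
    """Signed errors, share of predictions within +-tolerance, median |error|."""
    # error = predicted_value - actual_value, as defined above.
    errors = [p - a for p, a in zip(predicted, actual)]
    share_within = sum(abs(e) <= tolerance for e in errors) / len(errors)
    # Assumption: the median is taken over absolute errors.
    median_abs = median([abs(e) for e in errors])
    return {
        "errors": errors,
        "share_within": share_within,
        "median_abs_error": median_abs,
    }


# Toy values, invented for illustration only.
predicted = [1.2, 0.4, 9.0, 3.0]
actual = [0, 1, 2, 3]
summary = error_summary(predicted, actual)
```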
### Error variation depending on number of kills
We also wanted to check whether the error was constant or depended on whether the incident had a higher or lower number of victims. In fig. 3 we can see that the error increases as the number of victims increases.
Figure 3 - Plot of error vs actual value with outliers
To understand this relation better, we removed some outlier values. Fig. 4 shows the resulting plot.
Figure 4 - Plot of error vs actual value without outliers for the number of kills prediction task.
Both plots show that, because of the high density of data at low numbers of victims, it is not possible to see clearly what is happening in this region. In the coming weeks, we should explore this further.
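One possible way to look into the dense low-count region without relying on a scatter plot is to group the errors by the actual number of kills and compare the mean absolute error per group. The sketch below is only a suggestion: the bucket edges and the helper name `mae_by_bucket` are assumptions, not choices we have made.

```python
def mae_by_bucket(predicted, actual, edges=(0, 1, 5, 20)):
    """Mean absolute error grouped by the actual number of victims.

    `edges` are the lower bounds of each bucket (hypothetical values);
    the last bucket is open-ended. Actual values must be >= edges[0].
    """
    sums = [0.0] * len(edges)
    counts = [0] * len(edges)
    for p, a in zip(predicted, actual):
        # Index of the last bucket whose lower bound is <= the actual value.
        b = max(i for i, e in enumerate(edges) if a >= e)
        sums[b] += abs(p - a)
        counts[b] += 1
    # Map each lower bound to (mean absolute error, number of attacks).
    return {
        edges[i]: (sums[i] / counts[i] if counts[i] else None, counts[i])
        for i in range(len(edges))
    }


# Toy values, invented for illustration only.
predicted = [0.5, 1.5, 2.0, 4.0, 10.0, 30.0]
actual = [0, 0, 1, 3, 8, 50]
by_bucket = mae_by_bucket(predicted, actual)
```

Reporting the count next to each mean makes it visible when a bucket's error rests on very few attacks.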
2.1.3 Predicting nwound
As with the number of kills, we tried to predict the number of wounded (nwound). We achieved similar results, as we can see in the following figures.
Figure 5 - Frequency of number of wounded
Figure 6 - Frequencies of the error in the number of wounded prediction task.
Figure 7 - Plot of error vs actual value for the wound number prediction task.
2.1.4 Predicting nkill + nwound
Finally, we also tried to predict the number of kills plus the number of wounded (nkill + nwound). We achieved similar results, as we can see in the following figures.
Figure 8 - Frequency of "number of kills + number of wounded"
Figure 9 - Frequencies of the error in the "number of kills + number of wounded" prediction task.
Figure 10 - Plot of error vs actual value for the "number of kills + number of wounded" prediction task.
2.1.5 Future improvements
- Currently, our prediction does not depend on the group that carried out the attack. We could add a group profile to the predictor variables.
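One simple form such a group profile could take is a set of indicator (one-hot) columns for the most active groups, with all remaining groups merged into a single category. This is a sketch of one option, not a decision: the helper `encode_groups`, the `top_n` cut-off, and the placeholder group names are all hypothetical.

```python
from collections import Counter


def encode_groups(groups, top_n=2, other="Other"):
    """One-hot encode the top_n most frequent groups; merge the rest."""
    counts = Counter(groups)
    top = [g for g, _ in counts.most_common(top_n)]
    categories = top + [other]
    rows = []
    for g in groups:
        label = g if g in top else other
        # One indicator column per category, in the order of `categories`.
        rows.append([1 if label == c else 0 for c in categories])
    return categories, rows


# Placeholder group names, invented for illustration only.
groups = ["Group A", "Group B", "Group A", "Group C", "Group A", "Group B"]
categories, encoded = encode_groups(groups)
```

When these columns are used in a linear model with an intercept, one indicator column is usually dropped to avoid perfect collinearity.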
3 - Weekly Presentation
https://docs.google.com/presentation/d/1NuD8CwMMkfFydBnP1C-p3W9Jke0yRPuM5eZ0STfF7DA/edit?usp=sharing
4 - Perceived Feedback
- The relative error can be more informative.
- Errors on attacks with fewer victims are worse.
- Use mean absolute error, mean square error, and relative absolute error, because the current presentation is confusing (at least put the bars next to each other).
- Give the performance on the different types of attacks.
- Transform this into a classification task (either it's the right number, or it's not): small attack, medium attack, big attack. We can generate one model for small attacks and another model for big attacks.
- Choose the task that sells the best
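To make the metric and classification suggestions concrete, here is a minimal sketch of the three error measures and of a size labelling. The relative absolute error is computed here as the total absolute error divided by the total absolute error of always predicting the mean. The class thresholds (`small_max`, `medium_max`) and the function names are assumptions chosen for illustration; the feedback did not specify them.

```python
def regression_metrics(predicted, actual):
    """Mean absolute error, mean square error, relative absolute error."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for p, a in zip(predicted, actual)]
    mae = sum(abs_err) / n
    mse = sum(e * e for e in abs_err) / n
    # Baseline: total absolute error of always predicting the mean.
    baseline = sum(abs(a - mean_actual) for a in actual)
    rae = sum(abs_err) / baseline if baseline else float("inf")
    return {"mae": mae, "mse": mse, "rae": rae}


def size_class(victims, small_max=1, medium_max=10):
    """Label an attack by size; the thresholds are hypothetical."""
    if victims <= small_max:
        return "small"
    if victims <= medium_max:
        return "medium"
    return "big"


# Toy values, invented for illustration only.
predicted = [1.0, 2.0, 6.0, 15.0]
actual = [0, 2, 4, 18]
metrics = regression_metrics(predicted, actual)
classes = [size_class(a) for a in actual]
```

A relative absolute error below 1 means the model does better than always predicting the mean number of victims.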