Week 08 (W03 Jan18) Terrorism Database - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Summary & Index:
This week we continued the predictive phase of our work. The main conclusions are:
- For our first prediction task, we provide a model that predicts with good accuracy whether a terrorist attack will cause fatalities. Predicting the precise number of victims, however, is a more challenging task.
- For our second prediction task, building a separate model for each region proved to be a good strategy. The accuracy of the resulting models is around 0.80.
- Relating terrorism to demographic data is a more difficult prediction task. We are still searching for a demographic dataset that fits our Terrorism Dataset.
Wiki Status - concluded
Index
- 1 - Number of kills prediction
- 2 - Terrorist group prediction
- 3 - Weekly Presentation
- 4 - Perceived Feedback
Weekly work
1 - Number of kills prediction
To start the task of predicting the number of kills in a terrorist attack, we checked the frequencies of this variable (Fig. 1). The majority of the attacks do not cause fatalities.
Figure 1 - Frequencies of the number of kills
When we tried to build a prediction model, we faced some problems:
- Many incidents have 0 or 1 fatalities. As a result, the model tended to always predict these values, which led to a large error.
- Few attacks have many victims (e.g. less than 1% of the attacks have more than 32 victims). We therefore consider attacks with more than 32 victims to be outliers and do not take them into account in our model.
To simplify the problem, we decided to divide it into two subtasks:
- Predict whether the attack has no victims (0 victims), a low number of victims [1;32[, or a high number of victims [32;inf[.
- Predict the number of victims when it falls in the low category [1;32[.
With this approach, we need to develop two models, one for each subtask as we can see in Fig. 2.
Figure 2 - Strategy for the number of kills prediction.
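The three classes of the first subtask can be derived from the raw kill count. The following is a minimal sketch of that binning, assuming the interval bounds given above; the function name `victim_category`, the label strings and the sample counts are hypothetical and only serve as illustration.

```python
def victim_category(n_kills, high_threshold=32):
    """Map a kill count to one of the three classes of subtask 1."""
    if n_kills < 0:
        raise ValueError("kill count cannot be negative")
    if n_kills == 0:
        return "no victims"
    if n_kills < high_threshold:
        return "low"   # interval [1;32[
    return "high"      # interval [32;inf[


# Hypothetical kill counts, for illustration only.
kills = [0, 1, 5, 31, 32, 120]
labels = [victim_category(k) for k in kills]
print(labels)
```

Only the rows labeled "low" would then be passed to the second (regression) model.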
We then started the first task.
1) Task 1: Are there fatalities?
As we can see in Fig. 3, the numbers of attacks "with no victims" and with a "low victim number" are similar, while only a minority of attacks have a "high victim number". Because of this, we aim to develop a model that performs well for the "with no victims" and "low victim number" classes. We consider attacks with more than 32 victims outliers.
Figure 3 - Frequencies of the number of kills
We built a model for this classification problem. The algorithm that achieved the best results was Random Forest. Its performance is shown in Fig. 4 and 5.
Figure 4 - Performance of random forest
Figure 5 - Confusion matrix with random forest
We can conclude that the resulting model solves this problem well (e.g. it has an accuracy of 80%). We should point out, however, that it does not perform well for attacks with a high number of victims. In our view this is not problematic, given the low frequency of this type of attack.
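Overall accuracy can hide poor performance on a rare class, which is why the confusion matrix matters here. The sketch below computes accuracy and per-class recall from a confusion matrix in plain Python. The matrix values are hypothetical (they are not the values from Fig. 5), and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def recall_per_class(cm):
    """For each true class (row), the fraction predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]


# Hypothetical confusion matrix: rows = true class, columns = predicted class.
classes = ["no victims", "low", "high"]
cm = [
    [80, 20, 0],
    [15, 84, 1],
    [1, 7, 2],
]

print(f"accuracy: {accuracy(cm):.3f}")
for name, rec in zip(classes, recall_per_class(cm)):
    print(f"recall[{name}]: {rec:.2f}")
```

In this made-up example the accuracy stays close to 0.8 even though most "high" attacks are misclassified, which mirrors the behaviour described above.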
2) Task 2: Predict the number of victims when it is a "low victim number"
For this task we tried different algorithms. In Fig. 4 we can see that linear regression was the one that performed best.
Figure 4 - Metrics obtained with Linear Regression, Multilayer Perceptron and RepTrees.
In Fig. 5 we can see that, although linear regression is the best-performing algorithm, the results are far from ideal.
Figure 5 - Comparison between Real value (x axis) and predicted value (y axis) using Linear Regression.
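To judge how far from ideal a regression model is, its errors can be compared with those of a trivial baseline that always predicts the mean. The following is a minimal sketch of two common regression metrics (MAE and RMSE) in plain Python. The sample values are hypothetical, and these are not necessarily the metrics reported in the figure above.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalises large misses more than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical kill counts in the "low" range and model predictions.
y_true = [1, 2, 1, 5, 12, 3]
y_pred = [2, 2, 2, 3, 6, 3]

# Baseline: always predict the mean of the observed values.
mean_value = sum(y_true) / len(y_true)
baseline = [mean_value] * len(y_true)

print(f"model    MAE={mae(y_true, y_pred):.2f} RMSE={rmse(y_true, y_pred):.2f}")
print(f"baseline MAE={mae(y_true, baseline):.2f} RMSE={rmse(y_true, baseline):.2f}")
```

A model is only useful for this subtask if it beats the baseline by a clear margin on held-out data.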
Next week, we will try to find a way to improve this model (e.g. feature extraction or selection).
2 - Terrorist group prediction
During this week we also started working on predicting the terrorist group responsible for an attack, given the characteristics of the attack. This prediction task has two motivations:
- Terrorist groups may not claim responsibility right away (or ever): our model can help identify the likely group before it claims responsibility so that, for example, the police can track down active cells of that group faster and catch potential perpetrators who are preparing to flee.
- Terrorist groups may falsely claim responsibility: our model can help detect groups that claim attacks they were not responsible for (which they may do for propaganda purposes) and help find the real perpetrators.
2.1 - Methodology Definition
For this prediction task, we dropped all attacks for which there was doubt about whether they should be considered terrorist attacks. We also manually selected the "real terrorist groups" using search engines (for example, "Palestine" should not be considered a terrorist group), and we kept terrorist groups that were marked as "Other".
The top 10 most frequent groups worldwide are represented in the table below:
Terrorist group | Incidents |
---|---|
Other | 14695 |
Taliban | 4762 |
Shining Path (SL) | 4134 |
Islamic State of Iraq and the Levant (ISIL) | 2372 |
Farabundo Marti National Liberation Front (FMLN) | 2129 |
Revolutionary Armed Forces of Colombia (FARC) | 2037 |
Basque Fatherland and Freedom (ETA) | 1903 |
Irish Republican Army (IRA) | 1884 |
Boko Haram | 1700 |
Communist Party of India - Maoist (CPI-Maoist) | 1612 |
During this prediction task we used the following features:
- Year of the attack (Numeric)
- Whether the attack was part of multiple attacks (Boolean)
- Whether the attack was successful (Boolean)
- Whether one or more perpetrators committed suicide (Boolean)
- Number of perpetrators (Numeric)
- Number of perpetrators captured (Numeric)
- Number of killings (Numeric)
- Number of wounded (Numeric)
- Number of hostages kidnapped (Numeric)
- Ransom amount demanded (Numeric)
- Ransom amount paid (Numeric)
- Number of released hostages (Numeric)
- Type of the attack (Categorical)
- Type of the target (Categorical)
- Nationality of the victims (Categorical)
- Type of the weapon (Categorical)
- Subtype of the weapon (Categorical)
The categorical features were encoded using one-hot encoding (meaning each category was transformed into a boolean variable).
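As an illustration of the encoding, here is a minimal one-hot encoder in plain Python. The function name `one_hot` and the sample attack types are illustrative assumptions; the tool actually used for the encoding is not stated here.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows


# Illustrative values for the "type of the attack" feature.
attack_types = ["Bombing", "Assassination", "Bombing", "Kidnapping"]
categories, encoded = one_hot(attack_types)
print(categories)
for row in encoded:
    print(row)
```

Each categorical feature with k distinct values becomes k boolean columns, so features with many values (such as nationality of the victims) widen the dataset considerably.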
To solve this prediction problem, we divided the dataset according to the geographic region of the attack (12 regions in total) and built a separate model for each regional sub-dataset. We modeled each one as a multi-class classification problem with 6 classes: the 5 terrorist groups with the most incidents in that region, plus one class ("Other") that includes all the other groups in that region.
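The per-region labeling can be sketched as follows. This is a hedged illustration, not the exact preprocessing we ran: the function name `relabel_top_groups` and the sample records are hypothetical, and excluding the pre-existing "Other" label from the ranking is an assumption.

```python
from collections import Counter, defaultdict


def relabel_top_groups(records, top_n=5, other_label="Other"):
    """records: list of (region, group) pairs.

    Keeps the top_n most frequent groups of each region and maps every
    remaining group of that region to other_label.
    """
    counts = defaultdict(Counter)
    for region, group in records:
        if group != other_label:  # assumption: "Other" is not ranked
            counts[region][group] += 1

    top = {region: {g for g, _ in c.most_common(top_n)}
           for region, c in counts.items()}

    return [(region, group if group in top.get(region, set()) else other_label)
            for region, group in records]


# Hypothetical records; top_n=2 keeps the example small.
records = (
    [("South Asia", "Taliban")] * 3
    + [("South Asia", "LTTE")] * 2
    + [("South Asia", "JVP")]
    + [("Western Europe", "ETA")]
)
relabeled = relabel_top_groups(records, top_n=2)
print(Counter(relabeled))
```

Because the ranking is computed per region, a group that is rare worldwide can still get its own class in the region where it is active.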
We then trained a Random Forest model (using a One-vs-All Multiclass Classifier) for each geographical region, using a 60% training, 20% validation, 20% test split.
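A 60/20/20 split can be sketched in plain Python as below. The function name `train_val_test_split` and the fixed seed are illustrative assumptions, and the sketch uses a simple random shuffle; whether the original split was random or stratified by class is not specified.

```python
import random


def train_val_test_split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle rows and cut them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    train_rows = shuffled[:n_train]
    val_rows = shuffled[n_train:n_train + n_val]
    test_rows = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train_rows, val_rows, test_rows


# Stand-in for the incidents of one region.
incidents = list(range(100))
train_rows, val_rows, test_rows = train_val_test_split(incidents)
print(len(train_rows), len(val_rows), len(test_rows))
```

The validation part is used to compare models and settings, while the test part stays untouched until the final evaluation.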
2.2 - Results
The accuracies over the Validation set are presented below:
Geographical region | Accuracy fraction | Accuracy value |
---|---|---|
Sub-Saharan Africa | 1040/1121 | 0.92774308653 |
Middle East & North Africa | 1488/1655 | 0.899093655589 |
South Asia | 2058/2323 | 0.885923374946 |
Central America & Caribbean | 681/770 | 0.884415584416 |
North America | 179/224 | 0.799107142857 |
Western Europe | 1141/1433 | 0.796231681786 |
Southeast Asia | 530/675 | 0.785185185185 |
South America | 1707/2177 | 0.784106568672 |
Australasia & Oceania | 9/12 | 0.75 |
Eastern Europe | 27/43 | 0.627906976744 |
Central Asia | 6/10 | 0.6 |
East Asia | 17/32 | 0.53125 |
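Because the regions differ widely in size (from 10 to 2323 validation instances), a single summary figure depends on how the regional accuracies are combined. The sketch below uses the counts from the table above and compares the unweighted mean of the regional accuracies with the pooled accuracy over all validation instances. The variable names are illustrative.

```python
# (correct, total) on the validation set, copied from the table above.
results = {
    "Sub-Saharan Africa": (1040, 1121),
    "Middle East & North Africa": (1488, 1655),
    "South Asia": (2058, 2323),
    "Central America & Caribbean": (681, 770),
    "North America": (179, 224),
    "Western Europe": (1141, 1433),
    "Southeast Asia": (530, 675),
    "South America": (1707, 2177),
    "Australasia & Oceania": (9, 12),
    "Eastern Europe": (27, 43),
    "Central Asia": (6, 10),
    "East Asia": (17, 32),
}

per_region = {region: c / t for region, (c, t) in results.items()}

# Unweighted mean: every region counts equally, however small.
macro = sum(per_region.values()) / len(per_region)

# Pooled: every validation instance counts equally.
pooled = (sum(c for c, _ in results.values())
          / sum(t for _, t in results.values()))

print(f"macro-average accuracy: {macro:.3f}")
print(f"pooled accuracy:        {pooled:.3f}")
```

The unweighted mean comes out at roughly 0.77 and the pooled value at roughly 0.85; the gap is due to the small regions, whose accuracies rest on very few instances.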
Given the high accuracy of the model when predicting terrorist groups in South Asia, as well as the high number of correctly predicted instances, we decided to study the model for this geographic region further.
2.2.1 - South Asia study case
First, it's important to analyze the frequencies of the different classes in this region, which are presented below:
Terrorist group | Incidents |
---|---|
Taliban | 4762 |
Other | 2790 |
Communist Party of India - Maoist (CPI-Maoist) | 1612 |
Liberation Tigers of Tamil Eelam (LTTE) | 1085 |
Tehrik-i-Taliban Pakistan (TTP) | 990 |
People's Liberation Front (JVP) | 377 |
This table shows that the classes "Taliban" and "Other" are the dominant ones. An analysis of the confusion matrix (in the figure below) reveals that the model's predictions are balanced across classes despite this class imbalance.
Figure 6 - Confusion matrix of the predicted group.
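One way to put the South Asia accuracy in context is to compare it with a majority-class baseline, i.e. a classifier that always predicts the most frequent group. The sketch below uses the incident counts from the table above. Note that these counts cover the whole region rather than only the validation set, so the comparison is approximate.

```python
# Incident counts for South Asia, copied from the table above.
class_counts = {
    "Taliban": 4762,
    "Other": 2790,
    "Communist Party of India - Maoist (CPI-Maoist)": 1612,
    "Liberation Tigers of Tamil Eelam (LTTE)": 1085,
    "Tehrik-i-Taliban Pakistan (TTP)": 990,
    "People's Liberation Front (JVP)": 377,
}

majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / sum(class_counts.values())

# Validation accuracy for South Asia from the results table (2058/2323).
model_accuracy = 2058 / 2323

print(f"majority class:    {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
```

Always predicting "Taliban" would be right for about 41% of the incidents, so the model's accuracy of about 0.89 cannot be explained by the class imbalance alone.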
3 - Weekly Presentation
https://docs.google.com/presentation/d/1uW9iSe1YXwRk-u6SEjXpkTEjhZR53Nzyc7-9dc9C8AI/edit?usp=sharing
4 - Perceived Feedback
- Get better evaluation metrics for regression.
- Use weight functions in the classifiers or balance the bins of deaths (using over- or undersampling).
- Maybe use the country in which the attack took place to predict the group.
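As a possible starting point for the second item, class weights can be made inversely proportional to class frequency, so that errors on rare classes cost more. The sketch below uses the heuristic n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`. The class counts are hypothetical, and whether our tooling accepts such weights remains to be checked.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Hypothetical counts for the three victim classes.
counts = {"no victims": 500, "low": 450, "high": 50}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With these made-up counts, a misclassified "high" attack weighs ten times as much as a misclassified "no victims" attack, which pushes the classifier to pay attention to the rare class.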