Week 08 (W03 Jan18) Terrorism Database - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Summary & Index:

This week we continued the predictive phase of our work. The main conclusions are:

For our first prediction task, we provide a model that predicts with good accuracy whether a terrorist attack will cause fatalities. However, predicting the precise number of victims is a more challenging task.
For our second prediction task, building a separate model for each region proved to be a good strategy. Validation accuracy is around 0.80 or higher for the regions with the most incidents, and lower for regions with few incidents.
Relating terrorism to demographic data is a more difficult prediction task. We are still searching for a demographic dataset that fits our Terrorism Dataset.

Wiki Status - concluded

Index

1 - Number of kills prediction
2 - Terrorist group prediction
3 - Weekly Presentation
4 - Perceived Feedback

Weekly work

1 - Number of kills prediction

To start the task of predicting the number of kills in a terrorist attack, we checked the frequencies of this variable (Fig. 1). The majority of the attacks do not cause fatalities.

Figure 1 - Frequencies of the number of kills

When we tried to build a prediction model, we faced two problems:

Many incidents have 0 or 1 fatalities. As a result, the model tended to always predict these values, which produced a large error.
Few attacks have many victims (e.g. less than 1% of the attacks have more than 32 victims). We therefore treat attacks with more than 32 victims as outliers and do not consider them in our model.

To simplify the problem, we divided it into two subtasks:

Predict whether the attack has no victims (0 victims), a low victim number [1;32[, or a high victim number [32;inf[.
Predict the number of victims when the attack falls in the low category [1;32[.

With this approach we need to develop two models, one for each subtask, as shown in Fig. 2.

Figure 2 - Strategy for the number of kills prediction.
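The three-way split used in the first subtask can be sketched as follows. This is a minimal sketch and not our actual pipeline: the function name `victim_class` and the label strings are assumptions made for illustration, and the cut-off of 32 follows the intervals [1;32[ and [32;inf[ given above.

```python
def victim_class(n_killed: int) -> str:
    """Map a fatality count to one of the three classes of subtask 1.

    The label strings are hypothetical; only the boundaries come from the text.
    """
    if n_killed < 0:
        raise ValueError("fatality count cannot be negative")
    if n_killed == 0:
        return "none"   # no victims
    if n_killed < 32:
        return "low"    # interval [1;32[
    return "high"       # interval [32;inf[


# Example fatality counts (made up to exercise each boundary).
counts = [0, 1, 5, 31, 32, 120]
labels = [victim_class(n) for n in counts]
print(labels)
```

Only the attacks labelled "low" would then be passed on to the regression model of the second subtask.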

We then started the first task.

1) Task 1: Are there mortal victims?

As Fig. 3 shows, the numbers of attacks "with no victims" and with a "low victim number" are similar, whereas only a small minority of attacks have a "high victim number". We therefore aim for a model that performs well on the "with no victims" and "low victim number" classes, and we treat attacks with more than 32 victims as outliers.

Figure 3 - Frequencies of the number of kills

We built a model for this classification problem. The algorithm that achieved the best results was Random Forest; its performance is shown in Fig. 4 and Fig. 5.

Figure 4 - Performance of random forest

Figure 5 - Confusion matrix with random forest

We conclude that the resulting model solves this problem well (its accuracy is about 80%). It does not perform well for attacks with a high number of victims, but we do not consider this problematic given how rare such attacks are.
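The point that a high overall accuracy can coexist with poor performance on a rare class can be illustrated with a small sketch. The numbers in `confusion` are invented for illustration and are not the values of our confusion matrix; the class labels are also assumptions.

```python
# Rows = true class, columns = predicted class.
# INVENTED numbers, chosen only to illustrate the effect of a rare class.
classes = ["none", "low", "high"]
confusion = [
    [420, 75, 5],    # true "none"
    [90, 380, 10],   # true "low"
    [3, 15, 2],      # true "high" (rare class)
]


def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """For each true class, the fraction of its instances predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


overall = accuracy(confusion)
recalls = recall_per_class(confusion)
print(f"accuracy = {overall:.3f}")
for name, r in zip(classes, recalls):
    print(f"recall[{name}] = {r:.3f}")
```

In this made-up example the rare class contributes so few instances that its low recall barely moves the overall accuracy, which is why per-class figures are worth reporting next to the accuracy.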

2) Task 2: Predict the number of victims when it is a "low victim number"

For this task we tried several algorithms. As Fig. 4 below shows, linear regression performed best.

Figure 4 - Metrics obtained with Linear Regression, Multilayer Perceptron and RepTrees.

Fig. 5 below shows that, although linear regression was the best-performing algorithm, the results are far from ideal.

Figure 5 - Comparison between Real value (x axis) and predicted value (y axis) using Linear Regression.
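For reference, two common error measures for such a regression can be computed as sketched below. This is only an illustration: we assume mean absolute error (MAE) and root mean squared error (RMSE) as example metrics, and the values in `y_true` and `y_pred` are invented, not taken from our model.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# INVENTED example: real and predicted fatality counts in the low category.
y_true = [1, 2, 1, 10, 25]
y_pred = [2.0, 2.0, 3.0, 6.0, 9.0]

print(f"MAE  = {mae(y_true, y_pred):.2f}")
print(f"RMSE = {rmse(y_true, y_pred):.2f}")
```

Because most attacks in the low category have very few victims, a single badly predicted attack with many victims raises the RMSE much more than the MAE, so comparing the two gives a hint of how the error is distributed.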

Next week we will try to improve this model (e.g. through feature extraction or selection).

2 - Terrorist group prediction

This week we also started working on predicting the terrorist group responsible for an attack, given the characteristics of the attack. This prediction task has two motivations:

Terrorist groups may not claim responsibility right away (or ever): our model can help identify the likely group before any claim is made so that, for example, the police can track active cells of that group faster and catch potential perpetrators before they flee.
Terrorist groups may claim responsibility falsely: our model can help detect groups that claim attacks they were not responsible for (for example for propaganda purposes) and help to find the real perpetrators.

2.1 - Methodology Definition

For this prediction task, we dropped all attacks for which it was doubtful whether they should be considered terrorist attacks. We also manually selected the "real terrorist groups" with the help of search engines (for example, "Palestine" should not be considered a terrorist group), and we kept the groups that were marked as "Other".

The 10 most frequent groups worldwide are listed in the table below:

Terrorist group	Incidents
Other	14695
Taliban	4762
Shining Path (SL)	4134
Islamic State of Iraq and the Levant (ISIL)	2372
Farabundo Marti National Liberation Front (FMLN)	2129
Revolutionary Armed Forces of Colombia (FARC)	2037
Basque Fatherland and Freedom (ETA)	1903
Irish Republican Army (IRA)	1884
Boko Haram	1700
Communist Party of India - Maoist (CPI-Maoist)	1612

During this prediction task we used the following features:

Year of the attack (Numeric)
Existence of multiple attacks (Boolean)
Success or not of the attack (Boolean)
Whether the attack was a suicide attack by one or more perpetrators (Boolean)
Number of perpetrators (Numeric)
Number of perpetrators captured (Numeric)
Number of killings (Numeric)
Number of wounded (Numeric)
Number of hostages kidnapped (Numeric)
Ransom amount demanded (Numeric)
Ransom amount paid (Numeric)
Number of released hostages (Numeric)
Type of the attack (Categorical)
Type of the target (Categorical)
Nationality of the victims (Categorical)
Type of the weapon (Categorical)
Subtype of the weapon (Categorical)

The categorical features were encoded using one-hot encoding (that is, each category was turned into its own boolean variable).
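A minimal sketch of one-hot encoding for a single categorical feature is shown below. It is written in plain Python for illustration and is not the tooling we used; the function name `one_hot` and the example category values are assumptions.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator rows.

    Returns the sorted list of categories (one column per category)
    and one row of indicators per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


# Hypothetical values for the "type of the attack" feature.
attack_types = ["Bombing", "Armed Assault", "Bombing", "Kidnapping"]
categories, encoded = one_hot(attack_types)

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, in the column of the category that the attack belongs to.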

To solve this prediction problem, we divided the dataset by the geographic region of the attack (12 regions in total) and built a separate model for each regional sub-dataset. Each model solves a multiclass classification problem with six classes: the 5 terrorist groups with the most incidents in that region, plus one class ("Other") that covers all the other groups in the region.
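The per-region class definition can be sketched as follows. This is a hedged illustration rather than our actual preprocessing: the function names, the `(region, group)` record layout and the toy data are assumptions, and we assume that attacks already marked "Other" do not compete for the top-5 slots.

```python
from collections import Counter, defaultdict


def relabel_top_groups(groups, top_n=5, other_label="Other"):
    """Keep the top_n most frequent groups and map every other group to other_label."""
    counts = Counter(g for g in groups if g != other_label)
    keep = {g for g, _ in counts.most_common(top_n)}
    return [g if g in keep else other_label for g in groups]


def relabel_by_region(records, top_n=5, other_label="Other"):
    """Split (region, group) records by region and relabel each region separately."""
    by_region = defaultdict(list)
    for region, group in records:
        by_region[region].append(group)
    return {
        region: relabel_top_groups(groups, top_n, other_label)
        for region, groups in by_region.items()
    }


# Toy example with hypothetical group names and top_n=2 to keep it short.
toy = ["A", "A", "A", "B", "B", "C", "Other", "D"]
relabelled = relabel_top_groups(toy, top_n=2)
print(relabelled)
```

With `top_n=5` each region ends up with at most six classes, which matches the setup described above.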

We then trained a Random Forest model (using a One-vs-All Multiclass Classifier) for each geographical region, splitting the data into Training (60%), Validation (20%) and Test (20%) sets.
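A 60/20/20 split can be sketched as below. The sketch assumes a simple random shuffle with a fixed seed; whether our split was random, stratified by class or ordered in some other way is not specified here, and the function name `split_60_20_20` is hypothetical.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle the rows and split them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remaining rows, about 20%
    return train, val, test


# Example with 100 dummy row indices.
train, val, test = split_60_20_20(range(100))
print(len(train), len(val), len(test))
```

The validation part is used to compare models and settings, while the test part is held back for a final, unbiased estimate.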

2.2 - Results

The accuracies on the Validation set are presented below:

Geographical region	Accuracy fraction (correct/total)	Accuracy value
Sub-Saharan Africa	1040/1121	0.92774308653
Middle East & North Africa	1488/1655	0.899093655589
South Asia	2058/2323	0.885923374946
Central America & Caribbean	681/770	0.884415584416
North America	179/224	0.799107142857
Western Europe	1141/1433	0.796231681786
Southeast Asia	530/675	0.785185185185
South America	1707/2177	0.784106568672
Australasia & Oceania	9/12	0.75
Eastern Europe	27/43	0.627906976744
Central Asia	6/10	0.6
East Asia	17/32	0.53125
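Since the regions differ greatly in size, it helps to summarise the table in two ways: the pooled accuracy over all validation instances and the unweighted mean over regions. The sketch below uses only the fractions from the table above; the variable names are ours.

```python
# (correctly predicted, total) validation instances per region, from the table above.
results = {
    "Sub-Saharan Africa": (1040, 1121),
    "Middle East & North Africa": (1488, 1655),
    "South Asia": (2058, 2323),
    "Central America & Caribbean": (681, 770),
    "North America": (179, 224),
    "Western Europe": (1141, 1433),
    "Southeast Asia": (530, 675),
    "South America": (1707, 2177),
    "Australasia & Oceania": (9, 12),
    "Eastern Europe": (27, 43),
    "Central Asia": (6, 10),
    "East Asia": (17, 32),
}

# Accuracy of each regional model.
per_region = {region: correct / total for region, (correct, total) in results.items()}

# Pooled accuracy: every validation instance counts equally.
pooled = sum(c for c, _ in results.values()) / sum(t for _, t in results.values())

# Unweighted mean: every region counts equally, regardless of its size.
unweighted_mean = sum(per_region.values()) / len(per_region)

print(f"pooled accuracy = {pooled:.3f}")
print(f"unweighted mean = {unweighted_mean:.3f}")
```

The pooled value (about 0.85) is higher than the unweighted mean (about 0.77) because the regions with the lowest accuracy also have very few validation instances.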

Given the high accuracy of the model when predicting terrorist groups in South Asia, as well as the high number of correctly predicted instances, we decided to study the model for this geographic region further.

2.2.1 - South Asia case study

First, it is important to look at the frequencies of the different classes in this region, which are presented below:

Terrorist group	Incidents
Taliban	4762
Other	2790
Communist Party of India - Maoist (CPI-Maoist)	1612
Liberation Tigers of Tamil Eelam (LTTE)	1085
Tehrik-i-Taliban Pakistan (TTP)	990
People's Liberation Front (JVP)	377

The table shows that "Taliban" and "Other" are the dominant classes. An analysis of the confusion matrix (Fig. 6 below) shows that the model's predictions remain balanced across the classes despite this class imbalance.

Figure 6 - Confusion matrix of the predicted group.
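To put the class imbalance in perspective, a useful reference point is a classifier that always predicts the most frequent class. The sketch below computes that baseline from the South Asia class frequencies and compares it with the validation accuracy reported for this region. It assumes that the validation set has roughly the same class proportions as the whole regional dataset.

```python
# Incidents per class in South Asia, from the frequency table above.
south_asia_counts = {
    "Taliban": 4762,
    "Other": 2790,
    "Communist Party of India - Maoist (CPI-Maoist)": 1612,
    "Liberation Tigers of Tamil Eelam (LTTE)": 1085,
    "Tehrik-i-Taliban Pakistan (TTP)": 990,
    "People's Liberation Front (JVP)": 377,
}

# Validation accuracy of the South Asia model, from the results table.
model_accuracy = 2058 / 2323

total = sum(south_asia_counts.values())
majority_class = max(south_asia_counts, key=south_asia_counts.get)

# Accuracy of always predicting the most frequent class
# (ASSUMPTION: same class proportions in the validation set).
baseline = south_asia_counts[majority_class] / total

print(f"majority class    = {majority_class}")
print(f"baseline accuracy = {baseline:.3f}")
print(f"model accuracy    = {model_accuracy:.3f}")
```

Under this assumption the baseline reaches roughly 0.41, so the accuracy of the model cannot be explained by the dominance of the "Taliban" class alone.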

3 - Weekly Presentation

https://docs.google.com/presentation/d/1uW9iSe1YXwRk-u6SEjXpkTEjhZR53Nzyc7-9dc9C8AI/edit?usp=sharing

4 - Perceived Feedback

Get better evaluation metrics for regression.
Use weight functions in the classifiers or balance the bins of deaths (using over or under sampling).
Maybe use the country in which the attack took place to predict the group.
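The second feedback point can be sketched with inverse-frequency class weights, where each class c gets the weight N / (K * n_c) for N instances, K classes and n_c instances of class c. This is one common weighting heuristic and not something we have applied yet; the function name is hypothetical, and the South Asia class counts from section 2.2.1 serve only as example input.

```python
def balanced_class_weights(counts):
    """Weight each class by N / (K * n_c), so that rare classes get larger weights."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Example input: South Asia class frequencies from section 2.2.1.
south_asia_counts = {
    "Taliban": 4762,
    "Other": 2790,
    "Communist Party of India - Maoist (CPI-Maoist)": 1612,
    "Liberation Tigers of Tamil Eelam (LTTE)": 1085,
    "Tehrik-i-Taliban Pakistan (TTP)": 990,
    "People's Liberation Front (JVP)": 377,
}

weights = balanced_class_weights(south_asia_counts)
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.2f}")
```

With these weights every class contributes the same total weight to the training loss. Over- or undersampling pursues the same goal by changing the number of training instances per class instead.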