Week 10 (W05 Feb01) Terrorism Database - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Summary & Index:

This week we continued the predictive phase of our work. The main conclusions are:

We achieved good results in the terrorist group prediction task. Our model suggests that about 50% of the attacks with 'Unknown' responsibility in the Middle East & North Africa may have been conducted by ISIS and Al-Qaida.
We also achieved good results in the demographic data prediction task.

Wiki status: Concluded

Weekly work

Index

1 - Number of kills prediction
2 - Terrorist group prediction
3 - World bank dataset
4 - Weekly Presentation
5 - Perceived Feedback

1 - Number of kills prediction

We concluded that it would be better to develop two models for predicting the number of victims, as shown in Fig. 1.

Figure 1 - Strategy for the number of kills prediction.

We continued with this prediction task, focusing on the second part: predicting the number of victims. For this subtask, we only consider attacks with a number of victims in the range [1, 32), i.e. from 1 to 31. In the previous week, we had already concluded that it would be important to build a more balanced dataset by grouping the values into buckets with equal frequencies.
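The wiki does not say how the buckets were built. One possible reading is equal-frequency (quantile) binning, sketched below under that assumption. The victim counts and the helper names `equal_frequency_edges` and `assign_bucket` are hypothetical and only illustrate the idea.

```python
from collections import Counter

def equal_frequency_edges(values, n_buckets):
    """Return n_buckets - 1 cut points taken at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_buckets] for i in range(1, n_buckets)]

def assign_bucket(value, edges):
    """Bucket index of value; each edge is the inclusive lower bound of a bucket."""
    return sum(value >= edge for edge in edges)

# Hypothetical victim counts, restricted to the range [1, 32)
victims = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 6, 8, 10, 31]
victims = [v for v in victims if 1 <= v < 32]

edges = equal_frequency_edges(victims, n_buckets=4)
counts = Counter(assign_bucket(v, edges) for v in victims)
print(edges, dict(counts))
```

With heavily tied values (many attacks with exactly one victim), the buckets cannot always be made exactly equal, so the resulting counts should be checked rather than assumed.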

This week we decided to add three features related to demographic data. The features are:

population
fuel importation
income

We concluded that with these variables in the model the results got slightly better, as we can see in Table 1.

Table 1 - Comparison between the different strategies

Linear Regression Evaluation	first approach	bucket	limit	demographic features
Correlation coefficient	0.39	0.45	0.39	0.47
Mean Absolute error	2.54	5.32	0.39	5.31
Root mean sq. error	4.10	6.96	6.90	6.92
Relative absolute error	88.53%	84.18%	89.11%	83.52%
Root relative sq. error	92.82%	89.62%	92.07%	88.91%
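For readers unfamiliar with the rows of Table 1, the following is a minimal sketch of how such metrics are commonly defined. It assumes the relative errors compare the model against a baseline that always predicts the mean of the actual values; the exact baseline used by the tool that produced the table may differ (for example, the mean of the training set). The data and the function name `regression_report` are hypothetical.

```python
import math

def regression_report(actual, predicted):
    """Common regression metrics; the baseline is the mean of `actual` (an assumption)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Errors of the baseline that always predicts the mean
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        "rae_percent": 100 * sum(abs_err) / base_abs,
        "rrse_percent": 100 * math.sqrt(sum(sq_err) / base_sq),
    }

# Hypothetical values, not taken from the dataset
actual = [1, 2, 3, 4]
predicted = [2, 2, 2, 4]
report = regression_report(actual, predicted)
print(report)
```

Under these definitions, a relative error of 100% means the model is no better than the mean baseline, so values in the 80% to 90% range indicate only a modest gain over it.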

Other experiments were also conducted this week. Following a recommendation from the professor, we tried to work with a smaller sample, which can sometimes give better results because of a more appropriate balance between sample size and number of features. For this prediction task, however, we were not able to improve the results.

2 - Terrorist group prediction

During the previous week we studied how the accuracy of our classifier varied as we increased the number of terrorist groups it had to choose among, and we concluded that in most of the regions (8) the values stabilized as this number increased. Since the accuracy values did not drop substantially, we took as final classifiers the ones with 50 terrorist groups (classes).

In the table below we can see the comparison between the validation set accuracy and the test set accuracy (the former was used to select the number of terrorist groups, i.e. classes). We excluded the 4 regions covered in the previous wiki entry ("Australasia & Oceania", "Eastern Europe", "Central Asia" and "East Asia") from further analysis because these regions had a low number of attacks.

Region	Accuracy (validation)	Accuracy (test)
Central America & Caribbean	0.837662	0.800259
Middle East & North Africa	0.760121	0.778852
North America	0.517857	0.575893
South America	0.762977	0.741047
South Asia	0.813173	0.813683
Southeast Asia	0.694815	0.733333
Sub-Saharan Africa	0.892953	0.892061
Western Europe	0.720167	0.754533

With the classifiers trained, we wanted to address one of the goals of this prediction task: can we predict the terrorist group for attacks where the group is 'Unknown'? The table below shows the number of attacks for which the responsibility is 'Unknown'. Middle East & North Africa is the region with the highest number of such attacks:

Region	Number of attacks without responsibility
Middle East & North Africa	21488
South Asia	16534
South America	4814
Southeast Asia	4336
Western Europe	4304
Sub-Saharan Africa	4222
Central America & Caribbean	2807
Eastern Europe	2585
North America	656
East Asia	393
Central Asia	348
Australasia & Oceania	135

Given this information, we decided to try to predict the responsibility for the attacks in Middle East & North Africa, as 72% of the attacks in this region do not have any responsibility attributed yet. The table below shows the top 5 groups with the highest percentage of attacks in this region, alongside the 'Unknown' share:

Group name	Percentage of attacks
Unknown	0.721994
Islamic State of Iraq and the Levant (ISIL)	0.079699
Kurdistan Workers' Party (PKK)	0.037430
Al-Qaida in Iraq	0.019488
Al-Qaida in the Arabian Peninsula (AQAP)	0.017808
Hamas (Islamic Resistance Movement)	0.010517
...	...

The table below shows the top 10 terrorist groups with the highest percentage of attacks predicted by our model (these percentages are relative to the attacks with 'Unknown' responsibility):

Group name	Percentage of attacks predicted from the 'Unknown' attacks
Islamic State of Iraq and the Levant (ISIL)	0.313105
Al-Qaida in Iraq	0.216121
Islamic State of Iraq (ISI)	0.103174
Kurdistan Workers' Party (PKK)	0.042349
Al-Qaida in the Arabian Peninsula (AQAP)	0.034298
Hezbollah	0.031087
Sinai Province of the Islamic State	0.024944
Tripoli Province of the Islamic State	0.023501
Armed Islamic Group (GIA)	0.023362
Hamas (Islamic Resistance Movement)	0.020011
...	...

Our model predicts that roughly 50% of the attacks with 'Unknown' responsibility were conducted by either the Islamic State or Al-Qaida (ISIL and Al-Qaida in Iraq alone account for about 53%), two of the groups with the highest number of attacks in the region.
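The shares in the table above can be added up to check this figure. The wiki does not state which groups were summed, so the sketch below shows two readings: the two largest groups only, and every listed group whose name contains "Islamic State" or "Al-Qaida". The grouping by name is an assumption made for illustration.

```python
# Predicted shares of the 'Unknown' attacks, copied from the table above
predicted_share = {
    "Islamic State of Iraq and the Levant (ISIL)": 0.313105,
    "Al-Qaida in Iraq": 0.216121,
    "Islamic State of Iraq (ISI)": 0.103174,
    "Kurdistan Workers' Party (PKK)": 0.042349,
    "Al-Qaida in the Arabian Peninsula (AQAP)": 0.034298,
    "Hezbollah": 0.031087,
    "Sinai Province of the Islamic State": 0.024944,
    "Tripoli Province of the Islamic State": 0.023501,
    "Armed Islamic Group (GIA)": 0.023362,
    "Hamas (Islamic Resistance Movement)": 0.020011,
}

# Reading 1: only the two largest groups
top_two = (predicted_share["Islamic State of Iraq and the Levant (ISIL)"]
           + predicted_share["Al-Qaida in Iraq"])

# Reading 2: every listed group matched by name (an assumed grouping)
family_total = sum(
    share for name, share in predicted_share.items()
    if "Islamic State" in name or "Al-Qaida" in name
)

print(f"top two groups: {top_two:.3f}")
print(f"all matching groups: {family_total:.3f}")
```

Under the first reading the share is about 53%; under the second it rises to about 72% of the listed 'Unknown' attacks, so the 50% figure is best read as a lower bound.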

3 - World bank dataset

In order to add demographic data to our dataset, we collected data on the population of each country, its fuel imports and its national income. The data is provided by the World Bank. For the purpose of our work, we selected the information between 1970 and 2015. With these new features we tried to improve the model for the number of kills prediction task, as already explained in chapter 1.

With this new dataset we also tried a new prediction task. We want to understand which features contribute most to the number of terrorist attacks in a country. For this prediction task we used as features:

Year
country name
income
population
fuel imports

We used as dependent (target) variable:

number of terrorist attacks in the respective year and country

To compute this variable, we took into consideration that some countries have no terrorist attacks in some years. In that case we assign 0 terrorist attacks to the country-year pair.
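A minimal sketch of this zero-filling step is shown below. The attack records, country list and year range are hypothetical placeholders; in the actual work the years would span 1970 to 2015.

```python
from collections import Counter
from itertools import product

# Hypothetical attack records, one (country, year) pair per attack
attacks = [("Iraq", 2014), ("Iraq", 2014), ("Iraq", 2015), ("Portugal", 2015)]

# Hypothetical universe of countries and years covered by the demographic data
countries = ["Iraq", "Portugal", "Iceland"]
years = range(2014, 2016)

observed = Counter(attacks)

# Every country-year pair gets a count; pairs without attacks get 0
attack_counts = {
    (country, year): observed.get((country, year), 0)
    for country, year in product(countries, years)
}
print(attack_counts)
```

Building the full grid first, rather than counting only the observed pairs, is what makes the zero-attack rows appear in the dataset at all.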

3.1 - Descriptive analyses

We concluded that the frequencies of the number of terrorist attacks in a country are also unbalanced, as we can see in Figure 1.

Figure 1 - Frequencies in the number of attacks.

We decided to follow a strategy similar to the one used for the prediction of the number of kills. We divided this prediction task into two subtasks, as we can see in Figure 2.

Figure 2 - Divide and conquer strategy followed to tackle the number of attacks prediction task.
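The two-stage idea can be sketched as follows: a classifier first decides whether a country-year pair has any attack, and a regressor is consulted only when it does. The function `two_stage_predict` and the two stub models are hypothetical stand-ins for the trained models, not the ones used in this work.

```python
def two_stage_predict(row, has_attacks, count_attacks):
    """Stage 1 decides if any attack occurs; stage 2 runs only if it does."""
    if not has_attacks(row):
        return 0
    return count_attacks(row)

# Hypothetical stand-ins for the trained classifier and regressor
def stub_classifier(row):
    return row["country"] == "A"

def stub_regressor(row):
    return 12

rows = [{"country": "A", "year": 2000}, {"country": "B", "year": 2000}]
predictions = [two_stage_predict(r, stub_classifier, stub_regressor) for r in rows]
print(predictions)
```

Splitting the task this way keeps the many zero-attack rows from dominating the regression model, which is trained only on rows with at least one attack.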

3.2 - Subtask 1 - Predict if a country in a particular year has terrorist attacks

Predicting whether a country has terrorist attacks in a particular year is a classification problem. In this case we are dealing with only two classes: either the country has terrorist attacks in that year, or it does not. For this subtask, we considered the features listed above. We tried different models and, using random forests, we obtained the confusion matrix presented in Table 2.

Table 2 - Confusion matrix

classified as ->	zero attacks	some attacks
zero attacks	8037	665
some attacks	682	2742

Several performance metrics are collected in Table 3.

Table 3 - Model evaluation

Class	TP Rate	FP Rate	Precision	Recall	F-Measure	ROC Area
zero attacks	0.924	0.199	0.922	0.924	0.923	0.943
some attacks	0.801	0.076	0.805	0.801	0.803	0.943
Weighted Avg.	0.889	0.165	0.889	0.889	0.889	0.943
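The per-class figures in Table 3 (except the ROC area, which needs the class probabilities) can be re-derived from the confusion matrix in Table 2. The sketch below does this with the standard definitions; the helper name `class_metrics` is ours, and rows are taken to be the actual class and columns the predicted class.

```python
# Rows: actual class, columns: predicted class (values from Table 2)
confusion = {
    "zero attacks": {"zero attacks": 8037, "some attacks": 665},
    "some attacks": {"zero attacks": 682, "some attacks": 2742},
}

def class_metrics(confusion, positive):
    """One-vs-rest metrics for the class `positive`."""
    tp = confusion[positive][positive]
    fn = sum(v for k, v in confusion[positive].items() if k != positive)
    fp = sum(row[positive] for label, row in confusion.items() if label != positive)
    tn = sum(v for label, row in confusion.items() if label != positive
             for k, v in row.items() if k != positive)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

zero = class_metrics(confusion, "zero attacks")
some = class_metrics(confusion, "some attacks")
total = sum(v for row in confusion.values() for v in row.values())
accuracy = (confusion["zero attacks"]["zero attacks"]
            + confusion["some attacks"]["some attacks"]) / total

for name, m in (("zero attacks", zero), ("some attacks", some)):
    print(name, {k: round(v, 3) for k, v in m.items()})
print("accuracy", round(accuracy, 3))
```

Rounded to three decimals, the values match Table 3, and the overall accuracy equals the weighted average TP rate of 0.889.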

We concluded that the model performs differently on the two classes. However, even for the less common class ("some attacks") the true positive rate is 0.80, so random forests seem to be a good solution for this problem.

3.3 - Subtask 2 - Predict the number of terrorist attacks

After excluding the country-year pairs without terrorist attacks, we ended up with around 3000 instances. We tried several different algorithms and obtained the best results with the REPTree algorithm. These results are presented in Table 4.

Table 4 - Model evaluation

Metrics	Value
Correlation coefficient	0.76
Mean Absolute error	28.16
Root mean sq. error	92.59
Relative absolute error	53.14%
Root relative sq. error	65.04%

In addition, we analysed which features contribute most to the number of terrorist attacks. We used an attribute selection method from Weka: the attribute evaluator CfsSubsetEval with the search method ExhaustiveSearch. Three attributes were selected:

Year
Country_Name
fuel_import
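As background, correlation-based feature selection (CFS) scores a subset $S$ of $k$ features with a merit that rewards correlation with the target and penalises redundancy among the features. The formula below is the usual textbook form and is given here as an assumption about what the evaluator optimises; the exact correlation measure used by the tool depends on the attribute types.

```latex
\mathrm{Merit}_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}
```

Here $\overline{r_{cf}}$ is the mean feature-target correlation and $\overline{r_{ff}}$ is the mean feature-feature correlation within $S$. With the 5 candidate features listed earlier, an exhaustive search only has to score $2^5 - 1 = 31$ non-empty subsets, which is why it is feasible here.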

Indeed, after removing the features that were not selected (income and population), the performance remained almost the same, as we can see in Table 5.

Table 5 - Model evaluation

Metrics	Value
Correlation coefficient	0.76
Mean Absolute error	27.79
Root mean sq. error	92.42
Relative absolute error	52.44%
Root relative sq. error	64.92%

4 - Weekly Presentation

https://docs.google.com/presentation/d/1yvVxcnDlVI-MIBIPhlDqLBR9q2qxFCbE-iUmk1Cuhu4/edit?usp=sharing