Week 10 (W05 Feb01) Terrorism Database - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Summary & Index:

This week we continued the predictive phase of our work. The main conclusions are:

  • We achieved good results in the terrorist group prediction task. Our model predicts that about 50% of the attacks with 'Unknown' responsibility in the Middle East & North Africa may have been conducted by ISIS and Al-Qaida.
  • We also achieved good results in the demographic data prediction task.

Wiki status: Concluded

Weekly work

Index

  • 1 - Number of kills prediction
  • 2 - Terrorist group prediction
  • 3 - World bank dataset
  • 4 - Weekly Presentation
  • 5 - Perceived Feedback

1 - Number of kills prediction

We concluded that it would be better to develop two models for predicting the number of victims, as we can see in Fig. 1.

Figure 1 - Strategy for the number of kills prediction.

ScreenShot

We continued with this prediction task, focusing on the second part: predicting the number of victims. For this subtask, we only consider attacks with a number of victims in the range [1;32[ (from 1 up to, but not including, 32). In the previous week, we had already concluded that it would be important to build a more balanced dataset, grouping the attacks into buckets with equal frequencies.
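
The balancing idea can be sketched as follows. This is only an illustration under assumptions: the bucket boundaries, the `nkill` field name and the choice of downsampling every bucket to the size of the smallest one are hypothetical, since the text does not state how the buckets were actually built.

```python
import random
from collections import defaultdict

# Hypothetical bucket boundaries over [1, 32); each tuple is [low, high).
BUCKETS = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]


def bucket_of(n_kills):
    """Return the index of the bucket that contains n_kills."""
    for index, (low, high) in enumerate(BUCKETS):
        if low <= n_kills < high:
            return index
    raise ValueError(f"{n_kills} is outside [1, 32)")


def balance_by_bucket(rows, seed=0):
    """Downsample so every non-empty bucket contributes the same number of rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[bucket_of(row["nkill"])].append(row)
    # The smallest non-empty bucket sets the size of all buckets.
    target = min(len(group) for group in grouped.values())
    rng = random.Random(seed)
    balanced = []
    for index in sorted(grouped):
        balanced.extend(rng.sample(grouped[index], target))
    return balanced
```

Downsampling discards data from the frequent buckets, so it trades sample size for balance.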

This week we decided to add three features related to demographic data. The features are:

  • population
  • fuel importation
  • income

We concluded that with these variables in the model the results got slightly better, as we can see in Table 1.

Table 1 - Comparison between the different strategies

| Linear Regression Evaluation | first approach | bucket | limit | demographic features |
|---|---|---|---|---|
| Correlation coefficient | 0.39 | 0.45 | 0.39 | 0.47 |
| Mean Absolute error | 2.54 | 5.32 | 0.39 | 5.31 |
| Root mean sq. error | 4.10 | 6.96 | 6.90 | 6.92 |
| Relative absolute error | 88.53% | 84.18% | 89.11% | 83.52% |
| Root relative sq. error | 92.82% | 89.62% | 92.07% | 88.91% |
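
For readers unfamiliar with the relative metrics in the table, the sketch below shows how the five figures are commonly defined. It is an illustration, not the evaluation code used for the table: the function name is hypothetical, and the baseline here is the mean of the evaluated values, a simplification of what an evaluation tool may use (for example the mean of the training targets).

```python
import math


def regression_report(actual, predicted):
    """Compute the five regression metrics for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of a baseline that always predicts the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    denominator = math.sqrt(base_sq * var_p)
    return {
        # Pearson correlation; undefined when either series is constant.
        "correlation": cov / denominator if denominator else float("nan"),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }
```

With these definitions, a relative error of 100% means the model is no better than always predicting the mean, so values in the 83% to 93% range indicate only a modest gain over that baseline.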

This week, other experiments were also conducted. Following a recommendation from the professor, we tried to work with a smaller sample, which can sometimes give better results because of a more appropriate balance between sample size and number of features. For this prediction task, however, we were not able to improve the results.

2 - Terrorist group prediction

During the previous week we studied how the accuracy of our classifier varied as we tried to predict among a larger number of terrorist groups, and we concluded that in most of the regions (8) the values stabilized as we increased the number of groups. Since the accuracy values did not drop substantially, we chose as final classifiers the ones with 50 terrorist groups (classes).

The table below compares the validation set accuracy (the one used to select the number of terrorist groups, i.e. classes) with the test set accuracy. We excluded the 4 regions covered in the previous wiki entry ("Australasia & Oceania", "Eastern Europe", "Central Asia" and "East Asia") from further analysis because these regions had a low number of attacks.

| Region | Accuracy (validation) | Accuracy (test) |
|---|---|---|
| Central America & Caribbean | 0.837662 | 0.800259 |
| Middle East & North Africa | 0.760121 | 0.778852 |
| North America | 0.517857 | 0.575893 |
| South America | 0.762977 | 0.741047 |
| South Asia | 0.813173 | 0.813683 |
| Southeast Asia | 0.694815 | 0.733333 |
| Sub-Saharan Africa | 0.892953 | 0.892061 |
| Western Europe | 0.720167 | 0.754533 |
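
Restricting each regional classifier to a fixed number of classes can be sketched as below. This is a minimal illustration under assumptions: the `gname` field name is hypothetical here, and dropping the attacks of less frequent groups (rather than merging them into an "other" class) is a guess, since the text does not say how they were handled.

```python
from collections import Counter


def keep_top_groups(attacks, n_groups=50, unknown_label="Unknown"):
    """Keep only attacks attributed to the n_groups most frequent known groups."""
    counts = Counter(
        attack["gname"] for attack in attacks if attack["gname"] != unknown_label
    )
    top = {name for name, _ in counts.most_common(n_groups)}
    # Attacks labelled 'Unknown' are never in `top`, so they are set aside too.
    return [attack for attack in attacks if attack["gname"] in top]
```

The attacks labelled 'Unknown' are left out of training and can later be passed to the trained classifier to obtain a predicted group.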

With the classifiers in place, we want to address one of the goals of this prediction task: can we predict the terrorist group for attacks where the group is 'Unknown'? The table below shows the number of attacks for which the responsibility is 'Unknown'; the Middle East & North Africa is the region with the highest number of such attacks:

| Region | Number of attacks without responsibility |
|---|---|
| Middle East & North Africa | 21488 |
| South Asia | 16534 |
| South America | 4814 |
| Southeast Asia | 4336 |
| Western Europe | 4304 |
| Sub-Saharan Africa | 4222 |
| Central America & Caribbean | 2807 |
| Eastern Europe | 2585 |
| North America | 656 |
| East Asia | 393 |
| Central Asia | 348 |
| Australasia & Oceania | 135 |

Given this information, we decided to try to predict the responsibility for these attacks in the Middle East & North Africa, as 72% of the attacks in this region have no group identified as responsible. The table below shows 'Unknown' followed by the top 5 groups with the highest share of attacks in this region (values are fractions of all attacks in the region):

| Group name | Percentage of attacks |
|---|---|
| Unknown | 0.721994 |
| Islamic State of Iraq and the Levant (ISIL) | 0.079699 |
| Kurdistan Workers' Party (PKK) | 0.037430 |
| Al-Qaida in Iraq | 0.019488 |
| Al-Qaida in the Arabian Peninsula (AQAP) | 0.017808 |
| Hamas (Islamic Resistance Movement) | 0.010517 |
| ... | ... |

The table below shows the 10 terrorist groups to which our model attributes the highest percentage of attacks (these percentages are relative to the attacks with 'Unknown' responsibility):

| Group name | Percentage of attacks predicted from the 'Unknown' attacks |
|---|---|
| Islamic State of Iraq and the Levant (ISIL) | 0.313105 |
| Al-Qaida in Iraq | 0.216121 |
| Islamic State of Iraq (ISI) | 0.103174 |
| Kurdistan Workers' Party (PKK) | 0.042349 |
| Al-Qaida in the Arabian Peninsula (AQAP) | 0.034298 |
| Hezbollah | 0.031087 |
| Sinai Province of the Islamic State | 0.024944 |
| Tripoli Province of the Islamic State | 0.023501 |
| Armed Islamic Group (GIA) | 0.023362 |
| Hamas (Islamic Resistance Movement) | 0.020011 |
| ... | ... |

Our model predicts that about 50% of the attacks with 'Unknown' responsibility were conducted by either the Islamic State or Al-Qaida, two of the groups with the highest number of attacks in the region.
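
The arithmetic behind this figure can be checked from the table of predicted shares. The sketch below sums the shares under two possible readings; which groups count as "Islamic State" or "Al-Qaida" is an assumption made here for illustration, since the text does not say which grouping was used.

```python
# Shares copied from the table of predicted groups (fractions of 'Unknown' attacks).
predicted_share = {
    "Islamic State of Iraq and the Levant (ISIL)": 0.313105,
    "Al-Qaida in Iraq": 0.216121,
    "Islamic State of Iraq (ISI)": 0.103174,
    "Kurdistan Workers' Party (PKK)": 0.042349,
    "Al-Qaida in the Arabian Peninsula (AQAP)": 0.034298,
    "Hezbollah": 0.031087,
    "Sinai Province of the Islamic State": 0.024944,
    "Tripoli Province of the Islamic State": 0.023501,
    "Armed Islamic Group (GIA)": 0.023362,
    "Hamas (Islamic Resistance Movement)": 0.020011,
}

# Reading 1: only the two largest entries (ISIL and Al-Qaida in Iraq).
top_two = sum(sorted(predicted_share.values(), reverse=True)[:2])

# Reading 2 (assumption): every listed group whose name mentions either organisation.
keywords = ("Islamic State", "Al-Qaida")
by_name = sum(
    share
    for name, share in predicted_share.items()
    if any(keyword in name for keyword in keywords)
)

print(f"two largest entries: {top_two:.3f}")
print(f"name match within the top 10: {by_name:.3f}")
```

The first reading gives a value close to the 50% quoted above; the second, which also counts the listed affiliates, gives a larger share.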

3 - World bank dataset

In order to add some demographic data to our dataset, we looked for data on the number of people living in a country, the fuel imports and the national income. The data we found is provided by the World Bank. For the purpose of our work, we selected the information between 1970 and 2015. With these new features we tried to improve the model for the number of kills prediction task, as already explained in section 1.

With this new dataset we tried a new prediction task. We want to understand which features contribute most to the number of terrorist attacks in a country. For this prediction task we used as features:

  • Year
  • country name
  • income
  • population
  • fuel imports

We used as dependent (target) variable:

  • number of terrorist attacks in the respective year and country

To compute this target we took into consideration that some countries have no terrorist attacks in some years. In these cases we assign 0 terrorist attacks to the country-year pair.
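
A minimal sketch of this zero-filling step is shown below. The field names `country` and `year` and the function name are hypothetical; the idea is simply to enumerate every country-year pair and default to 0 where the attack data has no record.

```python
from collections import Counter
from itertools import product


def attacks_per_country_year(attacks, countries, years):
    """Count attacks for every (country, year) pair, using 0 where none occurred."""
    counts = Counter((attack["country"], attack["year"]) for attack in attacks)
    # Iterating over the full grid is what creates the explicit zero rows.
    return {pair: counts.get(pair, 0) for pair in product(countries, years)}
```

For example, with the World Bank data restricted to 1970 to 2015, `years` would be `range(1970, 2016)` and `countries` the list of countries present in that dataset.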

3.1 - Descriptive analyses

We concluded that the frequencies of the number of terrorist attacks in a country are also unbalanced, as we can see in Figure 1.

Figure 1 - Frequencies in the number of attacks.

ScreenShot

We decided to follow a strategy similar to the one used for the prediction of the number of kills. We divided this prediction task into two subtasks, as we can see in Figure 2.

Figure 2 - Divide and conquer strategy followed to tackle the number of attacks prediction task.

ScreenShot

3.2 - Subtask 1 - Predict if a country in a particular year has terrorist attacks

Predicting whether a country has terrorist attacks in a particular year is a classification problem. In this case we are dealing with only two classes: either the country has terrorist attacks in that year, or it does not. For this subtask, we considered the features presented above. We tried different models and, using random forests, we obtained the confusion matrix presented in Table 2.

Table 2 - Confusion matrix

| classified as -> | zero attacks | some attacks |
|---|---|---|
| zero attacks | 8037 | 665 |
| some attacks | 682 | 2742 |

Several performance metrics are collected in Table 3.

Table 3 - Model evaluation

| Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area |
|---------------|--------------|--------------|-----------|-----------|-----------|-----------|
| zero attacks | 0.924 | 0.199 | 0.922 | 0.924 | 0.923 | 0.943 |
| some attacks | 0.801 | 0.076 | 0.805 | 0.801 | 0.803 | 0.943 |
| Weighted Avg. | 0.889 | 0.165 | 0.889 | 0.889 | 0.889 | 0.943 |

This model performs differently on the two classes. However, even for the less common class ("some attacks") the true positive rate is 0.80, so random forest seems to be a good solution for this problem.
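
The per-class rates in Table 3 follow directly from the confusion matrix in Table 2. The sketch below recomputes them; the function name is hypothetical, and the ROC area is left out because it cannot be derived from the confusion matrix alone.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of instances of class i classified as class j."""
    total = sum(sum(row) for row in matrix)
    report = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # class k classified as another class
        fp = sum(row[k] for row in matrix) - tp  # other classes classified as class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        report[label] = {
            "tp_rate": recall,
            "fp_rate": fp / (fp + tn),
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
        }
    return report


# Counts copied from Table 2 (rows: actual class, columns: predicted class).
matrix = [[8037, 665], [682, 2742]]
report = per_class_metrics(matrix, ["zero attacks", "some attacks"])
for label, metrics in report.items():
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

Rounded to three decimals, these values agree with the "zero attacks" and "some attacks" rows of Table 3.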

3.3 - Subtask 2 - Predict the number of terrorist attacks

After excluding the country-year pairs without terrorist attacks, we end up with around 3000 instances. For this subtask we tried several different algorithms and got the best results with the REPTree algorithm. The results are presented in Table 4.

Table 4 - Model evaluation

| Metrics | Value |
|---|---|
| Correlation coefficient | 0.76 |
| Mean Absolute error | 28.16 |
| Root mean sq. error | 92.59 |
| Relative absolute error | 53.14% |
| Root relative sq. error | 65.04% |

We also analysed which features contribute most to the number of terrorist attacks, using an attribute selection method in Weka: the attribute evaluator CfsSubsetEval with the search method ExhaustiveSearch. Three attributes were selected:

  • Year
  • Country_Name
  • fuel_import

Indeed, after removing the features that were not selected (income and population), the performance remained almost the same, as we can see in Table 5.

Table 5 - Model evaluation

| Metrics | Value |
|---|---|
| Correlation coefficient | 0.76 |
| Mean Absolute error | 27.79 |
| Root mean sq. error | 92.42 |
| Relative absolute error | 52.44% |
| Root relative sq. error | 64.92% |

4 - Weekly Presentation

https://docs.google.com/presentation/d/1yvVxcnDlVI-MIBIPhlDqLBR9q2qxFCbE-iUmk1Cuhu4/edit?usp=sharing

5 - Perceived Feedback

  • ...