Week 09 (W04 Jan25) Terrorism Database - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Summary & Index:

This week we continued the predictive phase of our work. The main conclusions are:

We improved the prediction of the number of kills, but we are still far from a good model.
The terrorist group prediction achieved good accuracy, even when distinguishing between many possible terrorist groups in some regions.
We discovered a dataset with demographic data that can help us in the prediction tasks.

Wiki status: concluded

Weekly work

Index

1 - Number of kills prediction
2 - Terrorist group prediction
3 - World bank dataset
4 - Weekly Presentation
5 - Perceived Feedback

1 - Number of kills prediction

Last week, we concluded that it would be better to develop two models for predicting the number of victims, as shown in Fig. 1.

Figure 1 - Strategy for the number of kills prediction.

We continued with this prediction task, focusing on the second part: predicting the number of victims. For this subtask, we only consider attacks with a number of victims in the range [1;32[. Last week, we had already concluded that it would be important to build a dataset that is more balanced with respect to the number of victims. We therefore analyzed the frequency distribution within this set in more detail (Fig. 2).

Figure 2 - Frequencies of the number of kills in the range [1;32[

We concluded that the dataset needed a better balance between smaller and larger values, and set out to test whether this improves the model results.

We tested two approaches:

Undersample each bucket of values (1, [2;4[, [4;8[, [8;16[, [16;32[) so that it has 1000 instances.
Ensure that each integer value has 500 or fewer instances in the dataset.

1.1 Undersample each bucket of data

To illustrate this procedure, the result is shown in Fig. 3.

Figure 3 - Strategy of ensuring that each bucket is represented by 1000 instances.

Bucket	N
1	1000
[2;4[	1000
[4;8[	1000
[8;16[	1000
[16;32[	1000

The actual frequencies obtained for each integer value are shown in Fig. 4.

Figure 4 - Real frequencies in the sample, after balancing the buckets.
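The bucket-level undersampling can be sketched in a few lines of Python. This is a minimal sketch, not the procedure actually used in the project: the row layout (dicts with an `nkill` field), the function names and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Buckets from the table above, written as half-open ranges [low, high).
BUCKETS = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]


def bucket_of(nkill):
    """Return the (low, high) bucket containing nkill, or None if out of range."""
    for low, high in BUCKETS:
        if low <= nkill < high:
            return (low, high)
    return None


def undersample_buckets(rows, per_bucket=1000, seed=0):
    """Randomly keep at most `per_bucket` rows from each bucket.

    `rows` is a list of dicts with an integer "nkill" field
    (a hypothetical layout, not the project's actual schema).
    """
    rng = random.Random(seed)
    grouped = defaultdict(list)
    for row in rows:
        bucket = bucket_of(row["nkill"])
        if bucket is not None:  # drop attacks outside [1;32[
            grouped[bucket].append(row)

    sample = []
    for bucket in BUCKETS:
        members = grouped[bucket]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample
```

Because rows are drawn uniformly within each bucket, the integer values that are most frequent inside a bucket remain the most frequent in the sample.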

1.2 Limit the number of repetitions for each integer value

In the previous section, we undersampled the data in a way that left some values much more represented than others. We therefore decided to also test a more uniform distribution. To illustrate this procedure, the result is shown in Fig. 5.

Figure 5 - Real frequencies in the sample, after limiting the number of repetitions of each integer value.
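The per-value limit can be sketched in the same way. Again, this is only an illustration under assumptions: the `nkill` field name, the function name and the seed are hypothetical, and the cap of 500 is the one stated in the list of approaches.

```python
import random
from collections import defaultdict


def cap_per_value(rows, max_per_value=500, seed=0):
    """Keep at most `max_per_value` rows for each integer number of kills.

    `rows` is a list of dicts with an integer "nkill" field
    (a hypothetical layout). Only values in [1;32[ are kept.
    """
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for row in rows:
        if 1 <= row["nkill"] < 32:
            by_value[row["nkill"]].append(row)

    sample = []
    for value in sorted(by_value):
        members = by_value[value]
        if len(members) > max_per_value:
            # Too many instances of this value: draw a random subset.
            members = rng.sample(members, max_per_value)
        sample.extend(members)
    return sample
```

Values with fewer instances than the cap are kept in full, so the result is flatter than the original distribution but not perfectly uniform.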

In the next subsection, we analyze the results achieved with these two strategies.

1.3 Achieved results

Table 1 compares the results of the two undersampling strategies with those from the previous week.

Table 1 - Comparison between the different undersampling strategies

Linear Regression Evaluation	last week	1.1 - bucket	1.2 - limit
Correlation coefficient	0.39	0.45	0.39
Mean Absolute error	2.54	5.32	5.36
Root mean sq. error	4.10	6.96	6.90
Relative absolute error	88.53%	84.18%	89.11%
Root relative sq. error	92.82%	89.62%	92.07%
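For readers unfamiliar with these measures, the sketch below computes them from their usual textbook definitions. It is an assumption that the reported figures follow these definitions; in particular, the relative errors here use the mean of the evaluated target values as the baseline predictor, whereas some tools take that mean from the training set. The function name is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression evaluation measures.

    Relative errors compare the model against a baseline that always
    predicts the mean of `actual` (an assumption of this sketch).
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }
```

Under these definitions, the absolute errors depend on the scale and spread of the target values, while the relative errors are normalized by the error of the mean predictor on the same data.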

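We concluded that the first approach (1.1 - bucket) slightly improved on last week's results: the correlation coefficient rose from 0.39 to 0.45 and the relative absolute error fell from 88.53% to 84.18%. The second approach (1.2 - limit) did not bring an improvement. The absolute errors (mean absolute error and root mean squared error) are higher for both strategies, but they are not directly comparable with last week's values, since the resampled datasets have a different distribution of the number of kills. After these experiments, we also tried other algorithms (e.g. REPTree, Multilayer Perceptron) and different feature selection techniques (e.g. PCA, ReliefF), without any improvement in the model. We therefore suspect that the dataset lacks some features that would be important for this prediction task.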

2 - Terrorist group prediction

The assumption behind our solution for terrorist group prediction, as seen in the previous week, is that the most frequent terrorist groups in a region (i.e. those with the highest number of incidents) are the ones most likely to be responsible for a new attack for which no group has claimed responsibility. Based on this assumption, we created a multi-class model for each region that attributes an attack either to one of the 5 most frequent terrorist groups in that region or to an extra class (designated as "Other") that encapsulates all the remaining groups operating in the region.
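The class labels for each regional model can be built as in the sketch below. This is a hedged illustration rather than the project's code: the input layout (a list of `(region, group)` pairs) and the function name are assumptions.

```python
from collections import Counter, defaultdict


def relabel_top_groups(incidents, k=5, other_label="Other"):
    """Map each group to itself if it is among the k most frequent
    groups of its region, and to `other_label` otherwise.

    `incidents` is a list of (region, group) pairs (hypothetical layout).
    Returns a list of (region, label) pairs in the same order.
    """
    counts = defaultdict(Counter)
    for region, group in incidents:
        counts[region][group] += 1

    # Set of the k most frequent groups for every region.
    top = {
        region: {group for group, _ in counter.most_common(k)}
        for region, counter in counts.items()
    }
    return [
        (region, group if group in top[region] else other_label)
        for region, group in incidents
    ]
```

Each regional model is then trained only on the rows of its region, with at most k + 1 distinct labels.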

However, a question arose: can we adjust the number of terrorist groups that the model distinguishes in each region, depending on the data we have? Regions with lower volumes of data may only allow us to attribute an attack to either 1 specific group or "Other", while regions with higher volumes of data may allow more specific predictions regarding the group.

To answer this question, we calculated the accuracy of each regional model while varying the number of terrorist groups distinguished by the model from 1 to 50. Figure 6 shows the resulting plot.

Figure 6 - Plot of the model accuracy across different regions when varying the number of terrorist groups to distinguish between.
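The sweep over the number of groups can be organized as in the following sketch. The source does not state which classifier or evaluation routine was used, so the scorer here is a deliberately simple stand-in (the accuracy of always predicting the most frequent label); all function names and the input layout are assumptions. A real experiment would pass a `score` function that trains and evaluates a model.

```python
from collections import Counter


def relabel(groups, k, other_label="Other"):
    """Keep the k most frequent group names; map the rest to other_label."""
    top = {group for group, _ in Counter(groups).most_common(k)}
    return [group if group in top else other_label for group in groups]


def majority_baseline_accuracy(labels):
    """Stand-in scorer: accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def sweep_num_groups(groups_by_region, score=majority_baseline_accuracy,
                     ks=range(1, 51)):
    """Return {region: [(k, score), ...]} for every number of groups k.

    `groups_by_region` maps a region name to the list of group names of
    its incidents (hypothetical layout).
    """
    return {
        region: [(k, score(relabel(groups, k))) for k in ks]
        for region, groups in groups_by_region.items()
    }
```

Plotting the resulting (k, score) pairs per region gives one curve per region, which is the shape of the plot described above.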

Two clusters of regions are noticeable in this plot: one in which the accuracy values change little with the number of terrorist groups, and another in which they oscillate as the number of terrorist groups increases. The latter cluster comprises the regions "Australasia & Oceania", "Eastern Europe", "Central Asia" and "East Asia". Comparing these results with the number of terrorist attacks in these regions (section 2.2 of the wiki from last week), we conclude that the oscillation seems to be caused by the low number of terrorist attacks in these regions, which makes the accuracy values depend heavily on the "training" - "validation" - "testing" split.

Figure 7 - Plot of the model accuracy across different regions (the ones from Figure 6 except "Australasia & Oceania", "Eastern Europe", "Central Asia" and "East Asia") when varying the number of terrorist groups to distinguish between.

Figure 7 shows the same plot with the "oscillating cluster" removed. There is a subtle decrease in accuracy as the number of groups increases up to approximately 10 groups, which is more pronounced in North America. From 10 groups on, the accuracy values seem to stabilize (except in North America).

The results obtained appear to be very good, as almost all of the regions in Figure 7 achieved an accuracy of at least 0.7, even when the number of groups is high.

3 - World bank dataset

To add demographic data to our dataset, we found data provided by the World Bank on the number of people living in each country, the fuel consumption and the national income. For the purpose of our work, we selected the data between 1970 and 2015. With these new features, our plan for next week is to:

add these features to the nkill prediction regression model.
do some more exploratory analysis.
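Adding these features to the incident data amounts to a join on country and year. The sketch below shows one way to do it; the field names (`country`, `year`, `population`, `fuel_consumption`, `national_income`), the function name and the data layout are all hypothetical, and the actual World Bank files would first need to be parsed into this shape.

```python
FIRST_YEAR, LAST_YEAR = 1970, 2015  # period selected from the World Bank data

# Hypothetical names for the three indicators mentioned above.
FIELDS = ("population", "fuel_consumption", "national_income")


def add_indicators(incidents, indicators, fields=FIELDS):
    """Left-join yearly country indicators onto incident rows.

    `incidents` is a list of dicts with "country" and "year" keys;
    `indicators` maps (country, year) to a dict of indicator values.
    Missing indicators are stored as None so that no incident is dropped.
    """
    enriched = []
    for row in incidents:
        in_range = FIRST_YEAR <= row["year"] <= LAST_YEAR
        extra = indicators.get((row["country"], row["year"]), {}) if in_range else {}
        merged = dict(row)  # copy, so the input rows are not modified
        for field in fields:
            merged[field] = extra.get(field)
        enriched.append(merged)
    return enriched
```

Keeping unmatched incidents with empty indicator values leaves the decision on how to handle missing data to the modelling step.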

4 - Weekly Presentation

https://docs.google.com/presentation/d/1IjsYMErD8Y69cW_of0e7M-JQgaChe7Z5S5RQw7QWSzU/edit?usp=sharing