Business understanding - Achaad/data-mining-terrorism GitHub Wiki
Identifying business goals
Background
The GTD (Global Terrorism Database) is a dataset of terrorist acts recorded worldwide from 1970 to 2017. START (National Consortium for the Study of Terrorism and Responses to Terrorism) collected it in several phases from public, unclassified, high-quality source materials: mainly media articles and electronic news archives and, to a lesser extent, existing datasets, secondary sources such as books and journals, and legal documents. An event is recorded in the GTD only if at least one such high-quality source documents it. The dataset is large, with over 180,000 records and more than 120 features, which offers many opportunities to extract patterns and to visualize historical and geographical information on plots and maps. We chose this subject because we expect it to give informative and interesting results, and because the amount of data to process and analyze makes it a challenge.
Business goals
- Identify important patterns in the data.
- Visualize the extracted historical and geographical information.
- Link the 5 most interesting patterns to their historical background.
Success criteria
Our project is not meant to benefit a business; its value is purely informative, for anyone interested in the results. Our success criteria are therefore tied to the knowledge gained from the project. We deem the project successful if:
- We have used 80% of the features that we originally planned to use in the analysis.
- We have extracted interesting patterns.
- We have visualizations of the important patterns.
Situation assessment
Inventory of resources
- The GTD data. This is our main dataset containing data for every year except 1993.
- The GTD+ data. This is the data for the year 1993 and is found at the end of GTD’s documentation file. This contains only seven features, which can be used for a smaller part of the analysis.
- Jupyter Notebook with Python. We will use Python to analyze the data.
- HPC. If necessary for the analysis, we have access to the university’s computational resources at HPC.
Requirements, assumptions, and constraints
The data is publicly available for non-commercial use, so it places no legal or security constraints on our work. We treat the business success criteria as the requirements for acceptable finished work.
Risks and contingencies
- Risk: Running an algorithm on a PC takes more than 5 hours.
- Solution: We run our script on the HPC. Most of the Python libraries we need are already available there, and we can download any that are missing.
- Risk: The analysis is biased because (1) the 1993 data is partial and (2) the data collection process differed between phases, which affected the number of terrorist acts recorded per year. For example, 2012, the first year of data collected under the new process, shows a dramatic increase in the total number of worldwide terrorist attacks over 2011. Although this increase likely reflects recent patterns of terrorism, it is also partly a result of the improved efficiency of the data collection process.
- Solution: We exclude the 1993 data from part of the analysis and analyze the years that fall under different phases of the data collection process separately to avoid this bias.
- Risk: Some features were not collected before the year 1997.
- Solution: We analyze features that are present only in the data after 1997 separately from the data before that year.
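The phase-wise split described in the solutions above could look roughly like the following. This is a minimal sketch under stated assumptions, not our final implementation: the year column name `iyear`, the phase labels, and the exact phase boundaries are assumptions for illustration (the text above only says that some features are missing before 1997 and that a new collection process started in 2012), and records are assumed to be plain dictionaries.

```python
from collections import defaultdict

# Assumed phase boundaries, for illustration only. The real boundaries
# should be taken from the GTD documentation.
PHASES = [
    ("up-to-1997", 1970, 1997),
    ("1998-to-2011", 1998, 2011),
    ("2012-onward", 2012, 2017),
]

# 1993 is only partially covered (GTD+), so it is left out of this split.
EXCLUDED_YEARS = {1993}


def phase_for_year(year):
    """Return the label of the phase that contains `year`, or None."""
    for name, first, last in PHASES:
        if first <= year <= last:
            return name
    return None


def split_by_phase(records, year_key="iyear"):
    """Group records by collection phase, skipping excluded years."""
    groups = defaultdict(list)
    for rec in records:
        year = rec[year_key]
        if year in EXCLUDED_YEARS:
            continue
        phase = phase_for_year(year)
        if phase is not None:
            groups[phase].append(rec)
    return dict(groups)
```

Each group can then be analyzed on its own, so that counts from years collected under different processes are never compared directly.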
Terminology
- GTD - the Global Terrorism Database; the analyzed dataset.
- GTD+ - the additional data for 1993, found in the documentation.
- START - the National Consortium for the Study of Terrorism and Responses to Terrorism; owner and collector of the dataset.
- A terrorist attack - the threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation.
Costs and benefits
As we have no client for our project, there are no financial costs or benefits involved. We hope the project will benefit anyone who wishes to gain more insight into the historical tendencies of terrorism worldwide, whether journalists, historians, teachers, students, or politicians.
Data mining goals
Goals
- Visualize the distribution of incidents over time
- Find correlations between features in the dataset
- Extract the coordinates of the incidents
- Extract data about incidents that happened during selected historical events
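As a small illustration of the first goal, the yearly distribution of incidents can be obtained by counting records per year. The sketch below assumes each record is a dictionary with a year field named `iyear`; that name is an assumption and should be adjusted to the actual column.

```python
from collections import Counter


def incidents_per_year(records, year_key="iyear"):
    """Count incidents per year and return the counts ordered by year."""
    counts = Counter(rec[year_key] for rec in records)
    return dict(sorted(counts.items()))
```

The resulting mapping of year to count can be passed directly to a plotting library to draw a bar or line chart.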
Success criteria
We deem the data mining successful if:
- We have found at least 3 different correlations in the dataset
- We have extracted the coordinates of 90% of the incidents
- We have extracted data describing 5 major historical events
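The coordinate criterion can be checked mechanically. The following is a hedged sketch, assuming records are dictionaries whose coordinate fields are named `latitude` and `longitude` (assumed names) and that a missing coordinate appears as `None`, an empty string, or NaN.

```python
import math


def has_coordinates(rec, lat_key="latitude", lon_key="longitude"):
    """Return True if both coordinate fields hold a usable number."""
    for key in (lat_key, lon_key):
        value = rec.get(key)
        try:
            number = float(value)
        except (TypeError, ValueError):
            # None, empty strings, and other non-numeric values end up here.
            return False
        if math.isnan(number):
            return False
    return True


def coordinate_coverage(records):
    """Return the fraction of records that have both coordinates."""
    records = list(records)
    if not records:
        return 0.0
    return sum(has_coordinates(r) for r in records) / len(records)
```

The criterion is then met when `coordinate_coverage(records) >= 0.9`.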