Data understanding - Achaad/data-mining-terrorism GitHub Wiki

Gathering data

Data Requirements

  • No direct requirements for the content as our goal is to explore the given data.
  • Time range: 1970 - 2017
  • Data formats: .csv

Data availability

The data is available at https://www.kaggle.com/START-UMD/gtd and is provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START).

Selection criteria

From all the fields stated in section 2.2 we exclude the following: INT_LOG, INT_IDEO, INT_MISC, INT_ANY, scite1, scite2, scite3, dbsource.
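A minimal sketch of this selection step, using only Python's standard library. The one-row sample and the helper name `drop_excluded` are illustrative assumptions, not the project's actual code; the real input would be the CSV file from Kaggle.

```python
import csv
import io

# Columns excluded by the selection criteria above.
EXCLUDED = {"INT_LOG", "INT_IDEO", "INT_MISC", "INT_ANY",
            "scite1", "scite2", "scite3", "dbsource"}


def drop_excluded(rows, excluded=EXCLUDED):
    """Yield each row as a dict without the excluded columns."""
    for row in rows:
        yield {key: value for key, value in row.items() if key not in excluded}


# Tiny made-up sample standing in for the real CSV file.
sample = io.StringIO(
    "eventid,iyear,INT_ANY,dbsource\n"
    "201701150003,2017,0,example\n"
)
selected = list(drop_excluded(csv.DictReader(sample)))
print(selected)
```

Filtering row by row keeps memory use low, which may matter for a file with more than 180000 rows.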

Data description

Source: National Consortium for the Study of Terrorism and Responses to Terrorism (START).

Format: comma-separated .csv file

Number of cases: 181691

Number of fields: 125

Features:

  • Case ranges:
  • Fields:
  • eventid - GTD ID. The first 8 digits are the date in ‘yyyymmdd’ format; the last 4 digits are a sequential number for the given day
  • iyear - Year of the incident
  • imonth - Month of the incident
  • iday - Day of the incident
  • approxdate - Approximate date of the event when the exact one is unknown
  • extended - Boolean that denotes whether the incident lasted longer than one day
  • resolution - Date of the resolution of the incident if it was extended
  • summary - Textual description of the incident
  • crit1 - Boolean that denotes whether the incident was aimed at a political, economic, religious, or social goal
  • crit2 - Boolean that denotes whether there was an intention to coerce or to publicise a message to a larger audience
  • crit3 - Boolean that denotes whether the incident was outside international humanitarian law
  • doubtterr - Denotes if there is a doubt that the incident meets all the criteria for inclusion
  • alternative - Applies only if the incident does not meet all the criteria. 1 = Insurgency; 2 = Other Crime Type; 3 = Inter/Intra-Group Conflict; 4 = Lack of Intentionality; 5 = State Actors. This field is presently only systematically available for incidents occurring after 1997.
  • multiple - Boolean that denotes whether the incident is part of a multiple incident
  • related - List of the IDs of the related incidents
  • country - Codes of the countries
  • country_txt - String with the name of the country
  • region - Codes of the regions
  • region_txt - String with the name of the region
  • provstate - Administrative region where the incident occurred
  • city - Name of the city where the incident occurred
  • vicinity - Boolean that denotes whether the incident occurred in the vicinity of the city
  • location - Additional information about the location
  • longitude - Longitude of the incident
  • latitude - Latitude of the event
  • specificity - 1 = event occurred in city/village/town and lat/long is for that location; 2 = event occurred in city/village/town and no lat/long could be found, so coordinates are for centroid of smallest subnational administrative region identified; 3 = event did not occur in city/village/town, so coordinates are for centroid of smallest subnational administrative region identified; 4 = no 2nd order or smaller region could be identified, so coordinates are for center of 1st order administrative region; 5 = no 1st order administrative region could be identified for the location of the attack, so latitude and longitude are unknown
  • attacktype1 - General method of attack. 1 = assassination; 2 = armed assault; 3 = bombing/explosion; 4 = hijacking; 5 = hostage taking (barricade incident); 6 = hostage taking (kidnapping); 7 = facility/infrastructure attack; 8 = unarmed assault; 9 = unknown
  • attacktype1_txt - String with the attack type
  • attacktype2 - Second method of attack, coded the same way as attacktype1
  • attacktype3 - Third method of attack, coded the same way as attacktype1
  • success - Whether the attack was successful or not
  • suicide - Whether there is evidence that the perpetrator did not intend to escape from the attack alive
  • weaptype1 - Type of the first weapon
  • weaptype1_txt - String with the type of the first weapon
  • weapsubtype1 - Subtype of the first weapon
  • weapsubtype1_txt - String with the subtype of the first weapon
  • weaptype2 - Type of the second weapon
  • weaptype2_txt - String with the type of the second weapon
  • weapsubtype2 - Subtype of the second weapon
  • weapsubtype2_txt - String with the subtype of the second weapon
  • weaptype3 - Type of the third weapon
  • weaptype3_txt - String with the type of the third weapon
  • weapsubtype3 - Subtype of the third weapon
  • weapsubtype3_txt - String with the subtype of the third weapon
  • weaptype4 - Type of the fourth weapon
  • weaptype4_txt - String with the type of the fourth weapon
  • weapsubtype4 - Subtype of the fourth weapon
  • weapsubtype4_txt - String with the subtype of the fourth weapon
  • weapdetail - Textual description of the weapon
  • targtype1 - Type of the first target
  • targtype1_txt - String with the type of the first target
  • targsubtype1 - Subtype of the first target
  • targsubtype1_txt - String with the subtype of the first target
  • corp1 - Name of the corporate entity targeted
  • target1 - Name of the specific target
  • natlty1 - Code of the nationality of the first target
  • natlty1_txt - String with the nationality of the first target
  • targtype2 - Type of the second target
  • targtype2_txt - String with the type of the second target
  • targsubtype2 - Subtype of the second target
  • targsubtype2_txt - String with the subtype of the second target
  • corp2 - Name of the corporate entity targeted
  • target2 - Name of the specific target
  • natlty2 - Code of the nationality of the second target
  • natlty2_txt - String with the nationality of the second target
  • targtype3 - Type of the third target
  • targtype3_txt - String with the type of the third target
  • targsubtype3 - Subtype of the third target
  • targsubtype3_txt - String with the subtype of the third target
  • corp3 - Name of the corporate entity targeted
  • target3 - Name of the specific target
  • natlty3 - Code of the nationality of the third target
  • natlty3_txt - String with the nationality of the third target
  • gname - Name of the group that carried out the attack
  • gsubname - Additional details of the name of the perpetrator
  • gname2 - Name of the second group that carried out the attack
  • gsubname2 - Additional details of the name of the second perpetrator
  • gname3 - Name of the third group that carried out the attack
  • gsubname3 - Additional details of the name of the third perpetrator
  • guncertain1 - Whether the information reported by sources about the perpetrator is based on speculation
  • guncertain2 - Whether the information reported by sources about the perpetrator is based on speculation
  • guncertain3 - Whether the information reported by sources about the perpetrator is based on speculation
  • individual - Whether the attack was carried out by an individual
  • nperps - Number of perpetrators
  • nperpcap - Number of perpetrators captured
  • claimed - Whether there was a claim of responsibility
  • claimmode - Mode for claim of responsibility
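  • claimmode_txt - String with the mode for claim of responsibility
  • compclaim - Whether there were competing claims of responsibility
  • claim2 - Whether the second group claimed responsibility
  • claimmode2 - Mode for claim of responsibility by the second group
  • claim3 - Whether the third group claimed responsibility
  • claimmode3 - Mode for claim of responsibility by the third group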
  • motive - Motive for the incident, systematically available only since 1997
  • nkill - Total number of fatalities
  • nkillus - Total number of U.S. citizens killed in attack
  • nkillter - Number of perpetrator fatalities
  • nwound - Total number of injured
  • nwoundus - Total number of U.S. citizens injured
  • nwoundte - Number of perpetrators injured
  • property - Whether property was damaged
  • propextent - Code of the extent of property damage
  • propextent_txt - String with the extent of property damage
  • propvalue - U.S. dollar amount of property damage
  • propcomment - Non-monetary damage may be described here
  • ishostkid - Whether the victims were taken hostage or kidnapped
  • nhostkid - Total number of hostages/kidnapping victims
  • nhostkidus - Number of U.S. hostages
  • nhours - Length of kidnapping or hostage incident in hours
  • ndays - Length of kidnapping or hostage incident in days
  • divert - Country that kidnappers diverted to
  • kidhijcountry - Country of kidnapping resolution
  • ransom - Whether the incident involved a demand of monetary ransom
  • ransomamt - The amount of U.S. dollars demanded
  • ransomamtus - Ransom amount demanded from U.S. sources
  • ransompaid - Total ransom amount paid
  • ransomnote - Specific details about ransom
  • hostkidoutcome - Code of the kidnapping outcome
  • hostkidoutcome_txt - String with the kidnapping outcome
  • nreleased - Number of hostages released/escaped/rescued
  • addnotes - Additional notes
  • INT_LOG - Whether the attack was logistically international
  • INT_IDEO - Whether the attack was ideologically international
  • INT_MISC - Whether the attack was miscellaneous international
  • INT_ANY - Whether the attack was international on any of the dimensions
  • scite1 - Sources of citation
  • scite2 - Sources of citation
  • scite3 - Sources of citation
  • dbsource - The original data collection effort
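The `eventid` layout described in the field list (8 date digits followed by a 4-digit daily sequence number) can be unpacked with a short helper. This is a sketch only: the name `parse_eventid` and the sample ID are hypothetical, and the parts are returned as plain integers rather than a date object because the field list does not say whether the month and day are always valid calendar values.

```python
from typing import NamedTuple


class EventId(NamedTuple):
    year: int
    month: int
    day: int
    sequence: int


def parse_eventid(eventid) -> EventId:
    """Split a 12-digit GTD ID into its date parts and daily sequence number."""
    text = str(eventid)
    if len(text) != 12 or not text.isdigit():
        raise ValueError(f"expected 12 digits, got {text!r}")
    return EventId(int(text[:4]), int(text[4:6]), int(text[6:8]), int(text[8:]))


# Hypothetical ID: third recorded incident on 15 January 2017.
print(parse_eventid(201701150003))
```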

A description of the codes is available at https://www.start.umd.edu/gtd/downloads/Codebook.pdf. The data includes all the necessary information. It contains more than we require, so it has to be filtered. The number of cases is large enough for analysis. Data for the year 1993 is missing; however, some simple country-level statistics are available on pp. 63-64 of the same codebook.
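Coded fields can be turned into readable labels with a small lookup table. The sketch below does this for the `alternative` field, using the labels from the field list above; the function name `decode_alternative` is hypothetical, and accepting values such as "2.0" is an assumption about how numeric codes may appear after loading the CSV.

```python
# Labels for the `alternative` field, copied from the field list above.
ALTERNATIVE_LABELS = {
    1: "Insurgency",
    2: "Other Crime Type",
    3: "Inter/Intra-Group Conflict",
    4: "Lack of Intentionality",
    5: "State Actors",
}


def decode_alternative(raw):
    """Return the label for a raw `alternative` value, or None if empty or unknown."""
    text = str(raw).strip()
    if not text:
        return None
    try:
        code = int(float(text))  # tolerate codes stored as "2.0"
    except (ValueError, OverflowError):
        return None
    return ALTERNATIVE_LABELS.get(code)


print(decode_alternative("2.0"))
print(decode_alternative(""))
```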

Exploring data

  • Total number of incidents is 181691
  • iyear - the mean year of all incidents is 2002.64 and the median is 2009. This implies that about half of all the recorded incidents occurred in 2009 or later.
  • imonth & iday - all values seem to be normal and indicate no errors
  • approxdate - 9239 incidents do not have an exact date of occurrence, and among them there are only 2244 unique values
  • extended - 8239 of all the incidents lasted longer than a day
  • country - incidents occurred in 205 unique countries
  • latitude and longitude - these values are known only for 177135 incidents.
  • crit1 - only 2084 incidents do not satisfy the first criterion
  • crit2 - 1255 incidents do not satisfy the second criterion
  • crit3 - 22590 incidents do not satisfy the third criterion
  • crit1, crit2, crit3 - every incident satisfies at least one of these criteria
  • doubtterr - 13874 incidents occurred before 1997, when this information was not collected. Of the remaining ones, 299001 incidents might not meet all the required criteria. They are possibly not acts of terrorism and should be excluded when analysing the data as a whole
  • multiple - this statistic was collected only after the year 1997. 25032 incidents are linked to other incidents
  • success - only 20059 incidents out of all were unsuccessful
  • motive - statistics collected only after the year 1997. Only 50561 incidents have explicitly stated motives. Is some data missing?
  • nperps - If the number of perpetrators is unknown, the value of the field is -99. The number is known for only 28358 incidents. Is some data missing? The maximum value is 25000, which seems implausible. Is the data incorrect? The mean cannot be trusted until this outlier is dealt with. There is also an incident with the value -9, which seems to be incorrect.
  • nperpcap - If the number of captured perpetrators is unknown, the value of the field is -99. As with nperps, there are incidents with the value -9, which seems to be incorrect. The highest number of captured perpetrators is 406, which should be investigated, as it is too high to be trusted blindly.
  • nkill - The number of fatalities is known for only 171378 incidents. The highest number of fatalities in a single incident is 1570.

Verifying data quality

Not all the required data is present. Many statistics were not collected for the year 1993. In addition, because data collection methods improved in the 2010s, there is significantly more data for the last seven years than for the preceding period. Because of this, we may need to analyse different time periods separately.

Some fields contain outliers that should be removed. The incidents with these values should be examined closely first, to avoid mistaking genuine data for outliers. Some of the data contains codes that have no meaning in the context of the field and are probably errors (-9 instead of -99). These values should be corrected before the analysis.

The data has many fields that are unnecessary for our goals, and with more than 180000 recorded incidents some of the algorithms could put a heavy load on the machine. We should reduce the data to avoid this problem.
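Analysing time periods separately could start from a simple period label per incident. The sketch below is only an illustration: the boundary year 2011 is an assumption derived from "the last seven years" of the 1970 - 2017 range, the function name `period_of` is hypothetical, and the years are made up.

```python
from collections import Counter

# Assumed boundary: the last seven years of the 1970 - 2017 range.
RECENT_START = 2011


def period_of(year, recent_start=RECENT_START):
    """Label a year as belonging to the earlier or the recent period."""
    return "recent" if year >= recent_start else "earlier"


# Made-up iyear values, not real GTD rows.
years = [1975, 1992, 2010, 2011, 2014, 2017]
counts = Counter(period_of(year) for year in years)
print(counts)
```

Comparing the per-period counts on the real data would show how strongly the recent years dominate before deciding whether separate analyses are needed.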