Week 02 (W47 Nov23) Crimes in the UK - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Note: For unknown abbreviations and terms, please consult the glossary. If anything is missing, please create an issue or write us an email and we will add it.
Summary
During this week, we explored the crimes data and integrated new datasets. We learned more about i) the feature categories and values (e.g. the different outcome and crime types), ii) the data collection process of the UK Police, and iii) the accuracy of the data, especially the geolocations. In addition, we complemented the crimes data with aggregated “stop and search” information and Points of Interest in the UK. We then computed descriptive statistics about crimes with respect to the newly integrated information to better understand the crimes data and to explore correlations.
Results
This section presents descriptive statistics about each of the tables “Crimes” and “Stop and search” separately as well as information about further potential enrichment datasets. Salient properties will be highlighted with a different background color to serve as a finding and a reference for the team in later stages of the course.
Data in general
Location anonymization
In order to protect the privacy of the victims, the location of each crime in the data is anonymized and therefore “represents only an approximation of location of the crime and not the exact place itself. In the anonymization process, the exact crime locations were compared against a master list of 750k anonymous points (center points of streets, public place, etc.) to find the nearest map point. The coordinates of the actual crime are then replaced with the coordinates of the map point. If the nearest map point is more than 20km away, the coordinates are zeroed out” (Source).
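The snapping step described in the quote can be illustrated with a minimal sketch. This is not the UK Police implementation: the function names, the use of the haversine distance, and the tiny two-point master list are assumptions made for illustration; only the "nearest map point" rule and the 20km threshold come from the source.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SNAP_DISTANCE_KM = 20.0  # threshold quoted in the source


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def anonymize(crime, map_points):
    """Replace a crime's (lat, lon) with the nearest map point, or zero it out."""
    nearest = min(map_points, key=lambda p: haversine_km(crime[0], crime[1], p[0], p[1]))
    if haversine_km(crime[0], crime[1], nearest[0], nearest[1]) > MAX_SNAP_DISTANCE_KM:
        return (0.0, 0.0)  # no map point within 20km: coordinates are zeroed out
    return nearest


# Hypothetical master list with two map points (the real list has about 750k).
map_points = [(51.5074, -0.1278), (53.4808, -2.2426)]
snapped = anonymize((51.51, -0.13), map_points)   # close to the first map point
zeroed = anonymize((56.0, -4.0), map_points)      # far from both map points
```

A linear scan over 750k points per crime would be slow in practice; a spatial index would be the natural replacement, but the logic stays the same.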
Further Location inaccuracy
Inconsistent geocoding policies across police forces also affect the accuracy of the location data, especially for crimes whose exact location is unknown (e.g. the victim is not sure where it happened): “Estimates of geocoding accuracy in different forces range from 60% to 97%.” (Source)
Depending on how inaccurate the published locations are, the anonymization process could undermine our approach of learning about the environment around crimes by aggregating Stop-and-Searches and Points-of-Interest within 500m of each crime. We therefore contacted the UK Police and asked for the average distance by which crime locations are moved to the chosen map point.
Missing outcome data
Some Police forces like the British Transport Police and the Police Service of Northern Ireland don’t provide the Home Office with outcome data. However, there are plans for changing this in the near future.
Double counting of ASB and crime
UK Police suspects that six police forces are duplicating certain types of anti-social behaviour (ASB) incidents in their uploads.
Court result matching
“There is no unique identifier for crimes that runs from the police service to the CPS and onwards to the Courts. This makes trying to track a crime through the whole Criminal Justice Service automatically almost impossible. We use a 'fuzzy matching' process to try and achieve this, with success rates between 19% and 97% depending on where in the country the crime happened” (Source)
Statistics
Crime type ~ average number of Stop-and-Searches and Points-of-Interest within 500m
Outcome type ~ average number of Stop-and-Searches and Points-of-Interest within 500m
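A grouped average of this kind can be computed as in the following sketch. The field names (`crime_type`, `poi_500m`) and the three rows are hypothetical placeholders, not values from our tables.

```python
from collections import defaultdict
from statistics import mean


def average_by_key(rows, key, value):
    """Return {group: mean of `value`} for rows grouped by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical rows: one per crime, with its precomputed POI count within 500m.
rows = [
    {"crime_type": "Burglary", "poi_500m": 2},
    {"crime_type": "Burglary", "poi_500m": 4},
    {"crime_type": "Shoplifting", "poi_500m": 9},
]
averages = average_by_key(rows, key="crime_type", value="poi_500m")
```

The same helper works for the outcome type by passing a different `key`.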
Work Log
This section details the team's activities and tasks this week: the approach we followed, the challenges we faced, implementation details, and the justification for our decisions.
Overruled Neighborhood Statistics
We decided not to include the LSOA-based neighborhood statistics because we found that they are not up to date (most only go up to 2011).
Use of Points-of-Interest
As an enrichment to the crimes data, we found a source that provides around 191,000 POIs all over the UK in CSV format, broken down by the categories listed below [1]. We downloaded the dataset and loaded it into the database using a Python script.
POI categories:
- Wifi hotspot
- Fuel: Gas station
- Shopping consumer electronics: electrical retailers selling home electronics and household appliances
- Transport: airports and train stations
- Accommodation: Hotels
- Entertainment: Casinos, Cinemas
- Banks
- Food and drink: restaurants, dining pubs
- Government agency: Tax office, passport office
- Public service buildings: Police station, fire station
- Landmark: Mountains, hills, waterfalls
- Attraction: Aquarium, museums
- Book shop
- Special Interest: Madame Tussaud, Church bell
- Community: churches, Universities
- Courier: DHL, Express
- Healthcare: Hospital, practice
- Sports center: sports clubs
- Pharmacy
- Shopping supermarkets: Lidl, Aldi
- Association
- Holiday parks
- ATM
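Loading one category's POI CSV file into a database table could look like the sketch below. It is not our actual script: the column names (`longitude`, `latitude`, `name`), the `poi` table layout, and the in-memory SQLite database (used here as a stand-in for PostgreSQL so that the example is self-contained) are all assumptions.

```python
import csv
import io
import sqlite3


def load_pois(csv_file, conn, category):
    """Insert the rows of one category's CSV file into a `poi` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS poi ("
        "name TEXT, category TEXT, longitude REAL, latitude REAL)"
    )
    rows = [
        (r["name"], category, float(r["longitude"]), float(r["latitude"]))
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO poi VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical two-row file standing in for a downloaded category CSV.
sample = io.StringIO(
    "longitude,latitude,name\n"
    "-0.1278,51.5074,Example ATM\n"
    "-2.2426,53.4808,Another ATM\n"
)
conn = sqlite3.connect(":memory:")
inserted = load_pois(sample, conn, category="ATM")
```

Storing the category as a column keeps all POIs in one table, which makes it easy to count POIs per category around each crime later on.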
Computing Number of Stop-and-Searches and Points-of-Interest surrounding Crimes
To find the stop and search incidents and POIs within 500m of each crime, we used PostGIS, a spatial database extender for PostgreSQL. For performance reasons, we used the L-infinity norm as the distance metric, so that we could build R-tree indices over the coordinates of the Stop-and-Search and Points-of-Interest points. This allowed a fast search of the surrounding points for every crime entry, although it still took many hours on our local machines.
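The effect of the L-infinity norm can be shown with a small Python sketch. It is not the PostGIS query we ran: it assumes coordinates already projected to metres, and it uses a plain grid of square cells as a stand-in for an R-tree index. All names are illustrative.

```python
import math
from collections import defaultdict

RADIUS_M = 500.0


def build_grid(points, cell=RADIUS_M):
    """Bucket (x, y) points into square cells; a simple stand-in for an R-tree."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid


def count_within_box(grid, x, y, radius=RADIUS_M, cell=RADIUS_M):
    """Count points whose L-infinity (Chebyshev) distance to (x, y) is <= radius."""
    cx, cy = int(x // cell), int(y // cell)
    reach = math.ceil(radius / cell)  # number of neighbouring cells to inspect
    count = 0
    for i in range(cx - reach, cx + reach + 1):
        for j in range(cy - reach, cy + reach + 1):
            for px, py in grid.get((i, j), ()):
                if max(abs(px - x), abs(py - y)) <= radius:
                    count += 1
    return count


# Hypothetical POI positions in metres, relative to a crime at the origin.
pois = [(100, 100), (550, 100), (100, 700), (450, 450), (-300, 0)]
grid = build_grid(pois)
nearby = count_within_box(grid, 0, 0)
```

Under the L-infinity norm the search region is a square with a side length of 1km rather than a circle, so a point such as (450, 450) is counted even though its Euclidean distance is about 636m. This is the trade-off accepted in exchange for a simple bounding-box lookup.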
Visualization with Mapbox
While searching for ways to visualize the data on a map, we came across the platform Mapbox. We were able to set up a Node.js server and attach the map to an HTML page. We also found an existing library for clustering geo points according to the zoom level of the map and could paint about 50,000 points on a map. We have not yet decided whether to go with Mapbox, as we are not sure whether an interactive map is the best visualization approach for our use case. If it is, we still have to figure out whether Mapbox is the right tool for the job.
Future Plans
- Gather more information about the accuracy of the data in cooperation with the UK Police
- Cluster counties into regions to compute region-specific statistics
- Data cleaning and preparation
- Implement an informative visualization approach
Links and Sources
[1] http://www.pocketgpsworld.com/modules.php?name=POIs
[2] https://www.cps.gov.uk/victims_witnesses/going_to_court/sentencing.html