Week 01 (W46 Nov16) Crimes in the UK - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Week 01 (W46 Nov16) Crimes in the UK

Note: For unknown abbreviations and terms, please consider the glossary. If anything is missing, please just create an issue or write us an email and we will add it.

Content

Dataset - General Information
1. Data Organization
2. Data Description
Results
Ideas for Data Enrichment
Working Log
Plan for Next Week
Sources

Dataset - General Information

Data Organization

The dataset is structured in 70 folders. Each folder represents the crime data for a single month of the respective year, broken down by territorial police force. The data range starts from december 2010 till September 2016 . Each folder comprises the following three types of data sets for each police force:

Street-level crime
Outcome
Stop and search information

Data Description

The total size of the data is around 13.2 GB.
Format of the source files is csv.
Number of attributes:
- Street: 12
- Outcome: 10
- Stop and search: 14
- Crimes (Street joined with Outcome): 15
Number of rows:
- Street: 35,000,000
- Outcome: 17,623,000
- Stop and search: 584,000
- Crimes (Street joined with Outcome): 38,550,000

Results

This section represents descriptive statistics about each of the tables “Crimes” and “Stop and search” separately as well as information about further potential enrichment datasets. Salient properties will be highlighted with a different background colour to serve as a finding and a reference for the team in later stages of the course.

Data Quality with Respect to Outcome

Crimes without outcome type: 20,765,000 (53.87%)
Street without last_outcome_category: 18,535,000 (52.96%)
S&S without outcome: 17,000 (3.07%)

Crimes (Join of Outcome and streets)

Frequency per outcome type

Crime type ~ Frequency

Month ~ Frequency

Week day ~ Frequency

Countie ~ Frequency in percent

![Crimes with outcome] (https://github.com/chaoran-chen/data-mining-lab-ws1617-crimes-in-uk/blob/master/wiki/week-01/images/2016_09_drugs_robbery_violence.png)

Stop and search

Age range ~ frequency in percent

Gender ~ frequency in percent

Self defined ethnicity ~ frequency in percent

Ideas for Data Enrichment

Neighborhood statistics using LSOAs: Lower Layer Super Output Areas (LSOA) is code given by the office of national statistics to represent a wide range of geographical areas in the UK. It is used for reporting of small area statistics. One LSOA is around 4 - 6 output area siwh a population of around 1500. This website provides the following statistics about LSOAs:
- Education attainment, enrolment to higher education, number of students
- Health and care
- People and Society: Income and lifestyle
- Work deprivation
- ...
Points of Interest: The following website [3] provides datasets with different POIs in the UK like:
- ATM
- Attraction
- Bank
- Entertainment
- Food & Drink
- Holiday parks
- Transport
- …
Weather data: A possible attribute to be considered would be the weather on the day of the crime. Here are some links to potential apis and data for historical weather reports[4]

Working Log

This section provides detailed information about the activities and tasks the team achieved this week in terms of our followed approach, faced challenges, implementation details, and justification for our decisions.

Reading row data into database

We wrote a python script for reading the row data from the files structure described above into MySQL database. We encountered a problem with the first version of the script while reading tuples with cells which includes quotation marks. As a result we lost about 20.000 tuples. However, we were able to solve this problem in the latest version of the script by escaping the quotation marks in the cells and were therefore able to read the full dataset into the database.

Problems with MySQL

We faced performance issues while executing SQL queries for statistical purposes in MySQL. Furthermore we were not able to join the two tables “Outcome” and “street” on MySQL. As a potential solution for this issue, we tried to access database systems on the student server but we didn’t have the rights to use them.

Join of Outcome and Street

Given the absolute similarity between the two tables “Street” and “Outcome”, we decided to join the two tables to conduct more informative, redundancy free, statistics. After failing to execute the join on our machines with MySQL, we decided to give it a try with another Database system: It finally worked with PostgreSQL.

Drawing samples and then throwing them away

After failing to execute SQL queries on the database in MySQL, which we needed for conducting the statistics, we decided to draw smaller dataset to work with. The first sample was for only 6 police forces, the second was for London, and the last was for Cleveland. After solving the db performance issues by working with PostgreSQL, we decided not to use those samples but the whole dataset in order to conduct more representative statistics.

Visualization Tools

We used the data visualization tool named Qlik Sense for plotting the geo data of the crimes on the map. For the other graphs we used R and Excel graphs.

Plan for Next Week

Consolidate the final table
Data cleansing and preparation
Join “Crimes” and “Stop and search” datasets by aggregating information about the nearest “Stop and search” tuples located within a given range in km (not specified yet) from the respective crime
Write the python script for finding the nearest Stop&Search tuples
Further investigation of the possibility of including Neighborhood statistics using LSOAs in the final table
Further investigation of the possibility of including Points of Interest (POI) information in the final table

Link to presentation

Sources

[1] http://www.datadictionary.nhs.uk/data_dictionary/nhs_business_definitions/l/lower_layer_super_output_area_de.asp?shownav=1

[2] http://www.neighbourhood.statistics.gov.uk/dissemination/LeadHome.do%3Fa%3D3%26i%3D1001%26m%3D0%26r%3D1%26s%3D1298463767007%26enc%3D1%26extendedList%3Dtrue%26nav%3DA

[3] http://www.pocketgpsworld.com/tomtompoi.php

[4] http://www.metoffice.gov.uk/public/weather/climate-historic/#?tab=climateHistoric

[5] https://navlost.eu/#/metar